cs.CL - 2023-11-23

Annotation Sensitivity: Training Data Collection Methods Affect Model Performance

paper_url: http://arxiv.org/abs/2311.14212
repo_url: https://github.com/chkern/tweet-annotation-sensitivity
paper_authors: Christoph Kern, Stephanie Eckman, Jacob Beck, Rob Chew, Bolei Ma, Frauke Kreuter
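for: This paper examines how the design of the annotation instrument affects downstream model performance and predictions.
methods: The study collects hate speech and offensive language annotations under five experimental conditions, fine-tunes a BERT model on each resulting dataset, and evaluates performance and predictions on a holdout portion of each condition.
results: Different annotation instrument designs lead to different shares of hate speech and offensive language annotations, as well as differences in model performance and predictions. These results show that instrument design has an important effect on downstream model performance and predictions.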

Abstract
When training data are collected from human annotators, the design of the annotation instrument, the instructions given to annotators, the characteristics of the annotators, and their interactions can impact training data. This study demonstrates that design choices made when creating an annotation instrument also impact the models trained on the resulting annotations. We introduce the term annotation sensitivity to refer to the impact of annotation data collection methods on the annotations themselves and on downstream model performance and predictions. We collect annotations of hate speech and offensive language in five experimental conditions of an annotation instrument, randomly assigning annotators to conditions. We then fine-tune BERT models on each of the five resulting datasets and evaluate model performance on a holdout portion of each condition. We find considerable differences between the conditions for 1) the share of hate speech/offensive language annotations, 2) model performance, 3) model predictions, and 4) model learning curves. Our results emphasize the crucial role played by the annotation instrument which has received little attention in the machine learning literature. We call for additional research into how and why the instrument impacts the annotations to inform the development of best practices in instrument design.

Summary
We collected annotations of hate speech and offensive language in five different conditions of an annotation instrument, and randomly assigned annotators to each condition. We then fine-tuned BERT models on each of the five datasets and evaluated their performance on a holdout portion of each condition. We found considerable differences between the conditions in 1) the share of hate speech/offensive language annotations, 2) model performance, 3) model predictions, and 4) model learning curves. Our results highlight the important role played by the annotation instrument, which has received little attention in the machine learning literature. We recommend further research into how and why the instrument impacts the annotations, in order to develop best practices in instrument design.
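
The evaluation design described here (one model per annotation condition, each scored on a holdout portion of every condition) can be sketched as a small grid. This is a minimal illustration under stated assumptions, not the authors' pipeline: `train_majority` is a toy stand-in for BERT fine-tuning, accuracy stands in for whatever metric is used, and the names `cross_condition_scores`, `splits`, and the condition labels are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]            # (tweet text, binary label)
Model = Callable[[str], int]         # maps a text to a predicted label


def train_majority(train: List[Example]) -> Model:
    """Toy stand-in for fine-tuning: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: majority


def accuracy(model: Model, holdout: List[Example]) -> float:
    """Share of holdout examples whose predicted label equals the annotated label."""
    correct = sum(model(text) == label for text, label in holdout)
    return correct / len(holdout)


def cross_condition_scores(
    splits: Dict[str, Tuple[List[Example], List[Example]]],
    train_fn: Callable[[List[Example]], Model] = train_majority,
    score_fn: Callable[[Model, List[Example]], float] = accuracy,
) -> Dict[Tuple[str, str], float]:
    """Train one model per condition and score it on every condition's holdout.

    `splits` maps a condition name to its (train, holdout) examples.
    The result maps (training condition, test condition) to a score.
    """
    models = {cond: train_fn(train) for cond, (train, _) in splits.items()}
    return {
        (train_cond, test_cond): score_fn(models[train_cond], splits[test_cond][1])
        for train_cond in splits
        for test_cond in splits
    }
```

Reading the grid row by row shows how a model trained under one condition behaves on data annotated under another, which is where annotation sensitivity would show up.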

A Systematic Review of Deep Learning-based Research on Radiology Report Generation

paper_url: http://arxiv.org/abs/2311.14199
repo_url: https://github.com/synlp/rrg-review
paper_authors: Chang Liu, Yuanhe Tian, Yan Song
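for: This paper reviews deep learning research on automatically generating free-text descriptions from medical images, a task intended to improve diagnostic efficiency and reduce doctors' workloads.
methods: The reviewed approaches use deep learning and are organized around image features, report content, and the cross-modal interactions between them.
results: The paper offers a comprehensive review of deep learning-based radiology report generation, comparing different approaches and analyzing future trends.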

Abstract
Radiology report generation (RRG) aims to automatically generate free-text descriptions from clinical radiographs, e.g., chest X-ray images. RRG plays an essential role in promoting clinical automation: it provides practical assistance for inexperienced doctors and alleviates radiologists' workloads. Given this potential, research on RRG has experienced explosive growth in the past half-decade, especially with the rapid development of deep learning approaches. Existing studies perform RRG from the perspective of enhancing different modalities, provide insights on optimizing the report generation process with elaborated features from both visual and textual information, and further facilitate RRG with the cross-modal interactions among them. In this paper, we present a comprehensive review of deep learning-based RRG from various perspectives. Specifically, we first cover pivotal RRG approaches based on the task-specific features of radiographs, reports, and the cross-modal relations between them, then illustrate the benchmark datasets conventionally used for this task along with evaluation metrics, subsequently analyze the performance of different approaches, and finally offer our summary of the challenges and trends in future directions. Overall, the goal of this paper is to serve as a tool for understanding existing literature and inspiring potentially valuable research in the field of RRG.

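Summary
Radiology report generation (RRG) aims to automatically generate free-text descriptions of medical images such as chest X-rays. RRG plays an important role in clinical automation: it can provide practical help to inexperienced doctors and reduce radiologists' workloads. Research on RRG has therefore grown rapidly over the past half-decade, especially with the fast development of deep learning methods. Existing studies approach RRG from different angles, including enhancing individual modalities to optimize report generation and modeling the interaction between visual and textual information. In this paper, we present a comprehensive review of deep learning-based RRG, organized around the task-specific features of radiographs, reports, and the cross-modal relations between them. We first introduce the main RRG approaches, then the benchmark datasets and evaluation metrics commonly used for the task, then analyze the performance of different approaches, and finally summarize challenges and future trends. Overall, the paper aims to help readers understand the existing literature and to inspire potentially valuable research on RRG.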

Question Answering in Natural Language: the Special Case of Temporal Expressions

paper_url: http://arxiv.org/abs/2311.14087
repo_url: None
paper_authors: Armand Stricker
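for: This work aims to use a general question answering approach to answer temporal questions.
methods: The authors use answer extraction, a common question answering approach that answers questions through pattern matching.
results: The evaluation shows that a model trained for pattern matching can be adapted to temporal questions, provided the answers appear directly in the text.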

Abstract
Although general question answering has been well explored in recent years, temporal question answering is a task which has not received as much focus. Our work aims to leverage a popular approach used for general question answering, answer extraction, in order to find answers to temporal questions within a paragraph. To train our model, we propose a new dataset, inspired by SQuAD, specifically tailored to provide rich temporal information. We chose to adapt the corpus WikiWars, which contains several documents on history's greatest conflicts. Our evaluation shows that a deep learning model trained to perform pattern matching, often used in general question answering, can be adapted to temporal question answering, if we accept to ask questions whose answers must be directly present within a text.

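Summary
Although general question answering has been widely studied in recent years, temporal question answering has received less attention. Our work applies answer extraction, a popular approach to general question answering, to find answers to temporal questions within a text. To train our model, we propose a new dataset, inspired by SQuAD and designed to provide rich temporal information. We chose to adapt the WikiWars corpus, which contains documents on history's greatest conflicts. Our evaluation shows that a deep learning model trained for pattern matching, as commonly used in general question answering, can be adapted to temporal question answering, provided the answers to the questions are directly present in the text.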

Searching for Snippets of Open-Domain Dialogue in Task-Oriented Dialogue Datasets

paper_url: http://arxiv.org/abs/2311.14076
repo_url: None
paper_authors: Armand Stricker, Patrick Paroubek
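for: This study aims to bridge the gap between task-oriented dialogue and social (chit-chat) dialogue in order to improve dialogue systems.
methods: The authors use topic modeling and keyword search to check whether the training sets of Schema-Guided Dialogues and MultiWOZ contain chit-chat sequences.
results: These training sets already contain chit-chat sequences, which motivates further research on how chit-chat is combined into task-oriented dialogues.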

Abstract
Most existing dialogue corpora and models have been designed to fit into two predominant categories: task-oriented dialogues portray functional goals, such as making a restaurant reservation or booking a plane ticket, while chit-chat/open-domain dialogues focus on holding a socially engaging talk with a user. However, humans tend to seamlessly switch between modes and even use chitchat to enhance task-oriented conversations. To bridge this gap, new datasets have recently been created, blending both communication modes into conversation examples. The approaches used tend to rely on adding chit-chat snippets to pre-existing, human-generated task-oriented datasets. Given the tendencies observed in humans, we wonder however if the latter do not already hold chit-chat sequences. By using topic modeling and searching for topics which are most similar to a set of keywords related to social talk, we explore the training sets of Schema-Guided Dialogues and MultiWOZ. Our study shows that sequences related to social talk are indeed naturally present, motivating further research on ways chitchat is combined into task-oriented dialogues.

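Summary
Most existing dialogue corpora and models are designed for one of two predominant categories: task-oriented dialogues, which pursue functional goals such as making a restaurant reservation or booking a ticket, and open-domain dialogues, which focus on holding a socially engaging conversation with the user. Humans, however, tend to switch naturally between these two modes and even use chit-chat to enhance task-oriented conversations. To bridge this gap, new datasets have been created that blend both communication modes. Existing approaches usually add chit-chat snippets to pre-existing, human-generated task-oriented datasets. Given how humans behave, we wonder whether the latter datasets do not already contain chit-chat sequences. Using topic modeling and a search based on keywords related to social talk, we show that chit-chat sequences are already present in these training sets, which motivates further research on how chit-chat is combined with task-oriented dialogue.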

Enhancing Task-Oriented Dialogues with Chitchat: a Comparative Study Based on Lexical Diversity and Divergence

paper_url: http://arxiv.org/abs/2311.14067
repo_url: None
paper_authors: Armand Stricker, Patrick Paroubek
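for: This paper studies how adding chitchat can make task-oriented dialogues more diverse and engaging.
methods: It compares three chitchat enhancements and evaluates each in terms of diversity.
results: The study finds that chitchat enhancements can make task-oriented dialogues more diverse and natural, and can reduce repetitive and predictable responses.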

Abstract
As a recent development, task-oriented dialogues (TODs) have been enriched with chitchat in an effort to make dialogues more diverse and engaging. This enhancement is particularly valuable as TODs are often confined to narrow domains, making the mitigation of repetitive and predictable responses a significant challenge. This paper presents a comparative analysis of three chitchat enhancements, aiming to identify the most effective approach in terms of diversity. Additionally, we quantify the divergence between the added chitchat, the original task-oriented language, and chitchat typically found in chitchat datasets, highlighting the top 20 divergent keywords for each comparison. Our findings drive a discussion on future enhancements for augmenting TODs, emphasizing the importance of grounding dialogues beyond the task to achieve more diverse and natural exchanges.

Summary
Recently, task-oriented dialogues (TODs) have been enriched with chitchat to make dialogues more diverse and engaging. This enhancement is particularly valuable as TODs are often confined to narrow domains, making it challenging to mitigate repetitive and predictable responses. This paper presents a comparative analysis of three chitchat enhancements to identify the most effective approach in terms of diversity. Additionally, we quantify the divergence between the added chitchat, the original task-oriented language, and chitchat typically found in chitchat datasets, highlighting the top 20 divergent keywords for each comparison. Our findings drive a discussion on future enhancements for augmenting TODs, emphasizing the importance of grounding dialogues beyond the task to achieve more diverse and natural exchanges.
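
The abstract does not name the specific diversity or divergence measures, so the following is only a sketch of two common choices, assumed here for illustration: distinct-n (the share of unique n-grams) for lexical diversity and Jensen-Shannon divergence between unigram distributions for divergence. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def distinct_n(tokens: List[str], n: int = 1) -> float:
    """Share of unique n-grams among all n-grams; higher means more diverse text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def js_divergence(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Jensen-Shannon divergence (base 2) between two unigram distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint vocabularies.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both token lists must be non-empty")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    divergence = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a      # probability under corpus A
        q = counts_b[word] / total_b      # probability under corpus B
        m = (p + q) / 2                   # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence
```

Under these assumptions, `distinct_n` could be computed per enhanced dataset, and `js_divergence` could compare the added chitchat against the original task-oriented turns or against a chitchat corpus.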

Do VSR Models Generalize Beyond LRS3?

paper_url: http://arxiv.org/abs/2311.14063
repo_url: None
paper_authors: Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Eustache Le Bihan, Haithem Boussaid, Ebtessam Almazrouei, Merouane Debbah
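for: To improve the robustness of visual speech recognition (VSR) models and reduce the risk of overfitting to the Lip Reading Sentences-3 (LRS3) test set.
methods: Following the LRS3 dataset creation process, the authors build a new VSR test set, WildVSR, to evaluate how well current VSR models generalize to new test data.
results: A range of publicly available VSR models perform worse on WildVSR than on the LRS3 test set. The increase in word error rate (WER) is attributed to the models' inability to generalize to slightly harder, in-the-wild lip sequences.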

Abstract
The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its excessively used test set, which is only one hour in duration. To alleviate this issue, we build a new VSR test set named WildVSR, by closely following the LRS3 dataset creation processes. We then evaluate and analyse the extent to which the current VSR models generalize to the new test data. We evaluate a broad range of publicly available VSR models and find significant drops in performance on our test set, compared to their corresponding LRS3 results. Our results suggest that the increase in word error rates is caused by the models' inability to generalize to slightly harder and in-the-wild lip sequences than those found in the LRS3 test set. Our new test benchmark is made public in order to enable future research towards more robust VSR models.

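Summary
The Lip Reading Sentences-3 (LRS3) benchmark has been the focus of intense research in visual speech recognition (VSR) over the past few years. As a result, there is an increased risk of overfitting to its test set, which is only one hour long. To address this, we build a new VSR test set named WildVSR by closely following the LRS3 dataset creation process. We then evaluate and analyze how well current VSR models generalize to WildVSR. We evaluate a range of publicly available VSR models and find clear drops in performance on our test set. Our results suggest that the drop is caused by the models' inability to generalize to the slightly harder, in-the-wild lip sequences in WildVSR. Our new test benchmark will be made public to support future research on more robust VSR models.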
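
The comparison between LRS3 and WildVSR is reported in word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the paper's own scoring script may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row of the table.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion of a reference word
                current[j - 1] + 1,                   # insertion of a hypothesis word
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.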

Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark

paper_url: http://arxiv.org/abs/2311.13987
repo_url: https://github.com/audioshake/alt-eval
paper_authors: Ondřej Cífka, Constantinos Dimitriou, Cheng-i Wang, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter
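for: This paper proposes a new lyrics transcription benchmark to enable more precise and reliable evaluation of lyrics transcription systems.
methods: It uses a newly created annotation guide that unifies music industry guidelines, covering punctuation, line breaks, spelling, background vocals, and non-word sounds. It also introduces a suite of evaluation metrics that, unlike the traditional word error rate, capture these phenomena.
results: The new annotation guide and metrics allow a better assessment of lyrics transcription systems and capture finer details of written lyrics, such as line breaks that convey rhythm, emotional emphasis, rhyme, and high-level structure.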

Abstract
Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics including formatting and punctuation, which leads to a potential misalignment with the creative products of musicians and songwriters as well as listeners' experiences. For example, line breaks are important in conveying information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. Firstly, a complete revision of the transcripts, geared specifically towards ALT evaluation by following a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Secondly, a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.

Summary
Jam-ALT is a new lyrics transcription benchmark based on the JamendoLyrics dataset, with two contributions. First, we have revised the transcripts specifically for ALT evaluation using a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Second, we have developed a suite of evaluation metrics that are designed to capture these phenomena, unlike the traditional word error rate. We hope that the proposed benchmark will contribute to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.

Efficient Trigger Word Insertion

paper_url: http://arxiv.org/abs/2311.13957
repo_url: None
paper_authors: Yueqi Zeng, Ziqiang Li, Pengfei Xia, Lei Liu, Bin Li
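for: Given the threat that backdoor attacks pose to deep neural network models, the paper's main objective is to reduce the number of poisoned samples while keeping text backdoor attacks effective.
methods: It proposes an efficient trigger word insertion strategy consisting of trigger word optimization and poisoned sample selection.
results: Across different datasets and models, the proposed method significantly improves the effectiveness of text backdoor attacks. In the dirty-label setting, only 10 poisoned samples are needed to reach an Attack Success Rate above 90%.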

Abstract
With the boom in the natural language processing (NLP) field these years, backdoor attacks pose immense threats against deep neural network models. However, previous works hardly consider the effect of the poisoning rate. In this paper, our main objective is to reduce the number of poisoned samples while still achieving a satisfactory Attack Success Rate (ASR) in text backdoor attacks. To accomplish this, we propose an efficient trigger word insertion strategy in terms of trigger word optimization and poisoned sample selection. Extensive experiments on different datasets and models demonstrate that our proposed method can significantly improve attack effectiveness in text classification tasks. Remarkably, our approach achieves an ASR of over 90% with only 10 poisoned samples in the dirty-label setting and requires merely 1.5% of the training data in the clean-label setting.

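Summary
With the growth of the natural language processing (NLP) field, backdoor attacks pose a serious threat to deep neural network models. However, previous work has hardly considered the effect of the poisoning rate. In this paper, our main objective is to reduce the number of poisoned samples while still achieving a satisfactory attack success rate (ASR) in text backdoor attacks. To this end, we propose an efficient trigger word insertion strategy that covers trigger word optimization and poisoned sample selection. Experiments on different datasets and models show that the proposed method significantly improves attack effectiveness in text classification tasks. In particular, it achieves an ASR above 90% with only 10 poisoned samples in the dirty-label setting and requires only 1.5% of the training data in the clean-label setting.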

paper_url: http://arxiv.org/abs/2311.13951
repo_url: https://github.com/freedomintelligence/mllm-bench
paper_authors: Wentao Ge, Shunian Chen, Guiming Chen, Junying Chen, Zhihong Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang
for: The paper addresses the challenge of evaluating the efficacy of multi-modal language models (MLLMs), which arises from the subjective nature of tasks that lack definitive answers.
methods: The paper introduces MLLM-Bench, a benchmark inspired by Vicuna that spans a diverse array of scenarios, including Perception, Understanding, Applying, Analyzing, Evaluating, and Creation, to provide a more holistic assessment of model performance.
results: Comparative evaluations indicate a significant performance gap between existing open-source models and GPT-4V, supporting the use of MLLM-Bench for assessing the capabilities of vision-language models.

Abstract
In the pursuit of Artificial General Intelligence (AGI), the integration of vision in language models has marked a significant milestone. The advent of vision-language models (MLLMs) like GPT-4V has expanded AI applications, aligning with the multi-modal capabilities of the human brain. However, evaluating the efficacy of MLLMs poses a substantial challenge due to the subjective nature of tasks that lack definitive answers. Existing automatic evaluation methodologies on multi-modal large language models rely on objective queries that have standard answers, inadequately addressing the nuances of creative and associative multi-modal tasks. To address this, we introduce MLLM-Bench, an innovative benchmark inspired by Vicuna, spanning a diverse array of scenarios, including Perception, Understanding, Applying, Analyzing, Evaluating, and Creation, along with ethical considerations. MLLM-Bench is designed to reflect user experience more accurately and provide a more holistic assessment of model performance. Comparative evaluations indicate a significant performance gap between existing open-source models and GPT-4V. We posit that MLLM-Bench will catalyze progress in the open-source community towards developing user-centric vision-language models that meet a broad spectrum of real-world applications. See the online leaderboard at https://mllm-bench.llmzoo.com.

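Summary
In the pursuit of artificial general intelligence (AGI), the integration of vision into language models marks a significant milestone. Vision-language models (MLLMs) such as GPT-4V have expanded the range of AI applications, in line with the multi-modal capabilities of the human brain. However, evaluating MLLMs is a substantial challenge because of the subjective nature of tasks that lack definitive answers. Existing automatic evaluation methods for multi-modal large language models rely on objective queries with standard answers and do not adequately address the nuances of creative and associative multi-modal tasks. To address this, we introduce MLLM-Bench, a benchmark inspired by Vicuna that spans diverse scenarios, including perception, understanding, applying, analyzing, evaluating, and creation, along with ethical considerations. MLLM-Bench is designed to reflect user experience more accurately and to provide a more holistic assessment of model performance. Comparative evaluations show a significant performance gap between existing open-source models and GPT-4V. We expect MLLM-Bench to drive progress in the open-source community toward user-centric vision-language models that meet a broad range of real-world applications. See the online leaderboard at https://mllm-bench.llmzoo.com.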

Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification

paper_url: http://arxiv.org/abs/2311.13937
repo_url: None
paper_authors: Daryna Dementieva, Daniil Moskovskiy, David Dale, Alexander Panchenko
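for: This study explores strategies for cross-lingual text detoxification, specifically using a parallel detoxification corpus in one language to enable detoxification in another language.
methods: The study draws on existing detoxification methods, including word embeddings, BERT models, and text translation models. It also proposes a new task in which text translation and detoxification are performed simultaneously, and provides several strong baselines for it.
results: By evaluating detoxification across several languages, the study identifies effective cross-lingual detoxification strategies. It also introduces new automatic detoxification evaluation metrics that correlate better with human judgments than previous benchmarks.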

Abstract
Text detoxification is the task of transferring the style of text from toxic to neutral. While there are approaches yielding promising results in the monolingual setup, e.g., (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer for this task remains a challenging open problem (Moskovskiy et al., 2022). In this work, we present a large-scale study of strategies for cross-lingual text detoxification: given a parallel detoxification corpus for one language, the goal is to transfer detoxification ability to another language for which we do not have such a corpus. Moreover, we are the first to explore a new task where text translation and detoxification are performed simultaneously, providing several strong baselines for this task. Finally, we introduce new automatic detoxification evaluation metrics with higher correlations with human judgments than previous benchmarks. We also assess the most promising approaches with manual markup, determining the best strategy to transfer the knowledge of text detoxification between languages.

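Summary
Text detoxification is the task of transferring the style of text from toxic to neutral. Although several approaches yield promising results in the monolingual setting (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer remains a challenging open problem (Moskovskiy et al., 2022). In this work, we present a large-scale study of strategies for cross-lingual text detoxification: given a parallel detoxification corpus for one language, the goal is to transfer detoxification ability to another language. We are also the first to explore a new task in which text translation and detoxification are performed simultaneously, and we provide several strong baselines for it. Finally, we introduce new automatic detoxification evaluation metrics that correlate more highly with human judgments. We assess the most promising approaches with manual markup as well, determining the best strategy for transferring text detoxification knowledge between languages.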

Some Like It Small: Czech Semantic Embedding Models for Industry Applications

paper_url: http://arxiv.org/abs/2311.13921
repo_url: https://github.com/seznam/czech-semantic-embedding-models
paper_authors: Jiří Bednář, Jakub Náplava, Petra Barančíková, Ondřej Lisický
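for: This paper focuses on the development and evaluation of small Czech sentence embedding models.
methods: It uses pre-training, knowledge distillation, and unsupervised contrastive fine-tuning to cope with the limited availability of labeled Czech data.
results: Intrinsic and extrinsic analyses show that the small models are competitive with significantly larger counterparts while being approximately 8 times smaller and 5 times faster. The models have also been deployed in the Seznam.cz search engine, improving the search experience in organic search, featured snippets, and image search.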

Abstract
This article focuses on the development and evaluation of Small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled Czech data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to significantly larger counterparts, with approximately 8 times smaller size and 5 times faster speed than conventional Base-sized models. To promote cooperation and reproducibility, both the models and the evaluation pipeline are made publicly accessible. Ultimately, this article presents practical applications of the developed sentence embedding models in Seznam.cz, the Czech search engine. These models have effectively replaced previous counterparts, enhancing the overall search experience, for instance in organic search, featured snippets, and image search. This transition has yielded improved performance.

Dialogue Quality and Emotion Annotations for Customer Support Conversations

paper_url: http://arxiv.org/abs/2311.13910
repo_url: https://github.com/johndmendonca/maia-dqe
paper_authors: John Mendonça, Patrícia Pereira, Miguel Menezes, Vera Cabarrão, Ana C. Farinha, Helena Moniz, João Paulo Carvalho, Alon Lavie, Isabel Trancoso
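for: This paper targets the generalisability of models across languages and domains in dialogue applications.
methods: The paper considers large language models (LLMs) and presents benchmarks for Emotion Recognition and Dialogue Quality Estimation on bilingual customer support conversations.
results: The paper provides a holistic annotation approach for dialogue quality and emotion, and a valuable resource for the development of dialogue applications.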

Abstract
Task-oriented conversational datasets often lack topic variability and linguistic diversity. However, with the advent of Large Language Models (LLMs) pretrained on extensive, multilingual and diverse text data, these limitations seem overcome. Nevertheless, their generalisability to different languages and domains in dialogue applications remains uncertain without benchmarking datasets. This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations. By performing annotations that take into consideration the complete instances that compose a conversation, one can form a broader perspective of the dialogue as a whole. Furthermore, it provides a unique and valuable resource for the development of text classification models. To this end, we present benchmarks for Emotion Recognition and Dialogue Quality Estimation and show that further research is needed to leverage these models in a production setting.

Grammatical Error Correction via Mixed-Grained Weighted Training

paper_url: http://arxiv.org/abs/2311.13848
repo_url: None
paper_authors: Jiahao Li, Quan Wang, Chiwei Zhu, Zhendong Mao, Yongdong Zhang
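for: To improve Grammatical Error Correction (GEC), the automatic correction of grammatical errors.
methods: The approach designs token-level and sentence-level training weights based on two aspects of the data, annotation accuracy and the diversity of potential annotations, and then performs mixed-grained weighted training.
results: In both the Seq2Seq and Seq2Edit settings, MainGEC achieves consistent and statistically significant improvements on two benchmark datasets, demonstrating the effectiveness of mixed-grained weighted training.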

Abstract
The task of Grammatical Error Correction (GEC) aims to automatically correct grammatical errors in natural texts. Almost all previous works treat annotated training data equally, but inherent discrepancies in data are neglected. In this paper, the inherent discrepancies are manifested in two aspects, namely, accuracy of data annotation and diversity of potential annotations. To this end, we propose MainGEC, which designs token-level and sentence-level training weights based on inherent discrepancies in accuracy and potential diversity of data annotation, respectively, and then conducts mixed-grained weighted training to improve the training effect for GEC. Empirical evaluation shows that whether in the Seq2Seq or Seq2Edit manner, MainGEC achieves consistent and significant performance improvements on two benchmark datasets, demonstrating the effectiveness and superiority of the mixed-grained weighted training. Further ablation experiments verify the effectiveness of designed weights of both granularities in MainGEC.
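
The idea of combining token-level and sentence-level weights in one training objective can be sketched as a weighted negative log-likelihood. This is only an illustration under stated assumptions: the abstract does not give MainGEC's exact loss or how its weights are computed, so the multiplicative combination, the normalization by token count, and the name `mixed_grained_nll` are all hypothetical.

```python
import math
from typing import List


def mixed_grained_nll(
    token_probs: List[List[float]],
    token_weights: List[List[float]],
    sentence_weights: List[float],
) -> float:
    """Average negative log-likelihood with per-token and per-sentence weights.

    token_probs[s][t]   : model probability of the gold token t in sentence s
    token_weights[s][t] : token-level weight (assumed to be given)
    sentence_weights[s] : sentence-level weight (assumed to be given)
    """
    total_loss = 0.0
    token_count = 0
    for probs, weights, sentence_weight in zip(
        token_probs, token_weights, sentence_weights
    ):
        if len(probs) != len(weights):
            raise ValueError("each sentence needs one weight per token")
        for prob, token_weight in zip(probs, weights):
            # Each token's loss is scaled by both its own weight and its sentence's weight.
            total_loss -= sentence_weight * token_weight * math.log(prob)
            token_count += 1
    if token_count == 0:
        raise ValueError("at least one token is required")
    return total_loss / token_count
```

With all weights set to 1.0 this reduces to the ordinary unweighted loss, which corresponds to treating all annotated training data equally.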

Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

paper_url: http://arxiv.org/abs/2311.13833
repo_url: https://github.com/sam-motamed/Lego
paper_authors: Saman Motamed, Danda Pani Paudel, Luc Van Gool
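for: This study aims to develop a textual inversion method based on a few example images, to enable customized content creation.
methods: The method uses a simple yet effective Subject Separation step and a Context Loss to invert subject-entangled single- and multi-embedding concepts.
results: In a thorough user study, Lego-generated concepts were preferred over 70% of the time compared with the baseline. Visual question answering with a large language model also suggests that Lego-generated concepts are better aligned with the text description of the concept.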

Abstract
Diffusion models have revolutionized generative content creation and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting more general concepts that go beyond object appearance and style (adjectives and verbs) through natural language, remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods. 1) Adjectives and verbs are entangled with nouns (subject) and can hinder appearance-based inversion methods, where the subject appearance leaks into the concept embedding and 2) describing such concepts often extends beyond single word embeddings (being frozen in ice, walking on a tightrope, etc.) that current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested Lego-generated concepts are better aligned with the text description of the concept.

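Summary
Diffusion models have revolutionized generative content creation, and text-to-image (T2I) diffusion models in particular have increased users' creative freedom by letting them synthesize scenes from natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To generate customized content from a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the concept and synthesize it in new scenes. However, inverting more general concepts (such as adjectives and verbs) through natural language remains a challenge. Two characteristics of these concepts limit current inversion methods: 1) adjectives and verbs are entangled with nouns (the subject), so the subject's appearance leaks into the concept embedding, and 2) describing such concepts often takes more than a single word embedding (for example, being frozen in ice or walking on a tightrope), which current methods do not handle. This study introduces Lego, a textual inversion method designed to invert subject-entangled concepts from a few example images. Lego uses a simple yet effective Subject Separation step and a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over the baseline more than 70% of the time. In addition, visual question answering with a large language model suggested that Lego-generated concepts align better with the text description of the concept.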

AdaTyper: Adaptive Semantic Column Type Detection

paper_url: http://arxiv.org/abs/2311.13806
repo_url: https://github.com/madelonhulsebos/adatyper
paper_authors: Madelon Hulsebos, Paul Groth, Çağatay Demiralp
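for: This study aims to improve the understanding of relational tables in automated data exploration and preparation systems.
methods: The paper proposes AdaTyper, which uses weak supervision to adapt to new semantic types and shifted data distributions.
results: Experiments show that AdaTyper adapts better to new semantic types and shifted data distributions, approaching an average precision of 0.6 after seeing only 5 examples.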

Abstract
Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a gap between this performance and its applicability in practice. In this paper, we propose AdaTyper to address one of the most critical deployment challenges: adaptation. AdaTyper uses weak-supervision to adapt a hybrid type predictor towards new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor of AdaTyper combines rule-based methods and a light machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the f1-score improves for new and existing types. AdaTyper approaches an average precision of 0.6 after only seeing 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.

Summary
To address this challenge, we propose AdaTyper, which uses weak-supervision to adapt a hybrid type predictor to new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor combines rule-based methods and a light machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the f1-score improves for new and existing types. AdaTyper approaches an average precision of 0.6 after only seeing 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.
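
The idea of a hybrid predictor (rules first, a learned component second, adapted from a handful of labelled examples) can be illustrated with a minimal sketch. This is not AdaTyper's implementation: the class name `HybridTypePredictor`, the two regex rules, the 0.8 threshold, and the value-overlap fallback that stands in for the "light machine learning model" are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical rule set: regexes for two illustrative semantic types.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "year": re.compile(r"^(19|20)\d{2}$"),
}


class HybridTypePredictor:
    """Toy hybrid predictor: regex rules first, then a value-overlap fallback."""

    def __init__(self, rule_threshold=0.8):
        # Minimum share of values that must match a rule (assumed value).
        self.rule_threshold = rule_threshold
        # Type name -> set of values seen in user-labelled example columns.
        self.examples = defaultdict(set)

    def adapt(self, column_values, semantic_type):
        """Record one labelled column as evidence for a (possibly new) type."""
        self.examples[semantic_type].update(
            v.strip().lower() for v in column_values
        )

    def predict(self, column_values):
        """Return a type name, or None if neither stage produces a match."""
        values = [v.strip() for v in column_values if v.strip()]
        if not values:
            return None
        # Stage 1: rule-based detection.
        for name, pattern in RULES.items():
            share = sum(bool(pattern.match(v)) for v in values) / len(values)
            if share >= self.rule_threshold:
                return name
        # Stage 2: stand-in for a learned model, using overlap with
        # values from the labelled example columns.
        lowered = {v.lower() for v in values}
        best_type, best_overlap = None, 0.0
        for name, seen in self.examples.items():
            overlap = len(lowered & seen) / len(lowered)
            if overlap > best_overlap:
                best_type, best_overlap = name, overlap
        return best_type
```

In this sketch, calling `adapt(["Berlin", "Paris", "Rome"], "city")` once is enough for `predict(["Paris", "Rome", "Oslo"])` to return the new type. A real system would replace the overlap heuristic with a trained classifier over learned column representations.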

DaG LLM ver 1.0: Pioneering Instruction-Tuned Language Modeling for Korean NLP

paper_url: http://arxiv.org/abs/2311.13784
repo_url: None
paper_authors: Dongjun Jang, Sangah Lee, Sungjoo Byun, Jinwoong Kim, Jean Seo, Minseok Kim, Soyeon Kim, Chaeyoung Oh, Jaeyoon Kim, Hyemi Jo, Hyopil Shin
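for: This paper presents DaG LLM, a large language model for Korean, fine-tuned through instruction tuning across 41 tasks in 13 distinct categories.
methods: The paper takes a large language model (LLM) and fine-tunes it through instruction tuning.
results: After tuning, DaG LLM achieved strong results on all 41 tasks across the 13 categories.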

Abstract
This paper presents the DaG LLM (David and Goliath Large Language Model), a language model specialized for Korean and fine-tuned through Instruction Tuning across 41 tasks within 13 distinct categories.

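Summary
This paper introduces DaG LLM (David and Goliath Large Language Model), a language model designed specifically for Korean and fine-tuned through Instruction Tuning across 41 tasks in 13 distinct categories.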

Transformer-based Named Entity Recognition in Construction Supply Chain Risk Management in Australia

paper_url: http://arxiv.org/abs/2311.13755
repo_url: None
paper_authors: Milad Baghalzadeh Shishehgarkhaneh, Robert C. Moehler, Yihai Fang, Amer A. Hijazi, Hamed Aboutorab
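for: This study explores the use of transformer models for Named Entity Recognition (NER) to improve supply chain risk management (SCRM) in the Australian construction industry.
methods: The study trains several transformer models for NER on news articles to identify and classify specific risk-associated entities.
results: By analysing news articles, the transformer models extract entities and related information for specific risk categories, providing more valuable information for Australian construction supply chain risk management.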

Abstract
The construction industry in Australia is characterized by its intricate supply chains and vulnerability to myriad risks. As such, effective supply chain risk management (SCRM) becomes imperative. This paper employs different transformer models and trains them for Named Entity Recognition (NER) in the context of Australian construction SCRM. Utilizing NER, transformer models identify and classify specific risk-associated entities in news articles, offering a detailed insight into supply chain vulnerabilities. By analysing news articles through different transformer models, we can extract relevant entities and insights related to specific risk taxonomies local (milieu) to the Australian construction landscape. This research emphasises the potential of NLP-driven solutions, like transformer models, in revolutionising SCRM for construction in geo-media specific contexts.

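Summary
Supply chain risk management (SCRM) is very important for the Australian construction industry. This paper uses different transformer models for named entity recognition (NER) to identify specific risk-associated entities in Australian construction supply chains. By analysing news articles, the transformer models extract entities and insights related to specific risk categories, giving a more detailed view of Australian construction supply chain management. The study argues that natural language processing (NLP) techniques such as transformer models can revolutionise SCRM in a geo-media specific context (the Australian construction sector).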
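
Token-classification models for NER typically emit one tag per token, which then has to be grouped into entity spans before any risk analysis. The sketch below shows that grouping step for the common BIO tagging scheme. It is a generic illustration, not the paper's pipeline: the function name `bio_to_entities` and the labels `RISK`, `ORG`, and `LOC` are hypothetical, and the paper does not state which tag set or decoding rule it uses.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group token-level BIO tags into (entity_text, entity_type) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        """Close the open entity, if any, and reset the buffer."""
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (this sketch drops such stray I- tokens).
            flush()
    flush()
    return entities
```

For example, the tokens `["Cement", "shortage", "delays", "Acme", "Builders", "in", "Sydney"]` with tags `["B-RISK", "I-RISK", "O", "B-ORG", "I-ORG", "O", "B-LOC"]` are grouped into three entities: a risk, an organisation, and a location. The company name is made up for the example.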

2023-11-23

Annotation Sensitivity: Training Data Collection Methods Affect Model Performance

A Systematic Review of Deep Learning-based Research on Radiology Report Generation

Question Answering in Natural Language: the Special Case of Temporal Expressions

Searching for Snippets of Open-Domain Dialogue in Task-Oriented Dialogue Datasets

Enhancing Task-Oriented Dialogues with Chitchat: a Comparative Study Based on Lexical Diversity and Divergence

Do VSR Models Generalize Beyond LRS3?

Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark

Efficient Trigger Word Insertion

MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V

Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification

Some Like It Small: Czech Semantic Embedding Models for Industry Applications

Dialogue Quality and Emotion Annotations for Customer Support Conversations

Grammatical Error Correction via Mixed-Grained Weighted Training

Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

AdaTyper: Adaptive Semantic Column Type Detection

DaG LLM ver 1.0: Pioneering Instruction-Tuned Language Modeling for Korean NLP

Transformer-based Named Entity Recognition in Construction Supply Chain Risk Management in Australia