cs.CL - 2023-09-04

Attention-Driven Multi-Modal Fusion: Enhancing Sign Language Recognition and Translation

  • paper_url: http://arxiv.org/abs/2309.01860
  • repo_url: None
  • paper_authors: Zaber Ibn Abdul Hakim, Rasman Mubtasim Swargo, Muhammad Abdullah Adnan
  • for: This work aims to extend an existing continuous sign language recognition and translation pipeline to incorporate multi-modal information.
  • methods: A cross-modal encoder integrates optical flow information with RGB images, enriching the features with movement-related information (a minimal sketch of such a fusion block follows this entry).
  • results: Incorporating the extra modality improves both tasks: on the RWTH-PHOENIX-2014 dataset the approach reduces the WER for recognition by 0.9, and on the RWTH-PHOENIX-2014T dataset it increases most BLEU scores for translation by ~0.6.
    Abstract In this paper, we devise a mechanism for the addition of multi-modal information with an existing pipeline for continuous sign language recognition and translation. In our procedure, we have incorporated optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we have used is very lightweight and doesn't need to include a separate feature extractor for the new modality in an end-to-end manner. We have applied the changes in both sign language recognition and translation, improving the result in each case. We have evaluated the performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduced the WER by 0.9, and on the translation task, our approach increased most of the BLEU scores by ~0.6 on the test set.
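The plugin itself is not public (repo_url is None), but the pattern the methods bullet describes, enriching RGB features with optical-flow features through a lightweight cross-modal attention block, can be sketched as below. The module layout, feature dimensions, and residual connection are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-modal encoder: RGB features attend to optical-flow features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feats, flow_feats):
        # Query with RGB, key/value with flow: movement cues enrich appearance.
        fused, _ = self.attn(rgb_feats, flow_feats, flow_feats)
        return self.norm(rgb_feats + fused)  # residual keeps the RGB signal

# Toy usage: one video, 100 frames, 256-dim features per modality.
rgb = torch.randn(1, 100, 256)   # stand-in for RGB frame features
flow = torch.randn(1, 100, 256)  # stand-in for optical-flow features
print(CrossModalFusion()(rgb, flow).shape)  # torch.Size([1, 100, 256])
```

Because the fused output has the same shape as the RGB features, a block like this could slot into an existing pipeline without a separate feature extractor for the new modality, matching the lightweight plugin idea in the abstract.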

Minimal Effective Theory for Phonotactic Memory: Capturing Local Correlations due to Errors in Speech

  • paper_url: http://arxiv.org/abs/2309.02466
  • repo_url: None
  • paper_authors: Paul Myles Eugenio
  • for: This paper examines how the economy of speech constrains the evolution of spoken language, and how the resulting local phonetic correlations make spoken words easier to learn.
  • methods: A locally-connected tensor-network model, inspired by variational models used in many-body physics, exploits local phonetic correlations to facilitate the learning of spoken words (a much-simplified illustration of local correlations follows this entry).
  • results: The model facilitates word learning by reducing the information content of words, can generate new words that are phonetically reasonable for the target language, and yields a hierarchy of the most likely errors that could be produced during speech. It is tested on Latin and Turkish words.
    Abstract Spoken language evolves constrained by the economy of speech, which depends on factors such as the structure of the human mouth. This gives rise to local phonetic correlations in spoken words. Here we demonstrate that these local correlations facilitate the learning of spoken words by reducing their information content. We do this by constructing a locally-connected tensor-network model, inspired by similar variational models used for many-body physics, which exploits these local phonetic correlations to facilitate the learning of spoken words. The model is therefore a minimal model of phonetic memory, where "learning to pronounce" and "learning a word" are one and the same. A consequence of which is the learned ability to produce new words which are phonetically reasonable for the target language; as well as providing a hierarchy of the most likely errors that could be produced during the action of speech. We test our model against Latin and Turkish words. (The code is available on GitHub.)
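The paper's tensor network does not fit in a short sketch, but its core intuition, that nearest-neighbour phonetic correlations lower a word's information content and support generating plausible new words, can be illustrated with the simplest locally-correlated model: smoothed bigram statistics. The toy word list and smoothing constant below are made up, and this bigram stand-in is not the authors' model.

```python
import math, random
from collections import defaultdict

words = ["aqua", "terra", "via", "vita", "luna", "amare", "amicus", "aquila"]

counts = defaultdict(lambda: defaultdict(int))
for w in words:
    for a, b in zip("^" + w, w + "$"):  # ^ and $ mark word boundaries
        counts[a][b] += 1

ALPHABET = sorted({c for w in words for c in w}) + ["$"]

def prob(a, b, alpha=0.5):
    total = sum(counts[a].values())
    return (counts[a][b] + alpha) / (total + alpha * len(ALPHABET))  # smoothed

def information(word):
    """Bits needed to specify the word; local correlations push this down."""
    return -sum(math.log2(prob(a, b)) for a, b in zip("^" + word, word + "$"))

def sample(max_len=8):
    """Generate a new word that respects the local (bigram) phonotactics."""
    out, c = "", "^"
    while len(out) < max_len:
        nxt = random.choices(list(counts[c]), weights=counts[c].values())[0]
        if nxt == "$":
            break
        out, c = out + nxt, nxt
    return out

print(information("vita"), information("xqzt"))  # plausible word vs. random string
print(sample())  # a novel but phonotactically reasonable string
```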

Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical Texts

  • paper_url: http://arxiv.org/abs/2309.01812
  • repo_url: https://github.com/ylaboratory/flambe
  • paper_authors: Ruth Dannenfelser, Jeffrey Zhong, Ran Zhang, Vicky Yao
  • for: The paper provides a dataset for procedural knowledge extraction in the biomedical domain, to enable further development of natural language processing (NLP) models.
  • methods: Experts manually curated procedural knowledge from academic papers in the biomedical domain, covering a series of complementary tasks including named entity recognition (NER) and named entity disambiguation (NED).
  • results: The dataset offers large manually curated NER and NED resources for tissue/cell type, supporting further NLP model development and helping advance reproducibility in biomedical research.
    Abstract Many of the most commonly explored natural language processing (NLP) information extraction tasks can be thought of as evaluations of declarative knowledge, or fact-based information extraction. Procedural knowledge extraction, i.e., breaking down a described process into a series of steps, has received much less attention, perhaps in part due to the lack of structured datasets that capture the knowledge extraction process from end-to-end. To address this unmet need, we present FlaMBé (Flow annotations for Multiverse Biological entities), a collection of expert-curated datasets across a series of complementary tasks that capture procedural knowledge in biomedical texts. This dataset is inspired by the observation that one ubiquitous source of procedural knowledge that is described as unstructured text is within academic papers describing their methodology. The workflows annotated in FlaMBé are from texts in the burgeoning field of single cell research, a research area that has become notorious for the number of software tools and complexity of workflows used. Additionally, FlaMBé provides, to our knowledge, the largest manually curated named entity recognition (NER) and disambiguation (NED) datasets for tissue/cell type, a fundamental biological entity that is critical for knowledge extraction in the biomedical research domain. Beyond providing a valuable dataset to enable further development of NLP models for procedural knowledge extraction, automating the process of workflow mining also has important implications for advancing reproducibility in biomedical research.

Are Emergent Abilities in Large Language Models just In-Context Learning?

  • paper_url: http://arxiv.org/abs/2309.01809
  • repo_url: https://github.com/ukplab/on-emergence
  • paper_authors: Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych
  • for: This paper aims to provide a comprehensive examination of the emergent abilities of large language models, specifically looking at the role of in-context learning in their performance.
  • methods: The authors use a set of 18 models of varying size (60 million to 175 billion parameters) and test them on a set of 22 tasks. They conduct over 1,000 experiments to evaluate the models' performance and determine the underlying mechanisms driving their emergent abilities (a toy zero-shot vs. few-shot probe follows this entry).
  • results: The authors find that the emergent abilities of large language models can primarily be attributed to in-context learning, and there is no evidence for the emergence of reasoning abilities. This provides valuable insights into the use of these models and alleviates safety concerns regarding their performance.
    Abstract Large language models have exhibited emergent abilities, demonstrating exceptional performance across diverse tasks for which they were not explicitly trained, including those that require complex reasoning abilities. The emergence of such abilities carries profound implications for the future direction of research in NLP, especially as the deployment of such models becomes more prevalent. However, one key challenge is that the evaluation of these abilities is often confounded by competencies that arise in models through alternative prompting techniques, such as in-context learning and instruction following, which also emerge as the models are scaled up. In this study, we provide the first comprehensive examination of these emergent abilities while accounting for various potentially biasing factors that can influence the evaluation of models. We conduct rigorous tests on a set of 18 models, encompassing a parameter range from 60 million to 175 billion parameters, across a comprehensive set of 22 tasks. Through an extensive series of over 1,000 experiments, we provide compelling evidence that emergent abilities can primarily be ascribed to in-context learning. We find no evidence for the emergence of reasoning abilities, thus providing valuable insights into the underlying mechanisms driving the observed abilities and thus alleviating safety concerns regarding their use.
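A toy version of the paper's central comparison can be sketched as follows: measure the same task zero-shot and few-shot and attribute the gap to in-context learning. The `generate` stub, prompt format, and example items are placeholders, not the paper's 22-task protocol.

```python
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in the LLM under test here")

def build_prompt(question, shots):
    demos = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in shots)
    return f"{demos}Q: {question}\nA:"

def accuracy(items, shots):
    hits = sum(generate(build_prompt(q, shots)).strip().startswith(a)
               for q, a in items)
    return hits / len(items)

items = [("Is 'not bad' positive or negative?", "positive")]
demos = [("Is 'terrible' positive or negative?", "negative")]

# If accuracy(items, demos) is high while accuracy(items, []) stays near chance
# across model scales, the "emergent" gain is plausibly in-context learning.
```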

An Empirical Analysis for Zero-Shot Multi-Label Classification on COVID-19 CT Scans and Uncurated Reports

  • paper_url: http://arxiv.org/abs/2309.01740
  • repo_url: None
  • paper_authors: Ethan Dack, Lorenzo Brigato, Matthew McMurray, Matthias Fontanellaz, Thomas Frauenfelder, Hanno Hoppe, Aristomenis Exadaktylos, Thomas Geiser, Manuela Funke-Chambour, Andreas Christe, Lukas Ebner, Stavroula Mougiakakou
  • for: This paper studies zero-shot multi-label classification based on contrastive visual-language learning, to help radiologists diagnose COVID-19 and identify fine-grained lung findings.
  • methods: Unstructured hospital data and CT imaging are used for zero-shot multi-label classification, with models evaluated in collaboration with human experts (a CLIP-style sketch of such classification follows this entry).
  • results: The empirical analysis surveys possible solutions for these fine-grained tasks, which have so far been overlooked in the medical multimodal pretraining literature.
    Abstract The pandemic resulted in vast repositories of unstructured data, including radiology reports, due to increased medical examinations. Previous research on automated diagnosis of COVID-19 primarily focuses on X-ray images, despite their lower precision compared to computed tomography (CT) scans. In this work, we leverage unstructured data from a hospital and harness the fine-grained details offered by CT scans to perform zero-shot multi-label classification based on contrastive visual language learning. In collaboration with human experts, we investigate the effectiveness of multiple zero-shot models that aid radiologists in detecting pulmonary embolisms and identifying intricate lung details like ground glass opacities and consolidations. Our empirical analysis provides an overview of the possible solutions to target such fine-grained tasks, so far overlooked in the medical multimodal pretraining literature. Our investigation promises future advancements in the medical image analysis community by addressing some challenges associated with unstructured data and fine-grained multi-label classification.
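A rough sketch of the CLIP-style zero-shot multi-label setup the abstract describes is below. The checkpoint, prompt templates, and 0.5 threshold are illustrative assumptions: raw similarity logits would need calibration in practice, and a clinical system would use a medically pretrained vision-language model rather than stock CLIP.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["ground glass opacities", "consolidation", "pulmonary embolism"]
prompts = [f"a CT scan showing {label}" for label in labels]
image = Image.new("RGB", (224, 224))  # stand-in for a real CT slice

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # similarity to each prompt

# Multi-label: score each finding independently rather than softmax over labels.
scores = torch.sigmoid(logits)  # thresholds would need calibration in practice
for label, score in zip(labels, scores):
    print(f"{label}: {'present' if score > 0.5 else 'absent'} ({score:.2f})")
```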

Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction

  • paper_url: http://arxiv.org/abs/2309.01715
  • repo_url: https://github.com/20001LastOrder/Taxonomy-GPT
  • paper_authors: Boqi Chen, Fandi Yi, Dániel Varró
  • for: This paper proposes a general framework for automated taxonomy construction that respects structural constraints, improving taxonomies used in software modeling and natural language processing (NLP) activities.
  • methods: Appropriate user inputs (prompts) guide large language models (LLMs) such as GPT-3 across NLP tasks without explicit (re-)training; this prompting approach is systematically compared with fine-tuning (a minimal prompting sketch follows this entry).
  • results: Even without explicit training on the dataset, prompting outperforms fine-tuning-based approaches on a hypernym taxonomy and a novel computer science taxonomy dataset, and the gap widens when the training set is small. However, taxonomies generated by fine-tuning can easily be post-processed to satisfy all constraints, whereas handling constraint violations in prompted outputs is challenging.
    Abstract Taxonomies represent hierarchical relations between entities, frequently applied in various software modeling and natural language processing (NLP) activities. They are typically subject to a set of structural constraints restricting their content. However, manual taxonomy construction can be time-consuming, incomplete, and costly to maintain. Recent studies of large language models (LLMs) have demonstrated that appropriate user inputs (called prompting) can effectively guide LLMs, such as GPT-3, in diverse NLP tasks without explicit (re-)training. However, existing approaches for automated taxonomy construction typically involve fine-tuning a language model by adjusting model parameters. In this paper, we present a general framework for taxonomy construction that takes into account structural constraints. We subsequently conduct a systematic comparison between the prompting and fine-tuning approaches performed on a hypernym taxonomy and a novel computer science taxonomy dataset. Our result reveals the following: (1) Even without explicit training on the dataset, the prompting approach outperforms fine-tuning-based approaches. Moreover, the performance gap between prompting and fine-tuning widens when the training dataset is small. However, (2) taxonomies generated by the fine-tuning approach can be easily post-processed to satisfy all the constraints, whereas handling violations of the taxonomies produced by the prompting approach can be challenging. These evaluation findings provide guidance on selecting the appropriate method for taxonomy construction and highlight potential enhancements for both approaches.
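A minimal sketch of the prompting route is below: ask an LLM for each term's hypernym, assemble parent-child edges, and check a structural constraint (acyclicity), the step the paper finds hard to guarantee for prompted outputs. The prompt wording and the `call_llm` stub are assumptions, not the paper's templates.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in GPT-3 or another LLM here")

def hypernym_of(term: str) -> str:
    prompt = ("Give the single most specific hypernym (parent concept) "
              f"of the term.\nTerm: {term}\nHypernym:")
    return call_llm(prompt).strip().lower()

def build_taxonomy(terms):
    edges = {t: hypernym_of(t) for t in terms}  # child -> parent
    for t in terms:  # structural constraint: a taxonomy must be acyclic
        seen, cur = {t}, edges.get(t)
        while cur in edges:
            if cur in seen:
                raise ValueError(f"cycle through {cur!r}; needs post-processing")
            seen.add(cur)
            cur = edges[cur]
    return edges

# e.g. build_taxonomy(["transformer", "neural network", "machine learning"])
```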

MathAttack: Attacking Large Language Models Towards Math Solving Ability

  • paper_url: http://arxiv.org/abs/2309.01686
  • repo_url: None
  • paper_authors: Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, Kaizhu Huang
  • for: This paper examines the security of large language models (LLMs) with respect to their math-solving ability.
  • methods: The proposed MathAttack model attacks math word problem (MWP) samples while preserving the mathematical logic of the original problems: logical entity recognition first identifies and freezes logical entries, then a word-level attacker perturbs the remaining text (a simplified sketch follows this entry).
  • results: Experiments show that MathAttack effectively attacks the math-solving ability of LLMs. The authors find that: 1) adversarial samples generated from higher-accuracy LLMs also attack lower-accuracy LLMs (e.g., transferring from larger to smaller models, or from few-shot to zero-shot prompts); 2) complex MWPs (more solving steps, longer text, more numbers) are more vulnerable to attack; and 3) LLM robustness can be improved by using the adversarial samples in few-shot prompts.
    Abstract With the boom of Large Language Models (LLMs), the research of solving Math Word Problem (MWP) has recently made great progress. However, there are few studies to examine the security of LLMs in math solving ability. Instead of attacking prompts in the use of LLMs, we propose a MathAttack model to attack MWP samples which are closer to the essence of security in solving math problems. Compared to traditional text adversarial attack, it is essential to preserve the mathematical logic of original MWPs during the attacking. To this end, we propose logical entity recognition to identify logical entries which are then frozen. Subsequently, the remaining text is attacked by adopting a word-level attacker. Furthermore, we propose a new dataset RobustMath to evaluate the robustness of LLMs in math solving ability. Extensive experiments on our RobustMath and two other math benchmark datasets GSM8K and MultiArith show that MathAttack could effectively attack the math solving ability of LLMs. In the experiments, we observe that (1) Our adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy (e.g., transfer from larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2) Complex MWPs (such as more solving steps, longer text, more numbers) are more vulnerable to attack; (3) We can improve the robustness of LLMs by using our adversarial samples in few-shot prompts. Finally, we hope our practice and observation can serve as an important attempt towards enhancing the robustness of LLMs in math solving ability. We will release our code and dataset.
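A much-simplified sketch of the attack recipe: treat numbers as the frozen logical entities and substitute words elsewhere, so the arithmetic is untouched while the surface text shifts. The regex and toy synonym table below stand in for the paper's logical entity recognizer and word-level attacker.

```python
import re

SYNONYMS = {"bought": "purchased", "gave": "handed"}  # toy word-level attacker

def attack(mwp: str) -> str:
    out = []
    for tok in mwp.split():
        if re.fullmatch(r"\d+(\.\d+)?", tok.strip(".,?")):
            out.append(tok)  # frozen: numbers carry the problem's logic
        else:
            out.append(SYNONYMS.get(tok.lower(), tok))
    return " ".join(out)

print(attack("Tom bought 3 apples and gave 1 to Jane. How many are left?"))
# -> Tom purchased 3 apples and handed 1 to Jane. How many are left?
```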

CRUISE-Screening: Living Literature Reviews Toolbox

  • paper_url: http://arxiv.org/abs/2309.01684
  • repo_url: https://github.com/projectdossier/cruise-screening
  • paper_authors: Wojciech Kusa, Petr Knoth, Allan Hanbury
  • for: To help researchers find relevant studies quickly, increasing the efficiency and effectiveness of literature reviews.
  • methods: Text classification and question-answering models assist in screening relevant papers, and the tool connects to several search engines via an API so search results can be updated periodically.
  • results: A web-based application for conducting living literature reviews that supports screening and searching, sparing researchers manual effort and improving their productivity.
    Abstract Keeping up with research and finding related work is still a time-consuming task for academics. Researchers sift through thousands of studies to identify a few relevant ones. Automation techniques can help by increasing the efficiency and effectiveness of this task. To this end, we developed CRUISE-Screening, a web-based application for conducting living literature reviews - a type of literature review that is continuously updated to reflect the latest research in a particular field. CRUISE-Screening is connected to several search engines via an API, which allows for updating the search results periodically. Moreover, it can facilitate the process of screening for relevant publications by using text classification and question answering models. CRUISE-Screening can be used both by researchers conducting literature reviews and by those working on automating the citation screening process to validate their algorithms. The application is open-source: https://github.com/ProjectDoSSIER/cruise-screening, and a demo is available under this URL: https://citation-screening.ec.tuwien.ac.at. We discuss the limitations of our tool in Appendix A.

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

  • paper_url: http://arxiv.org/abs/2309.01669
  • repo_url: None
  • paper_authors: Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, Barbara Plank
  • for: This paper addresses how to apply Annotation Error Detection (AED) methods in generative settings, to improve the instruction-tuning of Large Language Models (LLMs).
  • methods: Three instruction-tuning datasets are annotated by experts and semi-automatic methods; four AED baselines for the generative setting are proposed and comprehensively evaluated (a loss-ranking sketch of one plausible baseline follows this entry).
  • results: Choosing the right AED method and model size proves crucial for instruction-tuning; a first case study examines how the quality of instruction-tuning datasets influences downstream performance.
    Abstract Instruction-tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality issues of gold-standard labels. But so far, the application of AED methods is limited to discriminative settings. It is an open question how well AED methods generalize to generative settings which are becoming widespread via generative LLMs. In this work, we present a first and new benchmark for AED on instruction-tuning data: Donkii. It encompasses three instruction-tuning datasets enriched with annotations by experts and semi-automatic methods. We find that all three datasets contain clear-cut errors that sometimes directly propagate into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them comprehensively on the newly introduced dataset. Our results demonstrate that choosing the right AED method and model size is indeed crucial, thereby deriving practical recommendations. To gain insights, we provide a first case-study to examine how the quality of the instruction-tuning datasets influences downstream performance.
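One plausible AED baseline for the generative setting (not necessarily one of the paper's four) ranks instruction-response pairs by a language model's per-example loss and flags the highest-loss pairs as candidate annotation errors. A minimal sketch with a small causal LM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def per_example_loss(instruction: str, response: str) -> float:
    """Mean token negative log-likelihood of the pair under the LM."""
    ids = tok(instruction + " " + response, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

pairs = [("Translate to French: cat", "chat"),
         ("Translate to French: dog", "banana")]  # second pair: injected error
ranked = sorted(pairs, key=lambda p: per_example_loss(*p), reverse=True)
print("most suspicious pair:", ranked[0])
```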

Evolving linguistic divergence on polarizing social media

  • paper_url: http://arxiv.org/abs/2309.01659
  • repo_url: https://github.com/andreskarjus/evolving_divergence
  • paper_authors: Andres Karjus, Christine Cuskley
  • for: This study examines linguistic divergence across the partisan left-right divide in the United States, particularly differences in language use on social media platforms.
  • methods: The study combines social media data mining, lexicostatistics, machine learning, large language models, and a systematic human annotation approach to describe and quantify linguistic divergence (a sketch of one such lexical statistic follows this entry).
  • results: There are signs of linguistic divergence across topics of conversation, word frequencies, sentiment, and lexical semantics (especially in topics and themes of conversation), in line with previous research; ongoing polarization of US English along partisan lines may eventually give rise to miscommunication.
    Abstract Language change is influenced by many factors, but often starts from synchronic variation, where multiple linguistic patterns or forms coexist, or where different speech communities use language in increasingly different ways. Besides regional or economic reasons, communities may form and segregate based on political alignment. The latter, referred to as political polarization, is of growing societal concern across the world. Here we map and quantify linguistic divergence across the partisan left-right divide in the United States, using social media data. We develop a general methodology to delineate (social) media users by their political preference, based on which (potentially biased) news media accounts they do and do not follow on a given platform. Our data consists of 1.5M short posts by 10k users (about 20M words) from the social media platform Twitter (now "X"). Delineating this sample involved mining the platform for the lists of followers (n=422M) of 72 large news media accounts. We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji. We find signs of linguistic divergence across all these aspects, especially in topics and themes of conversation, in line with previous research. While US American English remains largely intelligible within its large speech community, our findings point at areas where miscommunication may eventually arise given ongoing polarization and therefore potential linguistic divergence. Our methodology - combining data mining, lexicostatistics, machine learning, large language models and a systematic human annotation approach - is largely language and platform agnostic. In other words, while we focus here on US political divides and US English, the same approach is applicable to other countries, languages, and social media platforms.
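One lexicostatistic suited to the word-frequency divergence the study quantifies is the log-odds ratio with an informative Dirichlet prior (Monroe et al., 2008). The sketch below uses toy counts in place of the Twitter data; treating this as the paper's exact statistic is an assumption.

```python
import math
from collections import Counter

left = Counter({"healthcare": 120, "climate": 90, "freedom": 30})
right = Counter({"healthcare": 40, "climate": 20, "freedom": 150})
prior = left + right  # background counts from the pooled corpus

def log_odds_z(word):
    a0, n0, n1 = sum(prior.values()), sum(left.values()), sum(right.values())
    y0, y1, aw = left[word], right[word], prior[word]
    delta = (math.log((y0 + aw) / (n0 + a0 - y0 - aw))
             - math.log((y1 + aw) / (n1 + a0 - y1 - aw)))
    var = 1 / (y0 + aw) + 1 / (y1 + aw)
    return delta / math.sqrt(var)  # z-score: positive = left-leaning usage

for w in prior:
    print(w, round(log_odds_z(w), 2))
```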

Exploring the effectiveness of ChatGPT-based feedback compared with teacher feedback and self-feedback: Evidence from Chinese to English translation

  • paper_url: http://arxiv.org/abs/2309.01645
  • repo_url: None
  • paper_authors: Siyi Cao, Linping Zhong
  • for: This study compares the effectiveness of ChatGPT-based feedback with teacher feedback (TF) and self-feedback (SF) on Chinese-to-English translations produced by Chinese Master of Translation and Interpretation (MTI) students who learned English as a second/foreign language.
  • methods: BLEU scores gauge overall translation quality, and Coh-Metrix analyzes linguistic features of the translated texts across lexicon, syntax, and cohesion (a BLEU sketch follows this entry).
  • results: TF- and SF-guided translations scored higher than those revised with ChatGPT-based feedback, but ChatGPT-based feedback better enhanced lexical capability and referential cohesion, while TF and SF were more effective at developing syntax-related skills, notably correcting misuse of the passive voice.
    Abstract ChatGPT, a cutting-edge AI-powered chatbot, can quickly generate responses to given commands. While ChatGPT has been reported to be capable of delivering useful feedback, its effectiveness compared with conventional feedback approaches, such as teacher feedback (TF) and self-feedback (SF), remains unclear. To address this issue, this study compared the revised Chinese to English translation texts produced by Chinese Master of Translation and Interpretation (MTI) students, who learned English as a Second/Foreign Language (ESL/EFL), based on three feedback types (i.e., ChatGPT-based feedback, TF and SF). The data was analyzed using BLEU score to gauge the overall translation quality as well as Coh-Metrix to examine linguistic features across three dimensions: lexicon, syntax, and cohesion. The findings revealed that TF- and SF-guided translation texts surpassed those with ChatGPT-based feedback, as indicated by the BLEU score. In terms of linguistic features, ChatGPT-based feedback demonstrated superiority, particularly in enhancing lexical capability and referential cohesion in the translation texts. However, TF and SF proved more effective in developing syntax-related skills, as they addressed instances of incorrect usage of the passive voice. These diverse outcomes indicate ChatGPT's potential as a supplementary resource, complementing traditional teacher-led methods in translation practice.
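A minimal sketch of the BLEU measurement step, with toy sentences standing in for the students' revised translations; the sacrebleu package is one common implementation, and the study does not name which it used.

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["The cat sits on the mat.", "He goes to school every day."]
references = [["The cat is sitting on the mat.", "He attends school daily."]]
# one reference stream, aligned one-to-one with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```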

Critical Behavioral Traits Foster Peer Engagement in Online Mental Health Communities

  • paper_url: http://arxiv.org/abs/2309.01618
  • repo_url: None
  • paper_authors: Aseem Srivastava, Tanya Gupta, Alison Cerezo, Sarah Peregrine Lord, Md Shad Akhtar, Tanmoy Chakraborty
  • for: This paper aims to explore the factors that drive peer engagement within counseling threads on online mental health communities, such as Reddit, to enhance our understanding of this critical phenomenon.
  • methods: The study uses a novel dataset called BeCOPE, which consists of over 10,000 posts and 58,000 comments from 21 mental health-specific subreddits, annotated with three major fine-grained behavior labels (intent, criticism, and readability) and emotion labels.
  • results: Self-criticism is the most prevalent form of criticism expressed by help-seekers, and individuals who explicitly express their need for help are more likely to receive assistance than those who present surveys or engage in rants. Well-articulated problem descriptions also matter: superior readability effectively doubles the likelihood of receiving the sought-after support.
    Abstract Online Mental Health Communities (OMHCs), such as Reddit, have witnessed a surge in popularity as go-to platforms for seeking information and support in managing mental health needs. Platforms like Reddit offer immediate interactions with peers, granting users a vital space for seeking mental health assistance. However, the largely unregulated nature of these platforms introduces intricate challenges for both users and society at large. This study explores the factors that drive peer engagement within counseling threads, aiming to enhance our understanding of this critical phenomenon. We introduce BeCOPE, a novel behavior encoded Peer counseling dataset comprising over 10,118 posts and 58,279 comments sourced from 21 mental health-specific subreddits. The dataset is annotated using three major fine-grained behavior labels: (a) intent, (b) criticism, and (c) readability, along with the emotion labels. Our analysis indicates the prominence of "self-criticism" as the most prevalent form of criticism expressed by help-seekers, accounting for a significant 43% of interactions. Intriguingly, we observe that individuals who explicitly express their need for help are 18.01% more likely to receive assistance compared to those who present "surveys" or engage in "rants." Furthermore, we highlight the pivotal role of well-articulated problem descriptions, showing that superior readability effectively doubles the likelihood of receiving the sought-after support. Our study emphasizes the essential role of OMHCs in offering personalized guidance and unveils behavior-driven engagement patterns.

Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking

  • paper_url: http://arxiv.org/abs/2309.01606
  • repo_url: None
  • paper_authors: Yong Cao, Ruixue Ding, Boli Chen, Xianzhi Li, Min Chen, Daniel Hershcovich, Pengjun Xie, Fei Huang
  • for: This paper aims to improve the accuracy of Chinese geographic re-ranking, producing more useful results for location-related applications such as navigation maps.
  • methods: The proposed Geo-Encoder framework integrates Chinese geographical semantics into re-ranking pipelines: off-the-shelf tools first associate text with geographic spans treated as chunking units, a multi-task learning module then acquires an effective attention matrix that determines each chunk's contribution, and an asynchronous update mechanism guides the model to focus on specific chunks.
  • results: Experiments on two Chinese geographic re-ranking datasets show significant improvements over previous baselines; notably, Geo-Encoder raises the Hit@1 score of MGEO-BERT from 62.76 to 68.98 (a 6.22-point gain) on the GeoTES dataset.
    Abstract Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates, which is crucial for location-related services such as navigation maps. Unlike the general sentences, geographic contexts are closely intertwined with geographical concepts, from general spans (e.g., province) to specific spans (e.g., road). Given this feature, we propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines. Our methodology begins by employing off-the-shelf tools to associate text with geographical spans, treating them as chunking units. Then, we present a multi-task learning module to simultaneously acquire an effective attention matrix that determines chunk contributions to extra semantic representations. Furthermore, we put forth an asynchronous update mechanism for the proposed addition task, aiming to guide the model capable of effectively focusing on specific chunks. Experiments on two distinct Chinese geographic re-ranking datasets, show that the Geo-Encoder achieves significant improvements when compared to state-of-the-art baselines. Notably, it leads to a substantial improvement in the Hit@1 score of MGEO-BERT, increasing it by 6.22% from 62.76 to 68.98 on the GeoTES dataset.

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

  • paper_url: http://arxiv.org/abs/2309.01576
  • repo_url: None
  • paper_authors: Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman
  • for: This study investigates the impact of different pretrained language models (PLMs) on two text-to-speech (TTS) tasks: prosody prediction and pause prediction.
  • methods: Models built on 15 different PLMs are trained and tested on the two tasks.
  • results: There is a logarithmic relationship between model size and quality, and a significant performance gap between neutral and expressive prosody; pause prediction is less sensitive to small models; and the empirical results correlate strongly with the language models' GLUE scores.
    Abstract State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.

What are Public Concerns about ChatGPT? A Novel Self-Supervised Neural Topic Model Tells You

  • paper_url: http://arxiv.org/abs/2309.01522
  • repo_url: None
  • paper_authors: Rui Wang, Xing Liu, Yanan Wang, Haiping Huang
  • for: This study aims to mine public concerns about ChatGPT.
  • methods: A Self-Supervised neural Topic Model (SSTM) formalizes topic modeling as a representation learning procedure.
  • results: Experiments on Twitter posts about ChatGPT and queries asked by ChatGPT users show that the proposed method extracts higher-quality public concerns, with better interpretability and diversity, than state-of-the-art approaches.
    Abstract The recently released artificial intelligence conversational agent, ChatGPT, has gained significant attention in academia and real life. A multitude of early ChatGPT users eagerly explore its capabilities and share their opinions on it via social media. Both user queries and social media posts express public concerns regarding this advanced dialogue system. To mine public concerns about ChatGPT, a novel Self-Supervised neural Topic Model (SSTM), which formalizes topic modeling as a representation learning procedure, is proposed in this paper. Extensive experiments have been conducted on Twitter posts about ChatGPT and queries asked by ChatGPT users. And experimental results demonstrate that the proposed approach could extract higher quality public concerns with improved interpretability and diversity, surpassing the performance of state-of-the-art approaches.

LLM and Infrastructure as a Code use case

  • paper_url: http://arxiv.org/abs/2309.01456
  • repo_url: None
  • paper_authors: Thibault Chanus, Michael Aubertin
  • for: This paper aims to explore the use of Generative LLMs (Language Models) to generate and manage Ansible YAML roles and playbooks, with a focus on identifying potential directions and industrial applications.
  • methods: The paper employs the use of Ansible and YAML, alongside Generative LLMs, to automate systems administration tasks and translate human descriptions into code.
  • results: The paper outlines promising directions for using generative LLMs in this context, with potential gains in the efficiency and accuracy of generating and managing Ansible YAML roles and playbooks (a generate-and-validate sketch follows this entry).
    Abstract Cloud computing and the evolution of management methodologies such as Lean Management or Agile entail a profound transformation in both system construction and maintenance approaches. These practices are encompassed within the term "DevOps." This descriptive approach to an information system or application, alongside the configuration of its constituent components, has necessitated the development of descriptive languages paired with specialized engines for automating systems administration tasks. Among these, the tandem of Ansible (engine) and YAML (descriptive language) stands out as the two most prevalent tools in the market, facing notable competition mainly from Terraform. The current document presents an inquiry into a solution for generating and managing Ansible YAML roles and playbooks, utilizing Generative LLMs (Language Models) to translate human descriptions into code. Our efforts are focused on identifying plausible directions and outlining the potential industrial applications. Note: For the purpose of this experiment, we have opted against the use of Ansible Lightspeed. This is due to its reliance on an IBM Watson model, for which we have not found any publicly available references. Comprehensive information regarding this remarkable technology can be found directly on our partner RedHat's website, https://www.redhat.com/en/about/press-releases/red-hat-introduces-ansible-lightspeed-ai-driven-it-automation
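A minimal generate-and-validate sketch of the direction the paper explores: prompt an LLM for a playbook, then at least syntax-check the YAML and the basic playbook shape before anything touches a system. The prompt, the checks, and the `call_llm` stub are illustrative assumptions.

```python
import yaml  # pip install pyyaml

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a generative LLM here")

def generate_playbook(description: str) -> list:
    prompt = ("Write an Ansible playbook in YAML, with no commentary, "
              f"that does the following: {description}")
    playbook = yaml.safe_load(call_llm(prompt))  # fails fast on malformed YAML
    if not isinstance(playbook, list):
        raise ValueError("an Ansible playbook must be a YAML list of plays")
    for play in playbook:
        if "hosts" not in play:
            raise ValueError("every play needs a 'hosts' key")
    return playbook

# e.g. generate_playbook("install and start nginx on the webservers group")
```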

NumHG: A Dataset for Number-Focused Headline Generation

  • paper_url: http://arxiv.org/abs/2309.01455
  • repo_url: None
  • paper_authors: Jian-Tao Huang, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen
  • for: This study aims to improve numeral accuracy in headline generation by introducing a new dataset, NumHG, and evaluating five previous models with human judgments.
  • methods: The dataset provides fine-grained annotations over numeral-rich news articles so that encoder-decoder models can better learn, and be evaluated on, numeral generation (a toy numeral-accuracy check follows this entry).
  • results: Existing models fall short on numeral generation, particularly numerical accuracy; NumHG can help address this gap and drive further research and discussion on number-focused text generation.
    Abstract Headline generation, a key task in abstractive summarization, strives to condense a full-length article into a succinct, single line of text. Notably, while contemporary encoder-decoder models excel based on the ROUGE metric, they often falter when it comes to the precise generation of numerals in headlines. We identify the lack of datasets providing fine-grained annotations for accurate numeral generation as a major roadblock. To address this, we introduce a new dataset, the NumHG, and provide over 27,000 annotated numeral-rich news articles for detailed investigation. Further, we evaluate five well-performing models from previous headline generation tasks using human evaluation in terms of numerical accuracy, reasonableness, and readability. Our study reveals a need for improvement in numerical accuracy, demonstrating the potential of the NumHG dataset to drive progress in number-focused headline generation and stimulate further discussions in numeral-focused text generation.
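A toy version of the numeral-accuracy check NumHG motivates: extract the numbers from a generated headline and compare them against the reference. The regex and the exact-set-match criterion are simplifying assumptions, not the dataset's official metric.

```python
import re

def numerals(text: str) -> set:
    return set(re.findall(r"\d+(?:\.\d+)?", text.replace(",", "")))

def numerals_match(generated: str, reference: str) -> bool:
    return numerals(generated) == numerals(reference)

ref = "Profits rise 12% to $3.4 billion"
print(numerals_match("Profits up 12% to $3.4 billion", ref))  # True
print(numerals_match("Profits up 21% to $3.4 billion", ref))  # False
```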

Open Sesame! Universal Black Box Jailbreaking of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.01446
  • repo_url: None
  • paper_authors: Raz Lapid, Ron Langberg, Moshe Sipper
  • for: This paper shows how large language models (LLMs) can be manipulated toward unintended purposes even when model architecture and parameters are inaccessible.
  • methods: A genetic algorithm (GA) optimizes a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment and elicits unintended, potentially harmful outputs (a GA skeleton follows this entry).
  • results: Extensive experiments demonstrate the attack's efficacy, contributing a diagnostic tool for evaluating and enhancing LLM alignment with human intent; to the authors' knowledge, this is the first automated universal black-box jailbreak attack.
    Abstract Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that -- when combined with a user's query -- disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our novel approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing alignment of LLMs with human intent. To our knowledge this is the first automated universal black box jailbreak attack.
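A bare genetic-algorithm skeleton for the universal adversarial-suffix search the abstract describes. The token pool, population size, and rates are made-up settings, and the fitness function is a placeholder for black-box feedback from the attacked model.

```python
import random

TOKENS = ["describe", "ignore", "steps", "zx", "!!", "sys"]  # toy token pool

def fitness(suffix):
    raise NotImplementedError("score how far the model drifts from alignment "
                              "when the suffix is appended to held-out queries")

def evolve(pop_size=20, length=8, generations=50, mutation_rate=0.1):
    pop = [[random.choice(TOKENS) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)  # one-point crossover
            children.append([random.choice(TOKENS)
                             if random.random() < mutation_rate else t
                             for t in a[:cut] + b[cut:]])  # per-token mutation
        pop = parents + children
    return max(pop, key=fitness)
```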

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

  • paper_url: http://arxiv.org/abs/2309.02459
  • repo_url: None
  • paper_authors: Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng
  • for: To improve automatic speech recognition (ASR) performance in new domains by using text-only data for domain adaptation.
  • methods: The acoustic representation is down-sampled to match the length of the text representation, via a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with token length (a CIF sketch follows this entry).
  • results: Experiments on new-domain data show the proposed method learns better unified representations of the two modalities, improving ASR performance.
    Abstract Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech representation and text representation is inconsistent. Although the previous method up-samples the text representation to align with acoustic modality, it may not match the expected actual duration. In this paper, we proposed novel representations match strategy through down-sampling acoustic representation to align with text modality. By introducing a continuous integrate-and-fire (CIF) module generating acoustic representations consistent with token length, our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain. Experiment results of new domain data demonstrate the effectiveness of the proposed method.
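The continuous integrate-and-fire (CIF) idea can be sketched in a few lines: per-frame weights are accumulated, and a token-level vector fires each time the total crosses a threshold, down-sampling the acoustic sequence to roughly token length. This unbatched version simplifies the published CIF formulation (it drops training-time weight scaling and the sub-threshold tail).

```python
import torch

def cif(frames, alphas, threshold=1.0):
    """frames: (T, D) encoder outputs; alphas: (T,) non-negative weights."""
    fired, acc_w = [], 0.0
    acc_v = torch.zeros(frames.size(1))
    for h, a in zip(frames, alphas.tolist()):
        if acc_w + a < threshold:
            acc_w, acc_v = acc_w + a, acc_v + a * h
        else:
            part = threshold - acc_w        # spend part of this frame's weight
            fired.append(acc_v + part * h)  # emit one token-level vector
            acc_w = a - part                # remainder starts the next token
            acc_v = acc_w * h
    return torch.stack(fired) if fired else acc_v.unsqueeze(0)

out = cif(torch.randn(10, 4), torch.rand(10))
print(out.shape)  # (num_fired_tokens, 4): down-sampled from the 10 input frames
```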

SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge

  • paper_url: http://arxiv.org/abs/2309.01437
  • repo_url: None
  • paper_authors: Jiaxu Zhu, Changhe Song, Zhiyong Wu, Helen Meng
  • for: To improve speech recognition effectiveness, especially on out-of-domain and long-tailed data.
  • methods: Sememe-based semantic knowledge is introduced into the speech recognition model.
  • results: Experiments show that sememe knowledge improves recognition effectiveness, and in particular improves recognition of long-tailed data and the model's domain generalization ability.
    Abstract Recently, excellent progress has been made in speech recognition. However, pure data-driven approaches have struggled to solve the problem in domain-mismatch and long-tailed data. Considering that knowledge-driven approaches can help data-driven approaches alleviate their flaws, we introduce sememe-based semantic knowledge information to speech recognition (SememeASR). Sememe, according to the linguistic definition, is the minimum semantic unit in a language and is able to represent the implicit semantic information behind each word very well. Our experiments show that the introduction of sememe information can improve the effectiveness of speech recognition. In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data and enhance the model's domain generalization ability.

Benchmarking Large Language Models in Retrieval-Augmented Generation

  • paper_url: http://arxiv.org/abs/2309.01431
  • repo_url: https://github.com/chen700564/RGB
  • paper_authors: Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun
  • for: This paper aims to evaluate the impact of Retrieval-Augmented Generation (RAG) on large language models (LLMs) and identify potential bottlenecks in their capabilities.
  • methods: The paper uses a systematic approach to investigate the impact of RAG on LLMs, including the establishment of a new corpus (RGB) and the evaluation of 6 representative LLMs on RGB (a toy noise-robustness testbed follows this entry).
  • results: The evaluation reveals that while LLMs exhibit some degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information, indicating that there is still a considerable journey ahead to effectively apply RAG to LLMs.
    Abstract Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which make it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
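A toy version of one RGB-style testbed, noise robustness: mix relevant and noisy documents in the context and check whether the gold answer survives. The documents, prompt, substring-match rule, and `call_llm` stub are assumptions, not the benchmark's actual protocol.

```python
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the LLM under evaluation")

def noise_robustness(query, gold, positive_docs, noise_docs, noise_ratio=0.5):
    n_noise = int(len(positive_docs) * noise_ratio / (1 - noise_ratio))
    docs = positive_docs + random.sample(noise_docs, n_noise)
    random.shuffle(docs)
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt = (f"Answer using only the documents below.\n{context}\n"
              f"Question: {query}\nAnswer:")
    return gold.lower() in call_llm(prompt).lower()  # crude match rule

# Sweeping noise_ratio upward shows where a model's answers start to break down.
```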

Hateful Messages: A Conversational Data Set of Hate Speech produced by Adolescents on Discord

  • paper_url: http://arxiv.org/abs/2309.01413
  • repo_url: None
  • paper_authors: Jan Fillies, Silvio Peikert, Adrian Paschke
  • for: The paper aims to address the bias of youth language within hate speech classification and to provide a modern, anonymized hate speech youth language data set.
  • methods: The research uses a self-developed annotation schema to classify publicly available online messages from the chat platform Discord, with 6.42% of the messages classified as hate speech. The data set includes age annotations for 35,553 messages, with an average author age of under 20 years old.
  • results: The paper provides a modern and anonymized hate speech youth language data set consisting of 88,395 annotated chat messages, which can be used to improve the generalizability and performance of automated hate speech classification systems.
    Abstract With the rise of social media, a rise of hateful content can be observed. Even though the understanding and definitions of hate speech vary, platforms, communities, and legislatures all acknowledge the problem. Meanwhile, adolescents are a new and active group of social media users. The majority of adolescents experience or witness online hate speech. Research in the field of automated hate speech classification has been on the rise and focuses on aspects such as bias, generalizability, and performance. To increase generalizability and performance, it is important to understand biases within the data. This research addresses the bias of youth language within hate speech classification and contributes by providing a modern and anonymized hate speech youth language data set consisting of 88,395 annotated chat messages. The data set consists of publicly available online messages from the chat platform Discord. ~6.42% of the messages were classified by a self-developed annotation schema as hate speech. For 35,553 messages, the user profiles provided age annotations setting the average author age to under 20 years old.
    摘要

Zero-shot information extraction from radiological reports using ChatGPT

  • paper_url: http://arxiv.org/abs/2309.01398
  • repo_url: None
  • paper_authors: Danqing Hu, Bing Liu, Xiaofeng Zhu, Xudong Lu, Nan Wu
  • for: To extract useful information from radiological reports for secondary analysis.
  • methods: The large language model ChatGPT performs zero-shot information extraction, requiring no annotated data for optimizing model parameters (a prompt-and-parse sketch follows this entry).
  • results: ChatGPT extracts useful information from 847 CT reports with competitive performance on some tasks, though further improvement is needed.
    Abstract Electronic health records contain an enormous amount of valuable information, but many are recorded in free text. Information extraction is the strategy to transform the sequence of characters into structured data, which can be employed for secondary analysis. However, the traditional information extraction components, such as named entity recognition and relation extraction, require annotated data to optimize the model parameters, which has become one of the major bottlenecks in building information extraction systems. With the large language models achieving good performances on various downstream NLP tasks without parameter tuning, it becomes possible to use large language models for zero-shot information extraction. In this study, we aim to explore whether the most popular large language model, ChatGPT, can extract useful information from the radiological reports. We first design the prompt template for the interested information in the CT reports. Then, we generate the prompts by combining the prompt template with the CT reports as the inputs of ChatGPT to obtain the responses. A post-processing module is developed to transform the responses into structured extraction results. We conducted the experiments with 847 CT reports collected from Peking University Cancer Hospital. The experimental results indicate that ChatGPT can achieve competitive performances for some extraction tasks compared with the baseline information extraction system, but some limitations need to be further improved.
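A minimal sketch of the zero-shot prompt-and-parse loop the abstract describes: a template asks for the fields of interest as JSON, and a post-processing step parses the reply into structured results. The field names, template wording, and `call_llm` stub are illustrative assumptions, not the paper's prompt design.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in ChatGPT or another LLM here")

TEMPLATE = """Extract the following from the CT report and reply with JSON only:
- tumor_location (string or null)
- tumor_size_mm (number or null)
- lymph_node_involvement (true/false/null)

Report: {report}
JSON:"""

def extract(report: str) -> dict:
    raw = call_llm(TEMPLATE.format(report=report))
    start, end = raw.find("{"), raw.rfind("}") + 1  # tolerate chatty prefixes
    return json.loads(raw[start:end])

# e.g. extract("A 23 mm nodule in the right upper lobe; no enlarged lymph nodes.")
```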