cs.CL - 2023-09-13

Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs

paper_url: http://arxiv.org/abs/2309.07311
repo_url: None
paper_authors: Angelica Chen, Ravid Schwartz-Ziv, Kyunghyun Cho, Matthew L. Leavitt, Naomi Saphra
for: 本研究旨在探讨masked language model（MLM）在训练过程中的语法学习过程，以及这种过程对模型行为的影响。
methods: 本研究使用了一种case study的方法，通过分析MLM的训练过程中的 interpretable artifacts（可解释的特征）的演变，深入了解模型的emergent behavior（产生的行为）。
results: 研究发现，在训练过程中，MLM会自然地形成一种叫做Syntactic Attention Structure（语法注意结构），这种结构使得特定的Transformer heads（转换器头）对特定的语法关系进行专注。研究发现，在训练过程中有一个短暂的窗口时间，当models suddenly acquire SAS时，并且这个窗口时间与损失函数的快速下降相关。此外，SAS还促进了语言功能的后续学习。通过对SAS的 manipulate during training，研究发现SAS对语言功能的发展具有必要的作用，但同时也会与其他有利的特征和功能竞争。

Abstract
Most interpretability research in NLP focuses on understanding the behavior and features of a fully trained model. However, certain insights into model behavior may only be accessible by observing the trajectory of the training process. In this paper, we present a case study of syntax acquisition in masked language models (MLMs). Our findings demonstrate how analyzing the evolution of interpretable artifacts throughout training deepens our understanding of emergent behavior. In particular, we study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We identify a brief window in training when models abruptly acquire SAS and find that this window is concurrent with a steep drop in loss. Moreover, SAS precipitates the subsequent acquisition of linguistic capabilities. We then examine the causal role of SAS by introducing a regularizer to manipulate SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities. We further find that SAS competes with other beneficial traits and capabilities during training, and that briefly suppressing SAS can improve model quality. These findings reveal a real-world example of the relationship between disadvantageous simplicity bias and interpretable breakthrough training dynamics.

摘要
大多数NLP解释研究都是关注已经训练完成的模型的行为和特征。然而，某些模型行为的理解可能只能通过观察训练过程的趋势来获得。在这篇论文中，我们提供了一个案例研究，探讨 маSKed语言模型（MLM）的 syntax acquisition。我们的发现表明，在训练过程中分析解释性artefact的演变可以深入了解模型的emergent行为。特别是，我们研究了在 MLM 中自然出现的语法注意结构（SAS），其中特定的 transformer 头部倾向于关注特定的语法关系。我们发现，在训练过程中模型突然获得 SAS 的时间fenomenon 和损失下降峰值是一样的。此外，SAS 对语言功能的掌握产生了先驱作用。我们然后对 SAS 的 causal 作用进行了检验，并证明了 SAS 是语言功能的必需条件。我们还发现，在训练过程中 SAS 与其他有利特征和能力的竞争存在，并且短暂地压制 SAS 可以提高模型质量。这些发现表明了实际中解释性研究中的简单偏好与可观察的训练动态之间的关系。

In-Contextual Bias Suppression for Large Language Models

paper_url: http://arxiv.org/abs/2309.07251
repo_url: None
paper_authors: Daisuke Oba, Masahiro Kaneko, Danushka Bollegala
for: 降低语言模型中的性别偏见
methods: 使用文本基础模板生成 counterfactual 语句，以及使用 descriptive 语句来描述职业，以降低语言模型中的性别偏见
results: 通过这些方法，可以减少语言模型中的性别偏见，而不需要访问模型参数，并且不会影响下游任务的性能。

Abstract
Despite their impressive performance in a wide range of NLP tasks, Large Language Models (LLMs) have been reported to encode worrying-levels of gender bias. Prior work has proposed debiasing methods that require human labelled examples, data augmentation and fine-tuning of the LLMs, which are computationally costly. Moreover, one might not even have access to the internal parameters for performing debiasing such as in the case of commercially available LLMs such as GPT-4. To address this challenge we propose bias suppression, a novel alternative to debiasing that does not require access to model parameters. We show that text-based preambles, generated from manually designed templates covering counterfactual statements, can accurately suppress gender biases in LLMs. Moreover, we find that descriptive sentences for occupations can further suppress gender biases. Interestingly, we find that bias suppression has a minimal adverse effect on downstream task performance, while effectively mitigating the gender biases.

摘要
尽管大语言模型（LLMs）在各种自然语言处理任务中表现出色，但它们仍然存在害让的性别偏见。先前的工作已经提议了对偏见的修正方法，需要人工标注示例、数据扩展和模型练习，这些方法都是 computationally costly。此外，在商业可用的 LLMs 中，可能无法获取内部参数，进行修正。为解决这个挑战，我们提出了偏见抑制，一种不需要模型参数的新方法。我们表明，基于手动设计的模板，生成的文本引言可以准确地抑制 LLMs 中的性别偏见。此外，我们发现，对职业描述句的使用可以进一步抑制性别偏见。意外地，我们发现，偏见抑制并没有对下游任务性能产生明显的负面影响，同时有效地 Mitigating 性别偏见。

RAIN: Your Language Models Can Align Themselves without Finetuning

paper_url: http://arxiv.org/abs/2309.07124
repo_url: None
paper_authors: Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang
for: 本研究旨在实现AI安全，将预训语言模型（LLM）与人类偏好集成。
methods: 本研究使用了自我评估和回溯机制，将预训LLM直接调整为人类偏好。新引入的推论方法称为自回推导（RAIN），可以让预训LLM评估自己的生成结果，并使用评估结果导引后向回溯和前进生成，以确保AI安全。
results: 实验结果显示，RAIN可以将预训LLM的伤害率（HH dataset）提高至97%，保持有用性率不变。另外，在领导性攻击llm-attacks中，RAIN成功降低了攻击成功率从94%下降至19%。

Abstract
Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, the so-called finetuning step. In contrast, aligning frozen LLMs without any extra data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide backward rewind and forward generation for AI safety. Notably, RAIN operates without the need of extra data for model alignment and abstains from any training, gradient computation, or parameter updates; during the self-evaluation phase, the model receives guidance on which human preference to align with through a fixed-template prompt, eliminating the need to modify the initial prompt. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B over vanilla inference from 82% to 97%, while maintaining the helpfulness rate. Under the leading adversarial attack llm-attacks on Vicuna 33B, RAIN establishes a new defense baseline by reducing the attack success rate from 94% to 19%.

摘要
大型语言模型（LLM）经常表现出人类 предпочтения不一致。过去的研究通过强化学习或指令调整，将预训练模型与人类 предпочтеences进行对应，这被称为辐射步骤。而不需要额外数据，直接将冻结LLMs进行对应，更加吸引人。这项工作探索了后者的可能性。我们发现，通过结合自我评估和rewind机制，无需额外数据的LLMs可以通过自我推动直接生成与人类 предпочтеences相符的回答。我们提出了一种新的推理方法，即逆时间推理自动回归推理（RAIN），允许预训练LLMs在自我评估期间评估自己的生成，并使用评估结果导向后向推动和前向生成，以保障人工智能安全。值得注意的是，RAIN不需要额外数据进行模型对齐，也不需要任何训练、梯度计算或参数更新；在自我评估阶段，模型通过固定模板提示来获得需要与人类 predicates 对齐的指导。实验结果通过GPT-4和人类评价表明，RAIN可以有效地提高LLaMA 30B中的无害率，从82%提高到97%，同时保持有用率不变。在llm-attacks攻击下，RAIN成功地提出了一个新的防御基准，从94%降低到19%。

Mitigating Hallucinations and Off-target Machine Translation with Source-Contrastive and Language-Contrastive Decoding

paper_url: http://arxiv.org/abs/2309.07098
repo_url: https://github.com/zurichnlp/contradecode
paper_authors: Rico Sennrich, Jannis Vamvas, Alireza Mohammadshahi
for: 本文针对机器翻译中的幻视和目标翻译问题进行解决，特别是在低资源语言和巨大多语言模型中。
methods: 本文提出了一种 modificated decoding 目标，以避免需要重训或外部模型。具体来说，在源 contrasional decoding 中寻找一个翻译，它在正确的输入下是可能的，但是在随机输入段下是不可能的，这样幻视和目标翻译将会具有相似的可能性。在语言 contrasional decoding 中寻找一个翻译，它在正确的语言指示符token下是可能的，但是在错误的语言指示符token下是不可能的。
results: 在 M2M-100 (418M) 和 SMaLL-100 上进行实验，发现这些方法可以有效地抑制幻视和目标翻译，提高 chrF2 的分数在57个翻译方向上平均提高1.7和1.4分。在英文–德文的证明中，我们也显示了使用 Llama 2 chat 模型时可以抑制错误翻译，这证明了这些方法的应用性。我们在 GitHub 上发布了源代码，请参考 https://github.com/ZurichNLP/ContraDecode。

Abstract
Hallucinations and off-target translation remain unsolved problems in machine translation, especially for low-resource languages and massively multilingual models. In this paper, we introduce methods to mitigate both failure cases with a modified decoding objective, without requiring retraining or external models. In source-contrastive decoding, we search for a translation that is probable given the correct input, but improbable given a random input segment, hypothesising that hallucinations will be similarly probable given either. In language-contrastive decoding, we search for a translation that is probable, but improbable given the wrong language indicator token. In experiments on M2M-100 (418M) and SMaLL-100, we find that these methods effectively suppress hallucinations and off-target translations, improving chrF2 by 1.7 and 1.4 points on average across 57 tested translation directions. In a proof of concept on English--German, we also show that we can suppress off-target translations with the Llama 2 chat models, demonstrating the applicability of the method to machine translation with LLMs. We release our source code at https://github.com/ZurichNLP/ContraDecode.

摘要
幻觉和目标翻译仍然是机器翻译中的未解决问题，特别是对于low-resource语言和大量多语言模型。在这篇论文中，我们介绍了一些方法来减少这两种失败情况，无需重新训练或外部模型。在源相似搜索中，我们寻找一个翻译，可以在正确的输入下被证明，但是在随机输入段下被证明是不可能的，我们假设幻觉将会与随机输入段一样可能。在语言相似搜索中，我们寻找一个翻译，可以在正确的语言指示符下被证明，但是在错误的语言指示符下被证明是不可能的。在M2M-100（418M）和SMaLL-100上进行了实验，我们发现这些方法可以有效地降低幻觉和目标翻译，提高chrF2的平均分数为1.7和1.4点。在一个证明性的英文-德语实验中，我们也示出了使用LLMs（大量多语言模型）可以降低幻觉翻译。我们在 GitHub 上发布了我们的源代码，请参考https://github.com/ZurichNLP/ContraDecode。

Can Whisper perform speech-based in-context learning

paper_url: http://arxiv.org/abs/2309.07081
repo_url: None
paper_authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang
for: 这个论文 investigate OpenAI Whisper 自动语音识别（ASR）模型的上下文学习能力。
methods: 提出了一种基于语音的上下文学习（SICL）方法，可以在测试时进行适应，并且只需要一小部分标注的语音样本，不需要梯度下降。
results: 对中文方言进行语言级别适应实验，结果显示可以在隔离单词ASR中实现了 considable 的相对WRR（词错率）减少，最高达36.4%。

Abstract
This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chinese dialects showed that when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved using Whisper models of any size on two dialects, which is on average 32.3%. A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which can increase the average relative WER reduction to 36.4%. The findings are verified using speaker adaptation or continuous speech recognition tasks, and both achieved considerable relative WER reductions. Detailed quantitative analyses are also provided to shed light on SICL's adaptability to phonological variances and dialect-specific lexical nuances.

摘要

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

paper_url: http://arxiv.org/abs/2309.07045
repo_url: https://github.com/thu-coai/safetybench
paper_authors: Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang
for: The paper is written to evaluate the safety of Large Language Models (LLMs) and to provide a comprehensive benchmark for assessing their safety.
methods: The paper presents SafetyBench, a benchmark that includes 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns, both in Chinese and English.
results: The paper reports the results of extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings, showing a substantial performance advantage for GPT-4 over its counterparts, and highlighting the need for further improvement in the safety of current LLMs.Here’s the information in Simplified Chinese text:
for: 这篇论文是为了评估大语言模型（LLMs）的安全性而写的。
methods: 论文提出了 SafetyBench，一个包含11,435个多选问题，涵盖7种安全问题类型的完整的标准套件。
results: 论文公布了25种中文和英文 LLMS 在零shot和几 shot设置下的广泛测试结果，显示 GPT-4 在其他模型中表现出优异的性能，并且现有 LLMS 的安全性还有很大的改进空间。

Abstract
With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We believe SafetyBench will enable fast and comprehensive evaluation of LLMs' safety, and foster the development of safer LLMs. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://llmbench.ai/safety.

摘要
随着大型语言模型（LLMs）的快速发展，关注其安全问题的注意力也在不断增加。因此，评估LLMs的安全性成为了推广其应用的重要任务。然而，没有全面的安全评估标准，使得评估和改进LLMs的安全性受到了很大的阻碍。在这篇文章中，我们提出了一个名为SafetyBench的全面的安全评估标准，它包含11,435个多选问题，涵盖了7种安全问题的多样化分类。另外，SafetyBench还包含了中文和英文数据，以便在两种语言下进行评估。我们对25个流行的中文和英文LLMs进行了zero-shot和few-shot测试，发现GPT-4在这些测试中表现出了显著的性能优势，而目前的LLMs仍有很大的改进空间。我们认为SafetyBench将为LLMs的安全评估提供快速和全面的评估机制，并促进更安全的LLMs的发展。数据和评估指南可以在https://github.com/thu-coai/SafetyBench中找到，评估入口和排名可以在https://llmbench.ai/safety中找到。

Beyond original Research Articles Categorization via NLP

paper_url: http://arxiv.org/abs/2309.07020
repo_url: https://github.com/rturrisige/textclassification
paper_authors: Rosanna Turrisi
for: 这个研究是为了提出一种新的文本分类方法，用于科学文献中未知类别的识别，使用自然语言处理技术。
methods: 研究使用了预训练语言模型SciBERT，从ArXiv数据集中提取了抽象的有意义表示。文本分类使用K-Means算法，并根据Silhouette分数确定最佳分数。
results: 结果表明，提出的方法可以更有效地捕捉摘要中的主题信息，与传统的arXiv标签系统相比，导致改进的文本分类。该方法可能用于更好地导航和推荐科研论文。

Abstract
This work proposes a novel approach to text categorization -- for unknown categories -- in the context of scientific literature, using Natural Language Processing techniques. The study leverages the power of pre-trained language models, specifically SciBERT, to extract meaningful representations of abstracts from the ArXiv dataset. Text categorization is performed using the K-Means algorithm, and the optimal number of clusters is determined based on the Silhouette score. The results demonstrate that the proposed approach captures subject information more effectively than the traditional arXiv labeling system, leading to improved text categorization. The approach offers potential for better navigation and recommendation systems in the rapidly growing landscape of scientific research literature.

摘要
这个研究提出了一种新的文本分类方法，用于未知类别的科学文献中，使用自然语言处理技术。研究利用了预训练语言模型，具体来说是 SciBERT，来提取抽象文本中意义的表示。文本分类使用K-Means算法，并根据Silhouette分数确定最佳分区数。结果表明，提出的方法能够更有效地捕捉报告中的主题信息，从而改善文本分类。该方法可能为科学研究文献领域中的浏览和推荐系统带来改善。

OYXOY: A Modern NLP Test Suite for Modern Greek

paper_url: http://arxiv.org/abs/2309.07009
repo_url: None
paper_authors: Konstantinos Kogkalidis, Stergios Chatzikyriakidis, Eirini Chrysovalantou Giannikouri, Vassiliki Katsouli, Christina Klironomou, Christina Koula, Dimitris Papadakis, Thelka Pasparaki, Erofili Psaltaki, Efthymia Sakellariou, Hara Soupiona
for: 这paper的目的是为希腊自然语言处理（Greek NLP）领域开发一个语言学原则驱动的技术相关评估集。
methods: 这paper使用了两个创新，即在推理任务中标记所有可能的推理标签，以及使用ChatGPT作为语言中立解析器将字典式现代希腊语转换成结构化格式，从而生成其他三个任务。
results: 这paper的实验结果表明了这些任务的挑战性，并证明了希腊NPLEcosystem需要迅速进步，以与当代主流研究保持 pace。

Abstract
This paper serves as a foundational step towards the development of a linguistically motivated and technically relevant evaluation suite for Greek NLP. We initiate this endeavor by introducing four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation (through example comparison or sense selection) and metaphor detection. More than language-adapted replicas of existing tasks, we contribute two innovations which will resonate with the broader resource and evaluation community. Firstly, our inference dataset is the first of its kind, marking not just \textit{one}, but rather \textit{all} possible inference labels, accounting for possible shifts due to e.g. ambiguity or polysemy. Secondly, we demonstrate a cost-efficient method to obtain datasets for under-resourced languages. Using ChatGPT as a language-neutral parser, we transform the Dictionary of Standard Modern Greek into a structured format, from which we derive the other three tasks through simple projections. Alongside each task, we conduct experiments using currently available state of the art machinery. Our experimental baselines affirm the challenging nature of our tasks and highlight the need for expedited progress in order for the Greek NLP ecosystem to keep pace with contemporary mainstream research.

摘要
Firstly, our inference dataset is the first of its kind, marking all possible inference labels, taking into account possible shifts due to ambiguity or polysemy. Secondly, we demonstrate a cost-efficient method to obtain datasets for under-resourced languages. We use ChatGPT as a language-neutral parser to transform the Dictionary of Standard Modern Greek into a structured format, from which we derive the other three tasks through simple projections.We conduct experiments using currently available state-of-the-art machinery alongside each task. Our experimental baselines show that the tasks are challenging and highlight the need for expedited progress in the Greek NLP ecosystem to keep pace with contemporary mainstream research.

Unsupervised Contrast-Consistent Ranking with Language Models

paper_url: http://arxiv.org/abs/2309.06991
repo_url: None
paper_authors: Niklas Stoehr, Pengxiang Cheng, Jing Wang, Daniel Preotiuc-Pietro, Rajarshi Bhowmik
for: 本研究旨在探讨语言模型中的排序知识，并研究一种基于冲突相关搜寻的排序技术。
methods: 本研究使用了对比稳定搜寻（CCS）方法，并对现有排序方法进行了修改，以使其更适合语言模型的排序任务。
results: 研究发现，使用CCR搜寻技术可以比使用提问技术更好地推理语言模型中的排序知识，并且可以与更大的语言模型进行比较。

Abstract
Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. For instance, they may have parametric knowledge about the ordering of countries by size or may be able to rank reviews by sentiment. Recent work focuses on pairwise, pointwise, and listwise prompting techniques to elicit a language model's ranking knowledge. However, we find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce. This motivates us to explore an alternative approach that is inspired by an unsupervised probing method called Contrast-Consistent Search (CCS). The idea is to train a probing model guided by a logical constraint: a model's representation of a statement and its negation must be mapped to contrastive true-false poles consistently across multiple statements. We hypothesize that similar constraints apply to ranking tasks where all items are related via consistent pairwise or listwise comparisons. To this end, we extend the binary CCS method to Contrast-Consistent Ranking (CCR) by adapting existing ranking methods such as the Max-Margin Loss, Triplet Loss, and Ordinal Regression objective. Our results confirm that, for the same language model, CCR probing outperforms prompting and even performs on a par with prompting much larger language models.

摘要
language models 包含排名知识，是解决 context 排名任务的强大解决方案。例如，它们可能具有国家大小的参数知识或可以根据 sentiment 排序评论。 current work 注重 pairwise、pointwise 和 listwise 提问技术来引出语言模型的排名知识。然而，我们发现，即使通过精心调整和限定解码，提问技术可能不一定是自我一致的。这种情况 motivates us 寻找一种不同的方法，它是基于无监督探测方法 called Contrast-Consistent Search (CCS)。这个想法是训练一个探测模型，使其的表示符和其否则都被映射到冲突真假极点一致的多个声明中。我们假设这种限制也适用于排名任务，其中所有项都是通过一致的多个比较关系联系在一起。为此，我们将 binary CCS 方法扩展到 Contrast-Consistent Ranking (CCR)，并采用现有的排名方法，如 Max-Margin Loss、Triplet Loss 和 Ordinal Regression 目标。我们的结果表明，使用 CCR probing，Language model 的表达能力相对较强，甚至可以与更大的语言模型相比。

Remote Inference of Cognitive Scores in ALS Patients Using a Picture Description

paper_url: http://arxiv.org/abs/2309.06989
repo_url: None
paper_authors: Carla Agurto, Guillermo Cecchi, Bo Wen, Ernest Fraenkel, James Berry, Indu Navar, Raquel Norel
For: The paper focuses on detecting cognitive impairment in individuals with Amyotrophic Lateral Sclerosis (ALS) using a digital version of the Edinburgh Cognitive and Behavioral ALS Screen (ECAS) test.* Methods: The study uses a remote testing approach where participants (ALS and non-ALS) describe a pool of pictures with complex scenes on their computer at home. The study extracts linguistic and acoustic features from the speech samples and inputs them into linear regression models to predict ECAS sub-scores and the total score.* Results: The study finds that speech samples from the picture description are reliable enough to predict the ECAS subs-scores, achieving statistically significant Spearman correlation values between 0.32 and 0.51 for the model’s performance using 10-fold cross-validation.Here is the information in Simplified Chinese text:* For: 这个研究旨在检测阿尔茨海默病患者的认知障碍。* Methods: 研究使用了远程测试方法，参与者（阿尔茨海默病患者和非阿尔茨海默病患者）在家中用计算机描述复杂场景的图片。研究提取了语言和听音特征，并将其输入到线性回归模型中预测ECAS子分数和总分数。* Results: 研究发现，图片描述的speech样本可靠地预测ECAS子分数，实现了 statistically significant的Spearman相关值（0.32-0.51）using 10-fold cross-validation。

Abstract
Amyotrophic lateral sclerosis is a fatal disease that not only affects movement, speech, and breath but also cognition. Recent studies have focused on the use of language analysis techniques to detect ALS and infer scales for monitoring functional progression. In this paper, we focused on another important aspect, cognitive impairment, which affects 35-50% of the ALS population. In an effort to reach the ALS population, which frequently exhibits mobility limitations, we implemented the digital version of the Edinburgh Cognitive and Behavioral ALS Screen (ECAS) test for the first time. This test which is designed to measure cognitive impairment was remotely performed by 56 participants from the EverythingALS Speech Study. As part of the study, participants (ALS and non-ALS) were asked to describe weekly one picture from a pool of many pictures with complex scenes displayed on their computer at home. We analyze the descriptions performed within +/- 60 days from the day the ECAS test was administered and extract different types of linguistic and acoustic features. We input those features into linear regression models to infer 5 ECAS sub-scores and the total score. Speech samples from the picture description are reliable enough to predict the ECAS subs-scores, achieving statistically significant Spearman correlation values between 0.32 and 0.51 for the model's performance using 10-fold cross-validation.

摘要
在这个研究中，参与者（ALS和非ALS）被要求在家中电脑上显示出复杂的场景中的一幅照片，然后描述这幅照片。我们分析了这些描述的语言和声音特征，并将其输入到线性回授模型中，以预测ECAS子Score和总分。我们发现，这些语言和声音特征可以预测ECAS子Score的表现，得到了 statistically significant 的Spearman相关值（0.32-0.51），使用10-fold横推验。

Auto-Regressive Next-Token Predictors are Universal Learners

paper_url: http://arxiv.org/abs/2309.06979
repo_url: None
paper_authors: Eran Malach
for: 这个论文主要研究了自然语言处理领域中的语言模型，具体来说是通过预测下一个字符来实现语言模型的逻辑和数学逻辑能力。
methods: 该论文使用了自然语言处理中常见的预测下一个字符的方法，并通过分析这些方法的复杂性来理解语言模型的能力。
results: 实验结果表明，使用了线性预测和浅层多层感知机（MLP）等简单方法可以实现非rivial的文本生成和数学任务，而这些方法都是通过预测下一个字符来训练的。这些结果表明，自然语言处理领域中语言模型的能力主要归功于自动预测下一个字符的训练方法，而不是特定的架构选择。

Abstract
Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.

摘要
大型语言模型显示了惊人的逻辑和数学逻辑能力，允许它们解决复杂任务。有趣的是，这些能力来自于在简单任务上训练的下一个元素预测器。在这项工作中，我们提出了一个理论框架，用于研究自动递归下一个元素预测器。我们示出了，即使是简单的模型，如线性下一个元素预测器，也可以高效地计算任何由图灵机计算的函数。我们引入了一个新的复杂度度量——链接复杂度，用于度量在链式思维（CoT）序列中需要用于逼近某个目标函数的中间元素数量，并分析了这种复杂度与其他复杂度之间的交互。最后，我们通过实验表明，简单的下一个元素预测器，如线性网络和浅层多层感知机（MLP），在文本生成和数学任务上表现出非常有趣的表现。我们的结果表明，语言模型的力量可以归结到自动递归下一个元素训练方案，而不是特定的架构选择。

Dynamic Causal Disentanglement Model for Dialogue Emotion Detection

paper_url: http://arxiv.org/abs/2309.06928
repo_url: None
paper_authors: Yuting Su, Yichen Wei, Weizhi Nie, Sicheng Zhao, Anan Liu
for: 本研究旨在提高对对话中的情感识别精度，通过将各种隐藏变量分离并分析对话内容的时间accumulation。
methods: 本研究提出了一种基于隐藏变量分离的动态 causal disentanglement模型，利用causal directed acyclic graph (DAG)建立隐藏情感信息与其他观察到的元素之间的相关性。
results: 在对话情感识别任务中，该模型的实验结果表明其比传统方法更高的精度，能够更好地识别对话中的情感变化。

Abstract
Emotion detection is a critical technology extensively employed in diverse fields. While the incorporation of commonsense knowledge has proven beneficial for existing emotion detection methods, dialogue-based emotion detection encounters numerous difficulties and challenges due to human agency and the variability of dialogue content.In dialogues, human emotions tend to accumulate in bursts. However, they are often implicitly expressed. This implies that many genuine emotions remain concealed within a plethora of unrelated words and dialogues.In this paper, we propose a Dynamic Causal Disentanglement Model based on hidden variable separation, which is founded on the separation of hidden variables. This model effectively decomposes the content of dialogues and investigates the temporal accumulation of emotions, thereby enabling more precise emotion recognition. First, we introduce a novel Causal Directed Acyclic Graph (DAG) to establish the correlation between hidden emotional information and other observed elements. Subsequently, our approach utilizes pre-extracted personal attributes and utterance topics as guiding factors for the distribution of hidden variables, aiming to separate irrelevant ones. Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables, enabling the accumulation of emotion-related information throughout the conversation. To guide this disentanglement process, we leverage the ChatGPT-4.0 and LSTM networks to extract utterance topics and personal attributes as observed information.Finally, we test our approach on two popular datasets in dialogue emotion detection and relevant experimental results verified the model's superiority.

摘要
互动感知是一种 kritical technology 在多个领域中广泛应用。然而，对话式感知探测遇到了许多问题和挑战，主要是因为人类自由和对话内容的变化。在对话中，人类情感倾向于累累发展，但它们通常是隐藏的，这意味着许多真正的情感还未被发现。在这篇文章中，我们提出了一个基于隐藏变量分离的动态 causal disentanglement 模型，可以对对话内容进行分解和探索，从而提高感知识别精度。首先，我们提出了一个新的 causal directed acyclic graph (DAG)，用于建立隐藏情感信息和其他观察到的元素之间的相互作用。接着，我们的方法利用预先提取的人际特征和话题的对话概率，将隐藏变量分离为无关的一部分。特别是，我们提出了一个动态时间分离模型，用于推测对话中的变量和隐藏变量的传播，从而在对话中累累发展情感相关信息。为了导引这个分离过程，我们利用 ChatGPT-4.0 和 LSTM 网络提取对话话题和人际特征，以观察到的信息导引分离。总之，我们对两个广泛使用的对话感知探测 datasets 进行了评估，结果显示了我们的方法的优越性。

Native Language Identification with Big Bird Embeddings

paper_url: http://arxiv.org/abs/2309.06923
repo_url: https://github.com/sergeykramp/mthesis-bigbird-embeddings
paper_authors: Sergey Kramp, Giovanni Cassani, Chris Emmery
for: 本研究旨在 investigate whether input size is a limiting factor in Native Language Identification (NLI) tasks, and to provide effective and practical alternatives to traditional feature engineering methods.
methods: 本研究使用 Big Bird embeddings 来训练分类器，并 compare 其表现与传统的语言特征工程方法。
results: 研究发现，使用 Big Bird embeddings 可以大幅提高 NLI 模型的表现，并且可以在各种输入长度下保持稳定的性能。

Abstract
Native Language Identification (NLI) intends to classify an author's native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and transformer-based NLI models have thus far failed to offer effective, practical alternatives. The current work investigates if input size is a limiting factor, and shows that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.

摘要
native 语言识别 (NLI) 目标是根据作者写作的语言来分类作者的本土语言。历史上，这个任务曾经依赖于时间消耗的语言特征工程，而 transformer 基本的 NLI 模型未能提供有效、实用的替代方案。当前的研究发现输入大小是限制因素，并表明使用 Big Bird 嵌入 trained 分类器在 Reddit-L2 数据集上表现远胜语言特征工程模型。此外，我们还提供了输入长度的依赖关系和离线性能的分析，以及嵌入空间的Qualitative 分析。由于这种方法的效iveness 和计算效率，我们认为它将成为未来 NLI 工作的有力道路。

Scaled Prompt-Tuning for Few-Shot Natural Language Generation

paper_url: http://arxiv.org/abs/2309.06759
repo_url: None
paper_authors: Ting Hu, Christoph Meinel, Haojin Yang
for: 这个研究是为了提出 Parameter-Efficient Fine-Tuning (PEFT) 方法，以减少 fine-tuning 大语言模型 (LLMs) 的内存占用和训练成本，并在几次示例中维持或以上性能。
methods: 本研究提出了 Scaled Prompt-Tuning (SPT) 方法，它在几次示例中超过传统 Prompt-Tuning (PT) 方法的性能和数据转移能力，并未增加训练成本。此外，研究还进行了现有 PEFT 方法的比较，发现一些方法在几次示例中表现不佳，特别是在具有挑战性的数据集上。
results: 研究发现，SPT 方法在几次示例中可以维持或以上性能，并且在数据转移时具有更好的数据转移能力。此外，研究还发现了一些现有 PEFT 方法在几次示例中表现不佳，特别是在具有挑战性的数据集上。

Abstract
The increasingly Large Language Models (LLMs) demonstrate stronger language understanding and generation capabilities, while the memory demand and computation cost of fine-tuning LLMs on downstream tasks are non-negligible. Besides, fine-tuning generally requires a certain amount of data from individual tasks whilst data collection cost is another issue to consider in real-world applications. In this work, we focus on Parameter-Efficient Fine-Tuning (PEFT) methods for few-shot Natural Language Generation (NLG), which freeze most parameters in LLMs and tune a small subset of parameters in few-shot cases so that memory footprint, training cost, and labeling cost are reduced while maintaining or even improving the performance. We propose a Scaled Prompt-Tuning (SPT) method which surpasses conventional PT with better performance and generalization ability but without an obvious increase in training cost. Further study on intermediate SPT suggests the superior transferability of SPT in few-shot scenarios, providing a recipe for data-deficient and computation-limited circumstances. Moreover, a comprehensive comparison of existing PEFT methods reveals that certain approaches exhibiting decent performance with modest training cost such as Prefix-Tuning in prior study could struggle in few-shot NLG tasks, especially on challenging datasets.

摘要
LLMs 的增长强大语言理解和生成能力，然而 fine-tuning LLMs 在下游任务中的内存需求和计算成本是非常重要的。此外， fine-tuning 通常需要具体任务的数据量，而数据收集成本是实际应用中的一个问题。在这项工作中，我们关注 Parameter-Efficient Fine-Tuning (PEFT) 方法，以便在几架 Natural Language Generation (NLG) 中减少内存占用量、训练成本和标签成本，保持或者提高性能。我们提出了一种缩放提 prompt-tuning (SPT) 方法，超过了传统的 PT 方法，而无需明显增加训练成本。进一步的研究表明，SPT 在几架 scenarios 中具有更好的转移性，提供了数据缺乏和计算有限的情况下的热门策略。此外，现有的 PEFT 方法的比较表明，某些方法在几架 NLG 任务中，特别是在复杂的 dataset 上，可能会表现出差。

CONVERSER: Few-Shot Conversational Dense Retrieval with Synthetic Data Generation

paper_url: http://arxiv.org/abs/2309.06748
repo_url: https://github.com/miulab/converser
paper_authors: Chao-Wei Huang, Chen-Yu Hsu, Tsu-Yuan Hsu, Chen-An Li, Yun-Nung Chen
for: 本研究旨在提出一种基于对话的密集检索框架，以便在几个示例对话中训练密集检索模型。
methods: 我们利用大语言模型的在场学习能力，将对话查询文本生成自检索集。
results: 我们的提议方法在对话检索评测中表现与完全监督模型相当，表明我们的方法可以在几个示例对话中实现可靠的密集检索。所有代码和生成的数据集可以在 GitHub 上找到。

Abstract
Conversational search provides a natural interface for information retrieval (IR). Recent approaches have demonstrated promising results in applying dense retrieval to conversational IR. However, training dense retrievers requires large amounts of in-domain paired data. This hinders the development of conversational dense retrievers, as abundant in-domain conversations are expensive to collect. In this paper, we propose CONVERSER, a framework for training conversational dense retrievers with at most 6 examples of in-domain dialogues. Specifically, we utilize the in-context learning capability of large language models to generate conversational queries given a passage in the retrieval corpus. Experimental results on conversational retrieval benchmarks OR-QuAC and TREC CAsT 19 show that the proposed CONVERSER achieves comparable performance to fully-supervised models, demonstrating the effectiveness of our proposed framework in few-shot conversational dense retrieval. All source code and generated datasets are available at https://github.com/MiuLab/CONVERSER

摘要
对话搜索提供了一种自然的搜索界面，最近的方法已经在对话搜索中应用了稠密搜索。然而，训练稠密搜索者需要大量的领域内对话数据，这限制了对话稠密搜索的发展，因为充足的领域内对话是昂贵的收集的。在这篇论文中，我们提出了一个名为CONVERSER的框架，用于在最多6个领域对话中训练对话稠密搜索者。我们利用了大型自然语言模型的上下文学习能力，将领域内对话中的段落作为输入，生成对话查询。实验结果表明，我们的CONVERSER可以与完全监督模型相比，在少量对话稠密搜索中实现相似的性能。所有源代码和生成的数据集可以在https://github.com/MiuLab/CONVERSER上下载。

Simultaneous Machine Translation with Large Language Models

paper_url: http://arxiv.org/abs/2309.06706
repo_url: None
paper_authors: Minghan Wang, Jinming Zhao, Thuy-Trang Vu, Fatemeh Shiri, Ehsan Shareghi, Gholamreza Haffari
for: 本研究探讨了大语言模型（LLM）在同时翻译（SimulMT）任务中的可行性。
methods: 我们基于传统方法，提出了一种简单 yet effective的混合策略，使得 LLM 可以在 SimulMT 中参与无需额外训练。我们还进行了 Supervised Fine-Tuning（SFT），并在混合全句和前缀句上进行了训练。
results: 我们在使用 Llama2-7B-chat 进行 nine 种语言对的实验中发现，LLM 可以 дости到与专门的 SimulMT 模型相同的翻译质量和延迟。

Abstract
Large language models (LLM) have demonstrated their abilities to solve various natural language processing tasks through dialogue-based interactions. For instance, research indicates that LLMs can achieve competitive performance in offline machine translation tasks for high-resource languages. However, applying LLMs to simultaneous machine translation (SimulMT) poses many challenges, including issues related to the training-inference mismatch arising from different decoding patterns. In this paper, we explore the feasibility of utilizing LLMs for SimulMT. Building upon conventional approaches, we introduce a simple yet effective mixture policy that enables LLMs to engage in SimulMT without requiring additional training. Furthermore, after Supervised Fine-Tuning (SFT) on a mixture of full and prefix sentences, the model exhibits significant performance improvements. Our experiments, conducted with Llama2-7B-chat on nine language pairs from the MUST-C dataset, demonstrate that LLM can achieve translation quality and latency comparable to dedicated SimulMT models.

摘要
大型自然语言处理模型（LLM）已经在对话基于的交互中展示了各种自然语言处理任务的能力。例如，研究表明，LLM可以在高资源语言的机器翻译任务中达到竞争性的性能。然而，将LLM应用于同时机器翻译（SimulMT）存在许多挑战，包括各种训练-推理匹配问题。在这篇论文中，我们探讨了LLM在SimulMT中的可行性。基于传统方法，我们介绍了一种简单 yet有效的混合策略，使得LLM可以不需要额外训练参与SimulMT。此外，通过精心精度训练（SFT）在混合全文和前缀句子上，模型表现出了显著的性能改善。我们的实验，使用Llama2-7B-chat在MUST-C数据集上进行了九种语言对的实验，显示LLM可以达到与专门的SimulMT模型相同的翻译质量和延迟。

VLSlice: Interactive Vision-and-Language Slice Discovery

paper_url: http://arxiv.org/abs/2309.06703
repo_url: https://github.com/slymane/vlslice
paper_authors: Eric Slyman, Minsuk Kahng, Stefan Lee
for: 本研究旨在开发一种可互动的系统，以帮助用户发现 Representation-level 下的凝合性vision-and-language slice，从未标注的图像集中。
methods: 本研究使用大规模预训练，以学习可转移的模型，并通过用户指导的方式，自动发现vision-and-language slice。
results: 在用户研究中（n=22），VLSlice能够快速生成多样性凝合性vision-and-language slice，并且发布了这个工具给公众。

Abstract
Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that are efficiently transferable to downstream tasks. While this may improve dataset-scale aggregate metrics, analyzing performance around hand-crafted subgroups targeting specific bias dimensions reveals systemic undesirable behaviors. However, this subgroup analysis is frequently stalled by annotation efforts, which require extensive time and resources to collect the necessary data. Prior art attempts to automatically discover subgroups to circumvent these constraints but typically leverages model behavior on existing task-specific annotations and rapidly degrades on more complex inputs beyond "tabular" data, none of which study vision-and-language models. This paper presents VLSlice, an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted as vision-and-language slices, from unlabeled image sets. We show that VLSlice enables users to quickly generate diverse high-coherency slices in a user study (n=22) and release the tool publicly.

摘要
This paper introduces VLSlice, an interactive system that enables user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, or vision-and-language slices, from unlabeled image sets. Our user study (n=22) shows that VLSlice can quickly generate diverse high-coherency slices, and we have released the tool publicly.

Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish

paper_url: http://arxiv.org/abs/2309.06698
repo_url: https://github.com/gglab-ku/turkish-plu
paper_authors: Arda Uzunoğlu, Gözde Gül Şahin
For: The paper is written for the field of natural language processing, with a focus on procedural natural language understanding (PLU) and its applications in Turkish.* Methods: The paper uses automated translation tools to expand the number of Turkish tutorials on wikiHow, and implements strong baseline models for PLU tasks such as linking actions, goal inference, and summarization using fine-tuned language-specific and multilingual models.* Results: The paper finds that language-specific models consistently outperform their multilingual models by a significant margin across most PLU tasks, and releases the corpus, downstream tasks, and baseline models for future research.

Abstract
Understanding procedural natural language (e.g., step-by-step instructions) is a crucial step to execution and planning. However, while there are ample corpora and downstream tasks available in English, the field lacks such resources for most languages. To address this gap, we conduct a case study on Turkish procedural texts. We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools, where the translation quality and loyalty to the original meaning are validated by a team of experts on a random set. Then, we generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization. To tackle these tasks, we implement strong baseline models via fine-tuning large language-specific models such as TR-BART and BERTurk, as well as multilingual models such as mBART, mT5, and XLM. We find that language-specific models consistently outperform their multilingual models by a significant margin across most procedural language understanding (PLU) tasks. We release our corpus, downstream tasks and the baseline models with https://github.com/ GGLAB-KU/turkish-plu.

摘要
理解进程自然语言（例如步骤说明）是执行和规划的关键步骤。然而，当前大多数语言的相关资源却缺乏。为了填补这一空白，我们进行了关于土耳其进程文本的 caso study。我们首先使用自动翻译工具将土耳其wikiHow的教程从2,000个增加到52,000个，并验证翻译质量和原始意义的忠诚性。然后，我们生成了多个下游任务，例如行为连接、目标推理和概要。为了解决这些任务，我们实施了强大的基线模型，包括TR-BART、BERTurk和多语言模型如mBART、mT5和XLM。我们发现语言特定模型在大多数进程自然语言理解（PLU）任务上一般性能较高，与多语言模型相比。我们将我们的资料库、下游任务和基线模型发布在https://github.com/GGLAB-KU/turkish-plu上。

Statistical Rejection Sampling Improves Preference Optimization

paper_url: http://arxiv.org/abs/2309.06657
repo_url: None
paper_authors: Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu
for: 提高语言模型与人类偏好的对齐，以提高语言模型的性能。
methods: 使用 Reinforcement Learning from Human Feedback (RLHF) 和 Offline Methods such as Sequence Likelihood Calibration (SLiC) 和 Direct Preference Optimization (DPO)，以提高稳定性和扩展性，同时保持竞争性。
results: RSO 可以准确地估计目标优质策略，并在三种不同任务上进行了广泛的实验，证明了 RSO 可以在 LLM 和人类评分者的评价下表现更好。

Abstract
Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal Policy Optimization (PPO). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. DPO's lack of a reward model constrains its ability to sample preference pairs from the optimal policy, and SLiC is restricted to sampling preference pairs only from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) that aims to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across three diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO on evaluations from both Large Language Model (LLM) and human raters.

摘要
改善语言模型与人类偏好的对应仍然是当前研究挑战。先前的方法主要使用人类反馈学习（RLHF）的在线RL方法，如距离优化策略（PPO）。在最近，offline方法，如序列可信度调整（SLiC）和直接偏好优化（DPO），作为有力的代替方案而出现。SLiC通过使用监督练好的（SFT）政策来细化损失函数，而DPO直接基于偏好数据来优化语言模型，不需要分离的奖励模型。然而，MLE目标优化政策的最大可能性需要从该政策中标注的偏好对。DPO缺乏奖励模型，因此无法从优化政策中采样偏好对，而SLiC只能从SFT政策中采样偏好对。为解决这些限制，我们提出了一种新的方法 called Statistical Rejection Sampling Optimization（RSO），它通过拒绝采样来源优化政策，以获得更加准确的优化策略。我们还提出了一种统一框架，它可以从偏好模型角度来增强SLiC和DPO的损失函数。通过对三个多样化任务进行广泛的实验，我们证明了RSO在LLM和人类评分者的评价中一直表现优于SLiC和DPO。