cs.CL - 2023-11-15

Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries

  • paper_url: http://arxiv.org/abs/2311.09458
  • repo_url: None
  • paper_authors: Prafulla Kumar Choubey, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
  • for: To propose a fine-grained evaluation protocol for summarization models that determines their competence in generalizing to novel summary-worthy content.
  • methods: The authors partition a test set by the lexical similarity of reference test summaries to training summaries, and observe a significant difference in ROUGE-2 and entity-recall scores between the subsets with the lowest and highest similarity.
  • results: Limiting lexical repetitions in training summaries during both supervised fine-tuning and likelihood calibration improves performance on novel test cases while retaining average performance; automatic and human evaluations on novel test subsets and recent news articles confirm that this prevents rote learning and improves generalization.
    Abstract Ideal summarization models should generalize to novel summary-worthy content without remembering reference training summaries by rote. However, a single average performance score on the entire test set is inadequate in determining such model competencies. We propose a fine-grained evaluation protocol by partitioning a test set based on the lexical similarity of reference test summaries with training summaries. We observe up to a 5x (1.2x) difference in ROUGE-2 (entity recall) scores between the subsets with the lowest and highest similarity. Next, we show that such training repetitions also make a model vulnerable to rote learning, reproducing data artifacts such as factual errors, especially when reference test summaries are lexically close to training summaries. Consequently, we propose to limit lexical repetitions in training summaries during both supervised fine-tuning and likelihood calibration stages to improve the performance on novel test cases while retaining average performance. Our automatic and human evaluations on novel test subsets and recent news articles show that limiting lexical repetitions in training summaries can prevent rote learning and improve generalization.
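The partitioning protocol is straightforward to prototype. Below is a minimal Python sketch (ours, not the authors' code) that buckets test references by their highest bigram-overlap F1 (a ROUGE-2-style score) against any training summary; the tokenization and bucketing details are illustrative assumptions.

```python
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def max_train_overlap(test_summary, train_summaries):
    """Highest bigram-overlap F1 between one test reference and any training summary."""
    test_bg = bigrams(test_summary)
    best = 0.0
    for train in train_summaries:
        train_bg = bigrams(train)
        common = sum((test_bg & train_bg).values())  # clipped bigram matches
        denom = sum(test_bg.values()) + sum(train_bg.values())
        if denom:
            best = max(best, 2 * common / denom)
    return best

def partition_by_similarity(test_refs, train_summaries, n_buckets=4):
    """Sort test references by training-set similarity and split into buckets."""
    scored = sorted(test_refs, key=lambda s: max_train_overlap(s, train_summaries))
    size = max(1, len(scored) // n_buckets)
    buckets = [scored[i * size:(i + 1) * size] for i in range(n_buckets - 1)]
    buckets.append(scored[(n_buckets - 1) * size:])
    return buckets
```

Model scores can then be reported per bucket rather than as a single test-set average.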

Subtle Misogyny Detection and Mitigation: An Expert-Annotated Dataset

  • paper_url: http://arxiv.org/abs/2311.09443
  • repo_url: None
  • paper_authors: Brooklyn Sheppard, Anna Richter, Allison Cohen, Elizabeth Allyn Smith, Tamara Kneese, Carolyne Pelletier, Ioana Baldini, Yue Dong
  • for: To develop a new dataset that captures the nuance and subtlety of misogyny.
  • methods: The dataset was built in collaboration with multi-disciplinary experts and the annotators themselves; it contains annotations of movie subtitles, capturing colloquial expressions of misogyny in North American film.
  • results: The paper provides baselines for misogyny detection and mitigation using common NLP algorithms and analyzes the obtained annotations; the authors hope the work will promote AI for social good in NLP for bias detection, explanation, and removal.
    Abstract Using novel approaches to dataset development, the Biasly dataset captures the nuance and subtlety of misogyny in ways that are unique within the literature. Built in collaboration with multi-disciplinary experts and annotators themselves, the dataset contains annotations of movie subtitles, capturing colloquial expressions of misogyny in North American film. The dataset can be used for a range of NLP tasks, including classification, severity score regression, and text generation for rewrites. In this paper, we discuss the methodology used, analyze the annotations obtained, and provide baselines using common NLP algorithms in the context of misogyny detection and mitigation. We hope this work will promote AI for social good in NLP for bias detection, explanation, and removal.

Labeled Interactive Topic Models

  • paper_url: http://arxiv.org/abs/2311.09438
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Kyle Seelman, Mozhi Zhang, Jordan Boyd-Graber
  • for: To improve topic refinement in neural topic models.
  • methods: Users label a topic with a word, and the topic is updated so that its top words move closer to the label, better matching the user's information need.
  • results: In a human study, user labeling improves document rank scores, helping users find documents more relevant to a given query.
    Abstract Topic models help users understand large document collections; however, topic models do not always find the "right" topics. While classical probabilistic and anchor-based topic models have interactive variants to guide models toward better topics, such interactions are not available for neural topic models such as the embedded topic model (ETM). We correct this lacuna by adding an intuitive interaction to neural topic models: users can label a topic with a word, and topics are updated so that the topic words are close to the label. This allows a user to refine topics based on their information need. While this interactivity is intuitive for ETM, we extend the framework to work with other neural topic models as well. We develop an interactive interface which allows users to interact with and relabel topic models as they see fit. We evaluate our method through a human study, where users can relabel topics to find relevant documents. Using our method, user labeling improves document rank scores, helping to find more relevant documents for a given query when compared to no user labeling.
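The interaction the abstract describes, updating a topic so its words move toward a user-supplied label, can be sketched as a simple embedding-space nudge. This is our illustrative reading, not the authors' update rule; the interpolation step size is an assumption.

```python
import numpy as np

def relabel_topic(topic_emb, label_emb, step=0.5):
    """Move a topic embedding toward the label word's embedding, so the
    topic's nearest words in embedding space shift toward the label."""
    updated = (1 - step) * topic_emb + step * label_emb
    return updated / np.linalg.norm(updated)

def top_words(topic_emb, word_embs, vocab, k=10):
    """Rank vocabulary words by similarity to the (unit-norm) topic embedding."""
    sims = word_embs @ topic_emb / (np.linalg.norm(word_embs, axis=1) + 1e-9)
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```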

Striped Attention: Faster Ring Attention for Causal Transformers

  • paper_url: http://arxiv.org/abs/2311.09431
  • repo_url: https://github.com/exists-forall/striped_attention
  • paper_authors: William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley
  • for: To support the growing demand for ever-longer sequence lengths in transformer models.
  • methods: Builds on the Ring Attention algorithm, which overcomes per-device memory bottlenecks, and proposes the Striped Attention extension to fix the workload imbalance of causal attention.
  • results: Achieves up to 1.45x end-to-end throughput improvements over Ring Attention on causal transformer training at a sequence length of 256k, and 1.65x speedups at a sequence length of 786k on 16 TPUv4 chips.
    Abstract To help address the growing demand for ever-longer sequence lengths in transformer models, Liu et al. recently proposed Ring Attention, an exact attention algorithm capable of overcoming per-device memory bottlenecks by distributing self-attention across multiple devices. In this paper, we study the performance characteristics of Ring Attention in the important special case of causal transformer models, and identify a key workload imbalance due to the triangular structure of causal attention computations. We propose a simple extension to Ring Attention, which we call Striped Attention, to fix this imbalance. Instead of devices having contiguous subsequences, each device has a subset of tokens distributed uniformly throughout the sequence, which we demonstrate leads to more even workloads. In experiments running Striped Attention on A100 GPUs and TPUv4s, we are able to achieve up to 1.45x end-to-end throughput improvements over the original Ring Attention algorithm on causal transformer training at a sequence length of 256k. Furthermore, on 16 TPUv4 chips, we were able to achieve 1.65x speedups at sequence lengths of 786k. We release the code for our experiments as open source.
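The core difference between the two schemes is how tokens are assigned to devices. A toy sketch (device counts and sequence lengths are illustrative):

```python
def ring_partition(seq_len, n_devices):
    """Ring Attention: each device holds one contiguous chunk, so under a
    causal mask early devices have far less attention work than late ones."""
    chunk = seq_len // n_devices
    return [list(range(d * chunk, (d + 1) * chunk)) for d in range(n_devices)]

def striped_partition(seq_len, n_devices):
    """Striped Attention: tokens are dealt out round-robin, so every device
    holds tokens spread uniformly across the sequence and the triangular
    causal workload is roughly balanced."""
    return [list(range(d, seq_len, n_devices)) for d in range(n_devices)]

print(ring_partition(16, 4))     # [[0, 1, 2, 3], [4, 5, 6, 7], ...]
print(striped_partition(16, 4))  # [[0, 4, 8, 12], [1, 5, 9, 13], ...]
```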

Predicting generalization performance with correctness discriminators

  • paper_url: http://arxiv.org/abs/2311.09422
  • repo_url: None
  • paper_authors: Yuekun Yao, Alexander Koller
  • for: To predict an NLP model's accuracy on unseen, potentially out-of-distribution data, a prerequisite for trustworthiness.
  • methods: Proposes a novel model that trains a discriminator to predict whether the output of a given sequence-to-sequence model is correct, establishing upper and lower bounds on accuracy without gold labels for the unseen data.
  • results: Across a variety of tagging, parsing, and semantic parsing tasks, the gold accuracy reliably falls between the predicted upper and lower bounds, and these bounds are remarkably close together.
    Abstract The ability to predict an NLP model's accuracy on unseen, potentially out-of-distribution data is a prerequisite for trustworthiness. We present a novel model that establishes upper and lower bounds on the accuracy, without requiring gold labels for the unseen data. We achieve this by training a discriminator which predicts whether the output of a given sequence-to-sequence model is correct or not. We show across a variety of tagging, parsing, and semantic parsing tasks that the gold accuracy is reliably between the predicted upper and lower bounds, and that these bounds are remarkably close together.
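One plausible way to read the bound construction (our illustration; the paper may derive its bounds differently) is to count outputs the discriminator confidently judges correct for a lower bound, and outputs not confidently judged incorrect for an upper bound. The thresholds below are assumptions.

```python
def accuracy_bounds(correct_probs, hi=0.9, lo=0.1):
    """correct_probs: discriminator's P(output is correct) per test example."""
    n = len(correct_probs)
    lower = sum(p >= hi for p in correct_probs) / n   # confidently correct
    upper = sum(p > lo for p in correct_probs) / n    # not confidently wrong
    return lower, upper

print(accuracy_bounds([0.95, 0.97, 0.5, 0.05]))  # (0.5, 0.75)
```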

Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture

  • paper_url: http://arxiv.org/abs/2311.09406
  • repo_url: None
  • paper_authors: James Bernhard
  • for: To avoid dot products growing so large that applying softmax leads to vanishing gradients in attention.
  • methods: Proposes alternative scalings, including dividing the dot product by the sum of the key lengths before applying softmax.
  • results: Experiments with simulated keys and queries show that in many situations these scalings are more effective at avoiding regions where softmax leads to vanishing gradients.
    Abstract The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax. This scaling of the dot product is designed to avoid the absolute value of the dot products becoming so large that applying softmax leads to vanishing gradients. In this paper, we propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax. We use simulated keys and queries to show that in many situations this appears to be more effective at avoiding regions where applying softmax leads to vanishing gradients.
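The proposed change is a one-line modification to the attention score normalization. A minimal numpy sketch, where we read "key length" as the Euclidean norm of each key vector (an assumption on our part):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K):
    """Standard transformer scaling: divide scores by sqrt(key dimension)."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]))

def key_length_attention(Q, K):
    """Alternative scaling: divide scores by the sum of the key vector norms,
    which adapts the scale to the actual magnitudes of the keys."""
    return softmax(Q @ K.T / np.linalg.norm(K, axis=-1).sum())
```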

To Translate or Not to Translate: A Systematic Investigation of Translation-Based Cross-Lingual Transfer to Low-Resource Languages

  • paper_url: http://arxiv.org/abs/2311.09404
  • repo_url: None
  • paper_authors: Benedikt Ebing, Goran Glavaš
  • for: To systematically evaluate existing, and propose new, translation-based cross-lingual transfer (XLT) approaches for low-resource languages.
  • methods: Uses translation-based XLT, including round-trip translation of the source-language training data combined with translation of the target-language test instances, and further strengthens models by adding reliable translations into other high-resource languages to the training data.
  • results: All translation-based approaches dramatically outperform zero-shot XLT with multilingual LMs; adding translations into other high-resource languages yields further gains; an effective translation-based strategy is proposed even for languages unsupported by the MT system; and model selection based on MT-obtained target-language validation data outperforms selection based on source-language data.
    Abstract Perfect machine translation (MT) would render cross-lingual transfer (XLT) by means of multilingual language models (LMs) superfluous. Given, on one hand, the large body of work on improving XLT with multilingual LMs and, on the other hand, recent advances in massively multilingual MT, in this work, we systematically evaluate existing and propose new translation-based XLT approaches for transfer to low-resource languages. We show that all translation-based approaches dramatically outperform zero-shot XLT with multilingual LMs, rendering the approach that combines the round-trip translation of the source-language training data with the translation of the target-language test instances the most effective. We next show that one can obtain further empirical gains by adding reliable translations to other high-resource languages to the training data. Moreover, we propose an effective translation-based XLT strategy even for languages not supported by the MT system. Finally, we show that model selection for XLT based on target-language validation data obtained with MT outperforms model selection based on the source-language data. We hope that our findings encourage adoption of more robust translation-based baselines in XLT research.
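The best-performing recipe the abstract reports, round-trip-translating the source-language training data and translating the target-language test instances, can be sketched as below; `mt` is a hypothetical stand-in for any machine translation system.

```python
def build_xlt_data(train_src, test_tgt, mt, src_lang="en", tgt_lang="xx"):
    """Round-trip translate the training data and translate the test set."""
    train = [mt(mt(x, src_lang, tgt_lang), tgt_lang, src_lang) for x in train_src]
    test = [mt(x, tgt_lang, src_lang) for x in test_tgt]
    return train, test
```

The usual motivation for the round trip is that the training text then exhibits the same "translationese" artifacts the model will see in the translated test inputs.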

LEEETs-Dial: Linguistic Entrainment in End-to-End Task-oriented Dialogue systems

  • paper_url: http://arxiv.org/abs/2311.09390
  • repo_url: None
  • paper_authors: Nalin Kumar, Ondřej Dušek
  • for: To make dialogue systems more natural by achieving linguistic entrainment (alignment) with the user.
  • methods: Uses a GPT-2-based end-to-end dialogue system and promotes shared vocabulary with the user, experimenting with training instance weighting, an entrainment-specific loss, and additional conditioning to generate responses that align with the user.
  • results: Comparing the entrainment techniques on the MultiWOZ dataset, all three significantly improve alignment over the baseline, as confirmed by both automatic and manual evaluation metrics.
    Abstract Linguistic entrainment, or alignment, represents a phenomenon where linguistic patterns employed by conversational participants converge to one another. While alignment has been shown to produce a more natural user experience, most dialogue systems do not have any provisions for it. In this work, we introduce methods for achieving dialogue alignment in a GPT-2-based end-to-end dialogue system through the utilization of shared vocabulary. We experiment with training instance weighting, alignment-specific loss, and additional conditioning to generate responses that align with the user. By comparing different entrainment techniques on the MultiWOZ dataset, we demonstrate that all three approaches produce significantly better-aligned results than the baseline, as confirmed by both automated and manual evaluation metrics.
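Of the three techniques, training instance weighting is the simplest to sketch: weight each example by how much vocabulary the reference response shares with the user turn, so the model is pushed toward lexically entrained responses. The weighting form below is illustrative, not the paper's exact formula.

```python
def entrainment_weight(user_turn, system_response):
    """Upweight training examples whose response reuses the user's words."""
    user = set(user_turn.lower().split())
    resp = set(system_response.lower().split())
    return 1.0 + len(user & resp) / max(len(resp), 1)

# per-example loss scaling during fine-tuning (schematic):
# loss = entrainment_weight(user, ref) * cross_entropy(model(user), ref)
```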

Neural machine translation for automated feedback on children’s early-stage writing

  • paper_url: http://arxiv.org/abs/2311.09389
  • repo_url: None
  • paper_authors: Jonas Vestergaard Jensen, Mikkel Jordahn, Michael Riis Andersen
  • for: To automatically assess and construct feedback for children's early-stage writing using machine learning.
  • methods: Proposes sequence-to-sequence models that "translate" early-stage writing into conventional writing so it can be analyzed with linguistic metrics, plus a novel robust likelihood to mitigate the effect of noise in the dataset.
  • results: Numerical experiments validate that the conventional text can be predicted with high accuracy.
    Abstract In this work, we address the problem of assessing and constructing feedback for early-stage writing automatically using machine learning. Early-stage writing is typically vastly different from conventional writing due to phonetic spelling and lack of proper grammar, punctuation, spacing etc. Consequently, early-stage writing is highly non-trivial to analyze using common linguistic metrics. We propose to use sequence-to-sequence models for "translating" early-stage writing by students into "conventional" writing, which allows the translated text to be analyzed using linguistic metrics. Furthermore, we propose a novel robust likelihood to mitigate the effect of noise in the dataset. We investigate the proposed methods using a set of numerical experiments and demonstrate that the conventional text can be predicted with high accuracy.
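A common form of noise-robust likelihood, and one plausible reading of the abstract (the paper's exact formulation may differ), mixes the model's token distribution with a uniform outlier distribution so that noisy reference tokens are not penalized without bound:

```python
import torch

def robust_nll(logits, targets, eps=0.05):
    """Negative log of (1 - eps) * p_model + eps * uniform, per target token."""
    probs = torch.softmax(logits, dim=-1)
    vocab = probs.shape[-1]
    mixed = (1 - eps) * probs + eps / vocab
    token_probs = mixed.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -torch.log(token_probs).mean()
```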

Banach-Tarski Embeddings and Transformers

  • paper_url: http://arxiv.org/abs/2311.09387
  • repo_url: https://github.com/jtmaher/embedding
  • paper_authors: Joshua Maher
  • for: To propose a new construction for embedding arbitrary recursive data structures into high-dimensional vectors.
  • methods: The embeddings provide an interpretable model for the latent state vectors of transformers; when the embedding dimension is sufficiently large, an embedding can be decoded back to the original data structure, and the decoding algorithm has a natural implementation as a transformer.
  • results: The embedded vectors can also be manipulated directly to perform computations on the underlying data without decoding; as an example, the paper gives an algorithm that constructs the embedded parse tree of an embedded token sequence using only vector operations in embedding space.
    Abstract We introduce a new construction of embeddings of arbitrary recursive data structures into high dimensional vectors. These embeddings provide an interpretable model for the latent state vectors of transformers. We demonstrate that these embeddings can be decoded to the original data structure when the embedding dimension is sufficiently large. This decoding algorithm has a natural implementation as a transformer. We also show that these embedding vectors can be manipulated directly to perform computations on the underlying data without decoding. As an example we present an algorithm that constructs the embedded parse tree of an embedded token sequence using only vector operations in embedding space.

Long-form Question Answering: An Iterative Planning-Retrieval-Generation Approach

  • paper_url: http://arxiv.org/abs/2311.09383
  • repo_url: None
  • paper_authors: Pritom Saha Akash, Kashob Kumar Roy, Lucian Popa, Kevin Chen-Chuan Chang
  • for: To address long-form question answering (LFQA), which requires generating detailed paragraph-length answers rather than simple yes/no responses or short factual answers.
  • methods: Proposes an LFQA model that iterates over planning, retrieval, and generation until a complete answer is produced for the given question.
  • results: Extensive experiments on an open-domain and a technical-domain QA dataset show that the model outperforms state-of-the-art models on various textual and factual metrics.
    Abstract Long-form question answering (LFQA) poses a challenge as it involves generating detailed answers in the form of paragraphs, which go beyond simple yes/no responses or short factual answers. While existing QA models excel in questions with concise answers, LFQA requires handling multiple topics and their intricate relationships, demanding comprehensive explanations. Previous attempts at LFQA focused on generating long-form answers by utilizing relevant contexts from a corpus, relying solely on the question itself. However, they overlooked the possibility that the question alone might not provide sufficient information to identify the relevant contexts. Additionally, generating detailed long-form answers often entails aggregating knowledge from diverse sources. To address these limitations, we propose an LFQA model with iterative Planning, Retrieval, and Generation. This iterative process continues until a complete answer is generated for the given question. From an extensive experiment on both an open domain and a technical domain QA dataset, we find that our model outperforms the state-of-the-art models on various textual and factual metrics for the LFQA task.
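The iterative loop can be sketched schematically; `plan`, `retrieve`, `generate`, and `is_complete` are hypothetical stand-ins for the paper's components.

```python
def iterative_lfqa(question, plan, retrieve, generate, is_complete, max_iters=5):
    """Plan what is still missing, retrieve for it, regenerate, repeat."""
    answer = ""
    for _ in range(max_iters):
        queries = plan(question, answer)                        # aspects still uncovered
        contexts = [doc for q in queries for doc in retrieve(q)]
        answer = generate(question, contexts, answer)
        if is_complete(question, answer):
            break
    return answer
```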

A Survey on Online User Aggression: Content Detection and Behavioural Analysis on Social Media Platforms

  • paper_url: http://arxiv.org/abs/2311.09367
  • repo_url: None
  • paper_authors: Swapnil Mane, Suman Kundu, Rajesh Sharma
  • for: This paper aims to bridge the gap between disparate studies on aggression content detection and behavioral analysis of aggressive users in the context of cyber-aggressive behavior.
  • methods: The paper examines the comprehensive process of aggression content detection, including dataset creation, feature selection and extraction, and detection algorithm development. It also reviews studies on behavioral analysis of aggression that explore influencing factors, consequences, and patterns associated with cyber-aggressive behavior.
  • results: The paper identifies research gaps and encourages further progress in the unified domain of socio-computational aggressive behavior analysis.
    Abstract The rise of social media platforms has led to an increase in cyber-aggressive behavior, encompassing a broad spectrum of hostile behavior, including cyberbullying, online harassment, and the dissemination of offensive and hate speech. These behaviors have been associated with significant societal consequences, ranging from online anonymity to real-world outcomes such as depression, suicidal tendencies, and, in some instances, offline violence. Recognizing the societal risks associated with unchecked aggressive content, this paper delves into the field of Aggression Content Detection and Behavioral Analysis of Aggressive Users, aiming to bridge the gap between disparate studies. In this paper, we analyzed the diversity of definitions and proposed a unified cyber-aggression definition. We examine the comprehensive process of Aggression Content Detection, spanning from dataset creation, feature selection and extraction, and detection algorithm development. Further, we review studies on Behavioral Analysis of Aggression that explore the influencing factors, consequences, and patterns associated with cyber-aggressive behavior. This systematic literature review is a cross-examination of content detection and behavioral analysis in the realm of cyber-aggression. The integrated investigation reveals the effectiveness of incorporating sociological insights into computational techniques for preventing cyber-aggressive behavior. Finally, the paper concludes by identifying research gaps and encouraging further progress in the unified domain of socio-computational aggressive behavior analysis.

Investigating the Emergent Audio Classification Ability of ASR Foundation Models

  • paper_url: http://arxiv.org/abs/2311.09363
  • repo_url: None
  • paper_authors: Rao Ma, Adian Liusie, Mark J. F. Gales, Kate M. Knill
  • for: To investigate the zero-shot audio classification abilities of the ASR foundation models Whisper and MMS, which were trained primarily for speech recognition.
  • methods: Uses simple template-based text prompts at the decoder and the resulting decoding probabilities to generate zero-shot predictions, without training on extra data or adding new parameters; a simple unsupervised reweighting of the class probabilities (debiasing) yields consistent significant gains.
  • results: Whisper shows promising zero-shot performance across 8 audio-classification datasets, outperforming the accuracy of the existing state-of-the-art zero-shot baseline by an average of 9%; performance increases with model size, suggesting that larger ASR foundation models may exhibit improved zero-shot performance.
    Abstract Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. However, there has been significantly less work on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming existing state-of-the-art zero-shot baseline's accuracy by an average of 9%. One important step to unlock the emergent ability is debiasing, where a simple unsupervised reweighting method of the class probabilities yields consistent significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.
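The zero-shot recipe reduces to scoring each class label's templated text under the ASR decoder and then reweighting by an unsupervised class prior. A numpy sketch; `label_logprob` and the prompt template are hypothetical stand-ins.

```python
import numpy as np

def zero_shot_classify(audios, labels, label_logprob):
    # scores[i, j] = log p(template(label_j) | audio_i) from the ASR decoder
    scores = np.array([[label_logprob(a, f"This is a sound of {l}.")
                        for l in labels] for a in audios])
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # unsupervised debiasing: divide out the average class probability over
    # the unlabeled test set so no class is favored a priori
    prior = probs.mean(axis=0)
    return (probs / prior).argmax(axis=1)
```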

LePaRD: A Large-Scale Dataset of Judges Citing Precedents

  • paper_url: http://arxiv.org/abs/2311.09356
  • repo_url: https://github.com/rmahari/lepard
  • paper_authors: Robert Mahari, Dominik Stammbach, Elliott Ash, Alex 'Sandy' Pentland
  • for: To develop practical legal NLP that helps expand access to justice by reducing the burden associated with legal research.
  • methods: Presents LePaRD, a massive collection of U.S. federal judicial citations to precedent in context, and extensively evaluates retrieval approaches, including contextualized word embeddings and text classification, for legal passage prediction.
  • results: Classification approaches appear to work best, but legal passage prediction remains a difficult task with significant room for improvement.
    Abstract We present the Legal Passage Retrieval Dataset LePaRD. LePaRD is a massive collection of U.S. federal judicial citations to precedent in context. The dataset aims to facilitate work on legal passage prediction, a challenging practice-oriented legal retrieval and reasoning task. Legal passage prediction seeks to predict relevant passages from precedential court decisions given the context of a legal argument. We extensively evaluate various retrieval approaches on LePaRD, and find that classification appears to work best. However, we note that legal precedent prediction is a difficult task, and there remains significant room for improvement. We hope that by publishing LePaRD, we will encourage others to engage with a legal NLP task that promises to help expand access to justice by reducing the burden associated with legal research. A subset of the LePaRD dataset is freely available and the whole dataset will be released upon publication.

Language and Task Arithmetic with Parameter-Efficient Layers for Zero-Shot Summarization

  • paper_url: http://arxiv.org/abs/2311.09344
  • repo_url: None
  • paper_authors: Alexandra Chronopoulou, Jonas Pfeiffer, Joshua Maynez, Xinyi Wang, Sebastian Ruder, Priyanka Agrawal
  • for: To improve large language models' performance on downstream tasks via parameter-efficient fine-tuning (PEFT), especially for the many languages that lack labeled data.
  • methods: Proposes composing language- and task-specialized PEFT modules via element-wise arithmetic operations, leveraging unlabeled data and English labeled data; when labeled data from more languages is available, PEFT modules trained on languages related to the target are composed arithmetically.
  • results: Empirical results on summarization show the method is an effective strategy that obtains consistent gains with minimal training of PEFT modules.
    Abstract Parameter-efficient fine-tuning (PEFT) using labeled task data can significantly improve the performance of large language models (LLMs) on the downstream task. However, there are 7000 languages in the world and many of these languages lack labeled data for real-world language generation tasks. In this paper, we propose to improve zero-shot cross-lingual transfer by composing language or task specialized parameters. Our method composes language and task PEFT modules via element-wise arithmetic operations to leverage unlabeled data and English labeled data. We extend our approach to cases where labeled data from more languages is available and propose to arithmetically compose PEFT modules trained on languages related to the target. Empirical results on summarization demonstrate that our method is an effective strategy that obtains consistent gains using minimal training of PEFT modules.
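Element-wise composition of PEFT modules can be sketched directly over matching parameter tensors. The specific formula below (English task module plus the target-minus-source language difference) is one plausible composition, not necessarily the paper's exact recipe:

```python
import torch

def compose_peft(task_en, lang_tgt, lang_en):
    """Element-wise task arithmetic over matching PEFT tensors (illustrative)."""
    return {name: task_en[name] + lang_tgt[name] - lang_en[name]
            for name in task_en}

# toy usage with LoRA-shaped tensors
shapes = {"lora_A": (8, 768), "lora_B": (768, 8)}
task_en, lang_tgt, lang_en = (
    {n: torch.randn(*s) for n, s in shapes.items()} for _ in range(3))
composed = compose_peft(task_en, lang_tgt, lang_en)
```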

Pinpoint, Not Criticize: Refining Large Language Models via Fine-Grained Actionable Feedback

  • paper_url: http://arxiv.org/abs/2311.09336
  • repo_url: None
  • paper_authors: Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, Markus Freitag
  • for: To improve text generation quality at inference time.
  • methods: Uses fine-grained actionable feedback (error type, error location, and severity level) predicted by a learned error-pinpoint model for iterative refinement, formulated as a local search solved with a simulated-annealing algorithm.
  • results: Observes 0.8 and 0.7 MetricX gains on Chinese-English and English-German translation, and 4.5 and 1.8 ROUGE-L gains on long-form QA and topical summarization, with a single refinement iteration; the simulated-annealing algorithm yields further quality improvements of up to 1.7 MetricX over the baseline.
    Abstract Recent improvements in text generation have leveraged human feedback to improve the quality of the generated output. However, human feedback is not always available, especially during inference. In this work, we propose an inference time optimization method FITO to use fine-grained actionable feedback in the form of error type, error location and severity level that are predicted by a learned error pinpoint model for iterative refinement. FITO starts with an initial output, then iteratively incorporates the feedback via a refinement model that generates an improved output conditioned on the feedback. Given the uncertainty of consistent refined samples at iterative steps, we formulate iterative refinement into a local search problem and develop a simulated annealing based algorithm that balances exploration of the search space and optimization for output quality. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA) and topical summarization. We observe 0.8 and 0.7 MetricX gain on Chinese-English and English-German translation, 4.5 and 1.8 ROUGE-L gain at long form QA and topic summarization respectively, with a single iteration of refinement. With our simulated annealing algorithm, we see further quality improvements, including up to 1.7 MetricX improvements over the baseline approach.
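The simulated-annealing refinement loop can be sketched as follows; `pinpoint_errors`, `refine`, and `quality` are hypothetical stand-ins for the learned error-pinpoint model, the refinement model, and a quality metric.

```python
import math
import random

def fito_refine(output, pinpoint_errors, refine, quality, steps=10, t0=1.0):
    """Accept improving rewrites always; accept worsening ones with a
    probability that shrinks as the temperature anneals toward zero."""
    best = current = output
    for step in range(steps):
        feedback = pinpoint_errors(current)        # error type, location, severity
        candidate = refine(current, feedback)
        delta = quality(candidate) - quality(current)
        temperature = t0 * (1 - step / steps) + 1e-6
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current = candidate
        if quality(current) > quality(best):
            best = current
    return best
```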

Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09214
  • repo_url: None
  • paper_authors: Weize Liu, Guocong Li, Kai Zhang, Bang Du, Qiyuan Chen, Xuming Hu, Hongxia Xu, Jintai Chen, Jian Wu
  • for: To improve the performance of small language models (SLMs) and align them more closely with human cognition.
  • methods: A twofold approach: first, distill the self-evaluation capability inherent in LLMs into SLMs to mitigate the adverse effects of erroneous reasoning and reduce hallucinations; second, a comprehensive distillation process incorporating multiple distinct chain-of-thought and self-evaluation paradigms for a more holistic and robust knowledge transfer.
  • results: Experiments on three NLP benchmarks show the method significantly improves the performance of distilled SLMs and sheds light on the path toward developing smaller models closely aligned with human cognition.
    Abstract Large language models (LLMs) have achieved remarkable advancements in the field of natural language processing. However, the sheer scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained contexts. While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still carry over flawed reasoning or hallucinations inherited from their LLM counterparts. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability inherent in LLMs into SLMs, which aims to mitigate the adverse effects of erroneous reasoning and reduce hallucinations. Second, we advocate for a comprehensive distillation process that incorporates multiple distinct chain-of-thought and self-evaluation paradigms and ensures a more holistic and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs and sheds light on the path towards developing smaller models closely aligned with human cognition.

GRIM: GRaph-based Interactive narrative visualization for gaMes

  • paper_url: http://arxiv.org/abs/2311.09213
  • repo_url: None
  • paper_authors: Jorge Leandro, Sudha Rao, Michael Xu, Weijia Xu, Nebojsa Jojic, Chris Brockett, Bill Dolan
  • for: To assist story creation for dialogue-based role-playing games (RPGs).
  • methods: Uses a large generative text model to assist the creative process.
  • results: GRIM generates a rich narrative graph with branching storylines that match a high-level narrative description and constraints; designers can interactively edit the graph, with new sub-graphs automatically generated to fit their edits within the original narrative and constraints.
    Abstract Dialogue-based Role Playing Games (RPGs) require powerful storytelling. The narratives of these may take years to write and typically involve a large creative team. In this work, we demonstrate the potential of large generative text models to assist this process. GRIM, a prototype GRaph-based Interactive narrative visualization system for gaMes, generates a rich narrative graph with branching storylines that match a high-level narrative description and constraints provided by the designer. Game designers can interactively edit the graph by automatically generating new sub-graphs that fit the edits within the original narrative and constraints. We illustrate the use of GRIM in conjunction with GPT-4, generating branching narratives for four well-known stories with different contextual constraints.

Contrastive Chain-of-Thought Prompting

  • paper_url: http://arxiv.org/abs/2311.09277
  • repo_url: https://github.com/damo-nlp-sg/contrastive-cot
  • paper_authors: Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing
  • for: To enhance language model reasoning.
  • methods: Contrastive chain-of-thought prompting provides both valid and invalid reasoning demonstrations, guiding the model to reason step by step while reducing reasoning mistakes; an automatic method constructs the contrastive demonstrations to improve generalization.
  • results: Experiments on reasoning benchmarks show that contrastive chain of thought serves as a general enhancement of chain-of-thought prompting.
    Abstract Despite the success of chain of thought in enhancing language model reasoning, the underlying process remains less well understood. Although logically sound reasoning appears inherently crucial for chain of thought, prior studies surprisingly reveal minimal impact when using invalid demonstrations instead. Furthermore, the conventional chain of thought does not inform language models on what mistakes to avoid, which potentially leads to more errors. Hence, inspired by how humans can learn from both positive and negative examples, we propose contrastive chain of thought to enhance language model reasoning. Compared to the conventional chain of thought, our approach provides both valid and invalid reasoning demonstrations, to guide the model to reason step-by-step while reducing reasoning mistakes. To improve generalization, we introduce an automatic method to construct contrastive demonstrations. Our experiments on reasoning benchmarks demonstrate that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
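A contrastive chain-of-thought prompt pairs a valid demonstration with an invalid one labeled as a mistake to avoid. The template below is our illustration, not the authors' exact prompt:

```python
VALID_DEMO = (
    "Q: A shop sells pens at $2 each. How much do 3 pens cost?\n"
    "Correct reasoning: Each pen costs $2, so 3 x $2 = $6. The answer is $6."
)
INVALID_DEMO = (
    "Q: A shop sells pens at $2 each. How much do 3 pens cost?\n"
    "Incorrect reasoning: 3 pens plus $2 is 5, so the answer is $5. "
    "(Mistake: the quantity should be multiplied by the price, not added.)"
)

def contrastive_cot_prompt(question):
    """Show the model what good and bad reasoning look like, then ask."""
    return f"{VALID_DEMO}\n\n{INVALID_DEMO}\n\nQ: {question}\nCorrect reasoning:"
```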

TableLlama: Towards Open Large Generalist Models for Tables

  • paper_url: http://arxiv.org/abs/2311.09206
  • repo_url: None
  • paper_authors: Tianshu Zhang, Xiang Yue, Yifei Li, Huan Sun
  • for: To develop open-source large language models (LLMs) as generalists for a diversity of table-based tasks.
  • methods: The authors construct TableInstruct, a new dataset with a variety of realistic tables and tasks, and fine-tune an open-source model, TableLlama, from Llama 2 (7B) with LongLoRA to address the long-context challenge.
  • results: TableLlama achieves comparable or better performance than the state of the art (SOTA) on 7 of 8 in-domain tasks, and shows 6-48 absolute-point gains on 6 out-of-domain datasets, demonstrating the model's generalizability.
    Abstract Semi-structured tables are ubiquitous. There has been a variety of tasks that aim to automatically interpret, augment, and query tables. Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or have simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language models (LLMs) as generalists for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a new dataset with a variety of realistic tables and tasks, for instruction tuning and evaluating LLMs. We further develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the long context challenge. We experiment under both in-domain setting and out-of-domain setting. On 7 out of 8 in-domain tasks, TableLlama achieves comparable or better performance than the SOTA for each task, despite the latter often has task-specific design. On 6 out-of-domain datasets, it achieves 6-48 absolute point gains compared with the base model, showing that training on TableInstruct enhances the model's generalizability. We will open-source our dataset and trained model to boost future work on developing open generalist models for tables.

When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

  • paper_url: http://arxiv.org/abs/2311.09205
  • repo_url: https://github.com/tylerachang/curse-of-multilinguality
  • paper_authors: Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen
  • for: To investigate the effects of multilinguality on language modeling performance in individual languages.
  • methods: Pre-trains over 10,000 monolingual and multilingual language models for over 250 languages, including language families under-studied in NLP, and assesses how performance varies with monolingual dataset size, added multilingual dataset size, linguistic similarity of the added languages, and model size (up to 45M parameters).
  • results: In moderation, adding multilingual data improves low-resource language modeling performance, comparable to increasing low-resource dataset sizes by up to 33%; the improvements depend on the syntactic similarity of the added data, with only marginal additional effects from vocabulary overlap. High-resource languages consistently perform worse under multilingual pre-training, and as dataset sizes increase, added multilingual data begins to hurt performance for all languages, likely due to limited model capacity (the "curse of multilinguality"). This suggests massively multilingual pre-training may not be optimal for any of the languages involved, while more targeted models can significantly improve performance.
    Abstract Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance.

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

  • paper_url: http://arxiv.org/abs/2311.09194
  • repo_url: None
  • paper_authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen
  • for: To test how abstract grammatical knowledge is in large language models and whether it is shared across languages.
  • methods: Measures monolingual and crosslingual structural priming in large language models, comparing model behavior to human results from eight crosslingual experiments covering six languages and four monolingual experiments in three non-English languages.
  • results: The models show evidence of abstract monolingual and crosslingual grammatical representations that function similarly to those found in humans; these representations are not only similar across languages but can causally influence text produced in different languages.
    Abstract Abstract grammatical knowledge - of parts of speech and grammatical patterns - is key to the capacity for linguistic generalization in humans. But how abstract is grammatical knowledge in large language models? In the human literature, compelling evidence for grammatical abstraction comes from structural priming. A sentence that shares the same grammatical structure as a preceding sentence is processed and produced more readily. Because confounds exist when using stimuli in a single language, evidence of abstraction is even more compelling from crosslingual structural priming, where use of a syntactic structure in one language primes an analogous structure in another language. We measure crosslingual structural priming in large language models, comparing model behavior to human experimental results from eight crosslingual experiments covering six languages, and four monolingual structural priming experiments in three non-English languages. We find evidence for abstract monolingual and crosslingual grammatical representations in the models that function similarly to those found in humans. These results demonstrate that grammatical representations in multilingual language models are not only similar across languages, but they can causally influence text produced in different languages.
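Structural priming in a language model can be quantified as a log-probability shift: the target sentence should be more probable after a prime that shares its syntactic structure than after one that does not. A schematic sketch, with `sentence_logprob` as a hypothetical model-scoring stand-in:

```python
def priming_effect(prime_same, prime_diff, target, sentence_logprob):
    """Positive values indicate structural priming. In the crosslingual case,
    the primes and the target are simply in different languages."""
    lp_same = sentence_logprob(f"{prime_same} {target}")
    lp_diff = sentence_logprob(f"{prime_diff} {target}")
    return lp_same - lp_diff
```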

PsyEval: A Comprehensive Large Language Model Evaluation Benchmark for Mental Health

  • paper_url: http://arxiv.org/abs/2311.09189
  • repo_url: None
  • paper_authors: Haoan Jin, Siyuan Chen, Mengyue Wu, Kenny Q. Zhu
  • for: To provide the first comprehensive benchmark for evaluating large language models (LLMs) in the mental health domain, filling a current gap in the field.
  • methods: Six sub-tasks covering three dimensions systematically assess LLM capabilities in mental health, with concise prompts designed for each sub-task; eight advanced LLMs are comprehensively evaluated.
  • results: Experiments show significant room for improvement in current LLMs concerning mental health and unveil potential directions for future model optimization.
    Abstract Recently, there has been a growing interest in utilizing large language models (LLMs) in mental health research, with studies showcasing their remarkable capabilities, such as disease detection. However, there is currently a lack of a comprehensive benchmark for evaluating the capability of LLMs in this domain. Therefore, we address this gap by introducing the first comprehensive benchmark tailored to the unique characteristics of the mental health domain. This benchmark encompasses a total of six sub-tasks, covering three dimensions, to systematically assess the capabilities of LLMs in the realm of mental health. We have designed corresponding concise prompts for each sub-task. And we comprehensively evaluate a total of eight advanced LLMs using our benchmark. Experiment results not only demonstrate significant room for improvement in current LLMs concerning mental health but also unveil potential directions for future model optimization.

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

  • paper_url: http://arxiv.org/abs/2311.09184
  • repo_url: https://github.com/yale-nlp/instrusum
  • paper_authors: Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan
  • for: To study the performance of large language models (LLMs) in a more complex summarization setting where desired summary characteristics are specified.
  • methods: Curates an evaluation-only dataset in which each input pairs a source article with a natural-language requirement for the summary; conducts human evaluation on 5 LLM-based summarization systems and benchmarks LLM-based automatic evaluation with 4 evaluation protocols and 11 LLMs, 40 evaluation methods in total.
  • results: Instruction-controllable summarization remains challenging: (1) all evaluated LLMs still make factual and other errors in their summaries; (2) no LLM-based evaluation method achieves strong alignment with human annotators when judging summary quality; and (3) different LLMs show large performance gaps in summary generation and evaluation.
    Abstract While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluation on 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) all LLM-based evaluation methods cannot achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.

ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09182
  • repo_url: None
  • paper_authors: Jierui Li, Vipul Raheja, Dhruv Kumar
  • for: To study LLMs' ability to detect self-contradictions in long documents.
  • methods: Introduces ContraDoc, the first human-annotated dataset of self-contradictions in long documents across multiple domains, document lengths, self-contradiction types, and scopes, and analyzes four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2.
  • results: GPT4 performs best and can outperform humans on this task, but it remains unreliable and struggles with self-contradictions that require more nuance and context.
    Abstract In recent times, large language models (LLMs) have shown impressive performance on various document-level tasks such as document classification, summarization, and question-answering. However, research on understanding their capabilities on the task of self-contradictions in long documents has been very limited. In this work, we introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains, varying document lengths, self-contradictions types, and scope. We then analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context. We release the dataset and all the code associated with the experiments.

PEARL: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers

  • paper_url: http://arxiv.org/abs/2311.09180
  • repo_url: None
  • paper_authors: Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jennifer Neville, Tara Safavi
  • for: To improve the quality and efficiency of composition and communication with writing assistants.
  • methods: Augments a large language model writing assistant with a generation-calibrated retriever that selects historic user-authored documents for prompt augmentation, personalizing outputs to the author's communication style and specialized knowledge.
  • results: PEARL generates personalized workplace social media posts and Reddit comments; the generation-calibrated retriever can also double as a performance predictor and further improve low-quality generations via LLM chaining.
    Abstract Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author's communication style and specialized knowledge. In this paper, we address this challenge by proposing PEARL, a retrieval-augmented LLM writing assistant personalized with a generation-calibrated retriever. Our retriever is trained to select historic user-authored documents for prompt augmentation, such that they are likely to best personalize LLM generations for a user request. We propose two key novelties for training our retriever: 1) A training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and 2) A scale-calibrating KL-divergence objective that ensures that our retriever closely tracks the benefit of a document for personalized generation. We demonstrate the effectiveness of PEARL in generating personalized workplace social media posts and Reddit comments. Finally, we showcase the potential of a generation-calibrated retriever to double as a performance predictor and further improve low-quality generations via LLM chaining.

SiRA: Sparse Mixture of Low Rank Adaptation

  • paper_url: http://arxiv.org/abs/2311.09179
  • repo_url: None
  • paper_authors: Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen, Lei Meng
  • for: To adapt large language models to downstream tasks with simple and effective parameter-efficient tuning.
  • methods: Observing that adding more dense trainable parameters to LoRA does not help, the paper proposes SiRA, a sparse mixture of low-rank adaptation that uses a Sparse Mixture of Experts (SMoE) with top-k expert routing under a per-expert capacity limit, plus a novel expert dropout on the gating network to reduce over-fitting.
  • results: Extensive experiments show SiRA performs better than LoRA and other mixture-of-expert approaches across different single-task and multitask settings.
    Abstract Parameter-efficient tuning has been a prominent approach to adapting large language models to downstream tasks. Most previous works consider adding dense trainable parameters, where all parameters are used to adapt a given task. We found this less effective empirically: in the case of LoRA, introducing more trainable parameters does not help. Motivated by this, we investigate the importance of leveraging "sparse" computation and propose SiRA: sparse mixture of low rank adaptation. SiRA leverages the Sparse Mixture of Experts (SMoE) to boost the performance of LoRA. Specifically, it enforces top $k$ expert routing with a capacity limit restricting the maximum number of tokens each expert can process. We propose a novel and simple expert dropout on top of the gating network to reduce the over-fitting issue. Through extensive experiments, we verify SiRA performs better than LoRA and other mixture-of-expert approaches across different single-task and multitask settings.
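Top-k routing with a per-expert capacity limit, the core of the sparse mixture, can be sketched as below (ours, not the authors' code; the greedy assignment order is an assumption):

```python
import torch

def topk_capacity_routing(gate_logits, k=2, capacity=4):
    """Route each token to its top-k experts, but stop filling an expert
    once it holds `capacity` tokens."""
    n_tokens, n_experts = gate_logits.shape
    topk = gate_logits.topk(k, dim=-1).indices          # (n_tokens, k)
    load = [0] * n_experts
    assignments = [[] for _ in range(n_experts)]
    for tok in range(n_tokens):
        for expert in topk[tok].tolist():
            if load[expert] < capacity:
                assignments[expert].append(tok)
                load[expert] += 1
    return assignments  # token indices each low-rank expert will process
```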

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09154
  • repo_url: None
  • paper_authors: Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, Hongyuan Lu
  • for: To genuinely assess the capabilities of large language models (LLMs), which data contamination otherwise distorts, while saving researchers the time and effort of downloading and trying contaminated models.
  • methods: Proposes Clean-Eval, which uses an LLM to paraphrase and back-translate contaminated data into a candidate set of same-meaning expressions with different surface forms, filters low-quality samples with a semantic detector, and selects the best candidate by BLEURT score; the candidates can form a new benchmark.
  • results: Clean-Eval substantially restores the actual evaluation results on contaminated LLMs under both few-shot learning and fine-tuning scenarios.
    Abstract We are currently in an era of fierce competition among various large language models (LLMs), continuously pushing the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging and critical issue due to potential data contamination, and researchers and engineers waste considerable time and effort downloading and trying those contaminated models. To save our precious time, we propose a novel and useful method, Clean-Eval, which mitigates the issue of data contamination and evaluates the LLMs in a cleaner manner. Clean-Eval employs an LLM to paraphrase and back-translate the contaminated data into a candidate set, generating expressions with the same meaning but in different surface forms. A semantic detector is then used to filter out low-quality samples and narrow down this candidate set. The best candidate is finally selected from this set based on the BLEURT score. According to human assessment, this best candidate is semantically similar to the original contaminated data but expressed differently. All candidates can form a new benchmark to evaluate the model. Our experiments illustrate that Clean-Eval substantially restores the actual evaluation results on contaminated LLMs under both few-shot learning and fine-tuning scenarios.
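The pipeline in the abstract can be sketched end to end; `paraphrase`, `keeps_meaning`, and `bleurt_score` are hypothetical stand-ins for the LLM paraphraser/back-translator, the semantic detector, and a BLEURT model.

```python
def clean_eval_candidate(contaminated, paraphrase, keeps_meaning, bleurt_score, n=8):
    """Rewrite a contaminated benchmark item into a fresh surface form."""
    # 1) generate same-meaning variants in different surface forms
    candidates = [paraphrase(contaminated) for _ in range(n)]
    # 2) filter low-quality variants with a semantic detector
    candidates = [c for c in candidates if keeps_meaning(contaminated, c)]
    # 3) keep the variant closest in meaning under BLEURT
    return max(candidates, key=lambda c: bleurt_score(contaminated, c))
```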

Grounding or Guesswork? Large Language Models are Presumptive Grounders

  • paper_url: http://arxiv.org/abs/2311.09144
  • repo_url: None
  • paper_authors: Omar Shaikh, Kristina Gligorić, Ashna Khetan, Matthias Gerstgrasser, Diyi Yang, Dan Jurafsky
  • for: To study whether large language models establish common ground with users in human-AI conversation.
  • methods: Curates a set of grounding acts (such as clarification and acknowledgment) and corresponding metrics that quantify attempted grounding, simulating LLMs taking turns on several dialogue datasets and comparing the results to humans.
  • results: Current LLMs are presumptive grounders, biased toward assuming common ground without using grounding acts; examining instruction tuning and reinforcement learning with human feedback (RLHF) shows that RLHF leads to less grounding.
    Abstract Effective conversation requires common ground: a shared understanding between the participants. Common ground, however, does not emerge spontaneously in conversation. Speakers and listeners work together to both identify and construct a shared basis while avoiding misunderstanding. To accomplish grounding, humans rely on a range of dialogue acts, like clarification (What do you mean?) and acknowledgment (I understand.). In domains like teaching and emotional support, carefully constructing grounding prevents misunderstanding. However, it is unclear whether large language models (LLMs) leverage these dialogue acts in constructing common ground. To this end, we curate a set of grounding acts and propose corresponding metrics that quantify attempted grounding. We study whether LLMs use these grounding acts, simulating them taking turns from several dialogue datasets, and comparing the results to humans. We find that current LLMs are presumptive grounders, biased towards assuming common ground without using grounding acts. To understand the roots of this behavior, we examine the role of instruction tuning and reinforcement learning with human feedback (RLHF), finding that RLHF leads to less grounding. Altogether, our work highlights the need for more research investigating grounding in human-AI interaction.
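As an illustration of what "quantifying attempted grounding" can look like, here is a toy metric that counts turns containing clarification or acknowledgment acts. The keyword heuristics are ours for demonstration; the paper's detection of grounding acts is more sophisticated.

```python
import re

CLARIFY = re.compile(r"\b(what do you mean|could you clarify|do you mean)\b", re.I)
ACK = re.compile(r"\b(i see|i understand|got it|that makes sense)\b", re.I)

def grounding_rate(turns):
    """Fraction of turns that attempt a grounding act."""
    acts = sum(bool(CLARIFY.search(t) or ACK.search(t)) for t in turns)
    return acts / max(len(turns), 1)

print(grounding_rate(["What do you mean by that?", "Here is the answer."]))  # 0.5
```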

RRescue: Ranking LLM Responses to Enhance Reasoning Over Context

  • paper_url: http://arxiv.org/abs/2311.09136
  • repo_url: None
  • paper_authors: Yikun Wang, Rui Zheng, Haoming Li, Qi Zhang, Tao Gui, Fei Liu
  • for: Aims to improve large language models' (LLMs) reasoning over context so they can be better applied to response generation.
  • methods: Proposes optimizing LLMs with ranking metrics over contextually grounded candidate responses, using partial orderings obtained through human annotation, heuristic functions, or model distillation.
  • results: Experiments show improved contextual understanding, including stronger results on a new multi-document question-answering dataset.
    Abstract Effectively using a given context is paramount for large language models. A context window can include task specifications, retrieved documents, previous conversations, and even model self-reflections, functioning similarly to episodic memory. While efforts are being made to expand the context window, studies indicate that LLMs do not use their context optimally for response generation. In this paper, we present a novel approach to optimize LLMs using ranking metrics, which teaches LLMs to rank a collection of contextually-grounded candidate responses. Rather than a traditional full ordering, we advocate for a partial ordering. This is because achieving consensus on the perfect order for system responses can be challenging. Our partial ordering is more robust, less sensitive to noise, and can be acquired through human labelers, heuristic functions, or model distillation. We test our system's improved contextual understanding using the latest benchmarks, including a new multi-document question answering dataset. We conduct ablation studies to understand crucial factors, such as how to gather candidate responses, determine their most suitable order, and balance supervised fine-tuning with ranking metrics. Our approach, named RRescue, suggests a promising avenue for enhancing LLMs' contextual understanding via response ranking.
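A partial ordering can be trained with a pairwise margin loss in which only pairs whose preference is actually known contribute; unlabeled candidates are simply skipped, which is what makes the scheme less sensitive to noisy or missing labels. The sketch below assumes coarse tier labels (smaller tier = better, -1 = unlabeled); the tiering scheme and margin are illustrative, not the paper's exact objective.

```python
import torch

def partial_order_loss(scores, tiers, margin=1.0):
    # scores: (n,) model scores for candidate responses
    # tiers:  length-n list; smaller tier = preferred, -1 = no label
    loss, pairs = scores.new_zeros(()), 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if tiers[i] >= 0 and tiers[j] >= 0 and tiers[i] < tiers[j]:
                loss = loss + torch.clamp(margin - (scores[i] - scores[j]), min=0)
                pairs += 1
    return loss / max(pairs, 1)
```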

Aligning Neural Machine Translation Models: Human Feedback in Training and Inference

  • paper_url: http://arxiv.org/abs/2311.09132
  • repo_url: None
  • paper_authors: Miguel Moura Ramos, Patrick Fernandes, António Farinhas, André F. T. Martins
  • for: Improving the quality of text generated by language models so that it is closer to human-generated text.
  • methods: Uses human feedback to train reward models and integrates them into the machine translation pipeline during training and inference.
  • results: Shows that integrating quality metrics into the MT pipeline improves translation quality, and that combining RL training with reranking techniques yields substantial gains.
    Abstract Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF's success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs. In machine translation (MT), where metrics trained from human annotations can readily be used as reward models, recent methods using minimum Bayes risk decoding and reranking have succeeded in improving the final quality of translation. In this study, we comprehensively explore and compare techniques for integrating quality metrics as reward models into the MT pipeline. This includes using the reward model for data filtering, during the training phase through RL, and at inference time by employing reranking techniques, and we assess the effects of combining these in a unified approach. Our experimental results, conducted across multiple translation tasks, underscore the crucial role of effective data filtering, based on estimated quality, in harnessing the full potential of RL in enhancing MT quality. Furthermore, our findings demonstrate the effectiveness of combining RL training with reranking techniques, showcasing substantial improvements in translation quality.
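Two of the integration points discussed above, quality-based data filtering before training and metric reranking at inference, reduce to a few lines once a learned quality metric is available. In this sketch `quality` stands in for such a metric (e.g., a COMET-style model); the threshold is illustrative.

```python
def filter_parallel_data(pairs, quality, threshold=0.6):
    """Keep only (source, target) pairs whose estimated quality clears the bar."""
    return [(s, t) for s, t in pairs if quality(s, t) >= threshold]

def rerank(source, hypotheses, quality):
    """At inference, keep the hypothesis the metric scores highest."""
    return max(hypotheses, key=lambda h: quality(source, h))
```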

Social Meme-ing: Measuring Linguistic Variation in Memes

  • paper_url: http://arxiv.org/abs/2311.09130
  • repo_url: https://github.com/naitian/semantic-memes
  • paper_authors: Naitian Zhou, David Jurgens, David Bamman
  • for: This paper explores sociolinguistic variation in memes, using a computational pipeline to cluster individual instances of memes into templates and semantic variables.
  • methods: The paper uses a multimodal approach, taking advantage of the visual templates and text in memes to analyze their semantic function.
  • results: The study discovers meaningful social variation in meme usage between subreddits, and patterns of meme innovation and acculturation within these communities align with previous findings on written language.
    Abstract Much work in the space of NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language comprised of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their multimodal structure in doing so. We apply this method to a large collection of meme images from Reddit and make available the resulting \textsc{SemanticMemes} dataset of 3.8M images clustered by their semantic function. We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but that patterns of meme innovation and acculturation within these communities align with previous findings on written language.

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

  • paper_url: http://arxiv.org/abs/2311.09122
  • repo_url: None
  • paper_authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
  • for: To develop an open, community-driven project that creates gold-standard named entity recognition (NER) benchmarks in many languages.
  • methods: Annotates NER datasets across languages with a cross-lingually consistent schema.
  • results: Releases NER datasets in 12 diverse languages along with initial model baselines in both in-language and cross-lingual learning settings.
    Abstract We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

  • paper_url: http://arxiv.org/abs/2311.09117
  • repo_url: None
  • paper_authors: Heng-Jui Chang, James Glass
  • for: Proposes a data-efficient self-supervised fine-tuning framework for speaker- and noise-invariant speech representations.
  • methods: Uses speaker-invariant clustering (Spin) to learn discrete acoustic units and strengthens content representations by predicting acoustic pieces.
  • results: Compared with previous state-of-the-art methods, R-Spin yields better representations in severely distorted speech scenarios while using 12x less computation.
    Abstract This paper introduces Robust Spin (R-Spin), a data-efficient self-supervised fine-tuning framework for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses to show how discrete units contribute to speech encoder training and improving robustness in diverse acoustic environments.

“We Demand Justice!”: Towards Grounding Political Text in Social Context

  • paper_url: http://arxiv.org/abs/2311.09106
  • repo_url: None
  • paper_authors: Rajkumar Pujari, Chengfei Wu, Dan Goldwasser
  • for: Understanding the meaning of ambiguous political statements in a computational setting and grounding them in real-world entities, actions, and attitudes.
  • methods: Introduces two challenging datasets that require understanding the real-world context of the text, plus more structured baselines built on existing 'Discourse Contextualization Framework' and 'Political Actor Representation' models.
  • results: Comparative analysis of the baselines provides deeper insight into the pragmatic language-understanding challenges posed by these social grounding tasks.
    Abstract Social media discourse from US politicians frequently consists of 'seemingly similar language used by opposing sides of the political spectrum'. But often, it translates to starkly contrasting real-world actions. For instance, "We need to keep our students safe from mass shootings" may signal either "arming teachers to stop the shooter" or "banning guns to reduce mass shootings" depending on who says it and their political stance on the issue. In this paper, we define and characterize the context that is required to fully understand such ambiguous statements in a computational setting and ground them in real-world entities, actions, and attitudes. To that end, we propose two challenging datasets that require an understanding of the real-world context of the text to be solved effectively. We benchmark these datasets against baselines built upon large pre-trained models such as BERT, RoBERTa, GPT-3, etc. Additionally, we develop and benchmark more structured baselines building upon existing 'Discourse Contextualization Framework' and 'Political Actor Representation' models. We perform analysis of the datasets and baseline predictions to obtain further insights into the pragmatic language understanding challenges posed by the proposed social grounding tasks.

MAVEN-Arg: Completing the Puzzle of All-in-One Event Understanding Dataset with Event Argument Annotation

  • paper_url: http://arxiv.org/abs/2311.09105
  • repo_url: None
  • paper_authors: Xiaozhi Wang, Hao Peng, Yong Guan, Kaisheng Zeng, Jianhui Chen, Lei Hou, Xu Han, Yankai Lin, Zhiyuan Liu, Ruobing Xie, Jie Zhou, Juanzi Li
  • for: This paper is written for the purpose of introducing a new dataset, MAVEN-Arg, which supports event understanding tasks such as event detection, event argument extraction, and event relation extraction.
  • methods: The paper uses a large-scale dataset, MAVEN-Arg, which is augmented with event argument annotations, to support the development and evaluation of event understanding models.
  • results: The paper reports that MAVEN-Arg is a challenging dataset for both fine-tuned EAE models and proprietary large language models (LLMs), and demonstrates the potential benefits of an all-in-one dataset for future event prediction applications using LLMs.
    Abstract Understanding events in texts is a core objective of natural language understanding, which requires detecting event occurrences, extracting event arguments, and analyzing inter-event relationships. However, due to the annotation challenges brought by task complexity, a large-scale dataset covering the full process of event understanding has long been absent. In this paper, we introduce MAVEN-Arg, which augments MAVEN datasets with event argument annotations, making the first all-in-one dataset supporting event detection, event argument extraction (EAE), and event relation extraction. As an EAE benchmark, MAVEN-Arg offers three main advantages: (1) a comprehensive schema covering 162 event types and 612 argument roles, all with expert-written definitions and examples; (2) a large data scale, containing 98,591 events and 290,613 arguments obtained with laborious human annotation; (3) the exhaustive annotation supporting all task variants of EAE, which annotates both entity and non-entity event arguments in document level. Experiments indicate that MAVEN-Arg is quite challenging for both fine-tuned EAE models and proprietary large language models (LLMs). Furthermore, to demonstrate the benefits of an all-in-one dataset, we preliminarily explore a potential application, future event prediction, with LLMs. MAVEN-Arg and our code can be obtained from https://github.com/THU-KEG/MAVEN-Argument.

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

  • paper_url: http://arxiv.org/abs/2311.09096
  • repo_url: https://github.com/thu-coai/jailbreakdefense_goalpriority
  • paper_authors: Zhexin Zhang, Junxiao Yang, Pei Ke, Minlie Huang
  • for: Proposes a defense against jailbreaking attacks to protect large language models (LLMs).
  • methods: Integrates the idea of goal prioritization (safety before helpfulness) at both the training and inference stages.
  • results: Goal prioritization at inference sharply reduces the attack success rate without hurting general performance, and integrating it during training reduces it further.
    Abstract Large Language Models (LLMs) continue to advance in their capabilities, yet this progress is accompanied by a growing array of safety risks. While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of exploration into defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the inherent conflict between the goals of being helpful and ensuring safety. To counter jailbreaking attacks, we propose to integrate goal prioritization at both training and inference stages. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking attacks, reducing it from 66.4% to 2.0% for ChatGPT and from 68.2% to 19.4% for Vicuna-33B, without compromising general performance. Furthermore, integrating the concept of goal prioritization into the training phase reduces the ASR from 71.0% to 6.6% for LLama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half, decreasing it from 71.0% to 34.0%. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks. We hope our work could contribute to the comprehension of jailbreaking attacks and defenses, and shed light on the relationship between LLMs' capability and safety. Our code will be available at \url{https://github.com/thu-coai/JailbreakDefense_GoalPriority}.
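At inference time, goal prioritization can be realized as a prompt prefix that makes the safety-over-helpfulness ordering explicit. The wording below is our illustration, not the paper's template:

```python
# Hypothetical inference-time goal-prioritization wrapper; the exact phrasing
# of the priority instruction is an assumption for illustration.
PRIORITY_PREFIX = (
    "You are an assistant whose FIRST priority is safety and SECOND priority "
    "is helpfulness. If a request conflicts with safety, refuse and explain "
    "briefly instead of complying.\n\n"
)

def prioritized_prompt(user_query: str) -> str:
    return PRIORITY_PREFIX + "User: " + user_query + "\nAssistant:"
```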

Social Bias Probing: Fairness Benchmarking for Language Models

  • paper_url: http://arxiv.org/abs/2311.09090
  • repo_url: None
  • paper_authors: Marta Marchiori Manerba, Karolina Stańczak, Riccardo Guidotti, Isabelle Augenstein
  • for: Investigates the social biases encoded in language models and proposes a new probing method.
  • methods: Introduces a perplexity-based fairness score and collects a large-scale probing dataset to analyze models' general associations as well as biases along the axes of societal categories, identities, and stereotypes.
  • results: Finds that biases in language models are more nuanced than previously acknowledged, that larger model variants exhibit a higher degree of bias, and that identities expressing different religions lead to the most pronounced disparate treatment across all models.
    Abstract Large language models have been shown to encode a variety of social biases, which carries the risk of downstream harms. While the impact of these biases has been recognized, prior methods for bias evaluation have been limited to binary association tests on small datasets, offering a constrained view of the nature of societal biases within language models. In this paper, we propose an original framework for probing language models for societal biases. We collect a probing dataset to analyze language models' general associations, as well as along the axes of societal categories, identities, and stereotypes. To this end, we leverage a novel perplexity-based fairness score. We curate a large-scale benchmarking dataset addressing drawbacks and limitations of existing fairness collections, expanding to a variety of different identities and stereotypes. When comparing our methodology with prior work, we demonstrate that biases within language models are more nuanced than previously acknowledged. In agreement with recent findings, we find that larger model variants exhibit a higher degree of bias. Moreover, we expose how identities expressing different religions lead to the most pronounced disparate treatments across all models.
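One way to realize a perplexity-based fairness probe is to compare a model's perplexity on a stereotype template instantiated with different identities. The sketch below assumes a Hugging Face-style causal LM interface; the disparity ratio is our simple illustration, not the paper's exact score definition.

```python
import math
import torch

def sentence_ppl(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean token negative log-likelihood
    return math.exp(loss.item())

def disparity(model, tokenizer, template, identities):
    """Ratio of the largest to smallest perplexity across identity substitutions."""
    ppls = {i: sentence_ppl(model, tokenizer, template.format(identity=i))
            for i in identities}
    return max(ppls.values()) / min(ppls.values()), ppls
```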

Identifying Self-Disclosures of Use, Misuse and Addiction in Community-based Social Media Posts

  • paper_url: http://arxiv.org/abs/2311.09066
  • repo_url: None
  • paper_authors: Chenghao Yang, Tuhin Chakrabarty, Karli R Hochstatter, Melissa N Slavin, Nabila El-Bassel, Smaranda Muresan
  • For: The paper aims to develop a tool to identify at-risk patients with opioid use disorder by analyzing community-based social media platforms like Reddit.
  • Methods: The authors use a corpus of 2500 opioid-related posts from various subreddits to annotate span-level extractive explanations and evaluate several state-of-the-art models in supervised, few-shot, and zero-shot settings.
  • Results: The authors find that using explanations during modeling leads to a significant boost in classification accuracy, demonstrating their beneficial role in a high-stakes domain such as studying the opioid use disorder continuum.
    Abstract In the last decade, the United States has lost more than 500,000 people from an overdose involving prescription and illicit opioids (https://www.cdc.gov/drugoverdose/epidemic/index.html) making it a national public health emergency (USDHHS, 2017). To more effectively prevent unintentional opioid overdoses, medical practitioners require robust and timely tools that can effectively identify at-risk patients. Community-based social media platforms such as Reddit allow self-disclosure for users to discuss otherwise sensitive drug-related behaviors, often acting as indicators for opioid use disorder. Towards this, we present a moderate size corpus of 2500 opioid-related posts from various subreddits spanning 6 different phases of opioid use: Medical Use, Misuse, Addiction, Recovery, Relapse, Not Using. For every post, we annotate span-level extractive explanations and crucially study their role both in annotation quality and model development. We evaluate several state-of-the-art models in a supervised, few-shot, or zero-shot setting. Experimental results and error analysis show that identifying the phases of opioid use disorder is highly contextual and challenging. However, we find that using explanations during modeling leads to a significant boost in classification accuracy demonstrating their beneficial role in a high-stakes domain such as studying the opioid use disorder continuum. The dataset will be made available for research on Github in the formal version.

Do Localization Methods Actually Localize Memorized Data in LLMs?

  • paper_url: http://arxiv.org/abs/2311.09060
  • repo_url: None
  • paper_authors: Ting-Yun Chang, Jesse Thomason, Robin Jia
  • for: To determine whether a small set of neurons in LLMs is responsible for memorizing a given sequence.
  • methods: Evaluates localization methods with two benchmarks: the INJ Benchmark injects new information into a small set of weights and tests whether localization methods identify them, while the DEL Benchmark tests whether dropping out located neurons erases a memorized sequence.
  • results: All five evaluated localization methods show promising ability on both benchmarks, especially pruning-based methods, although the identified neurons are not necessarily specific to a single memorized sequence.
    Abstract Large language models (LLMs) can memorize many pretrained sequences verbatim. This paper studies if we can locate a small set of neurons in LLMs responsible for memorizing a given sequence. While the concept of localization is often mentioned in prior work, methods for localization have never been systematically and directly evaluated; we address this with two benchmarking approaches. In our INJ Benchmark, we actively inject a piece of new information into a small subset of LLM weights and measure whether localization methods can identify these "ground truth" weights. In the DEL Benchmark, we study localization of pretrained data that LLMs have already memorized; while this setting lacks ground truth, we can still evaluate localization by measuring whether dropping out located neurons erases a memorized sequence from the model. We evaluate five localization methods on our two benchmarks, and both show similar rankings. All methods exhibit promising localization ability, especially for pruning-based methods, though the neurons they identify are not necessarily specific to a single memorized sequence.
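A DEL-style check can be sketched as: zero out the located FFN neurons and test whether greedy decoding still reproduces the memorized continuation. The ablation mechanism and the greedy-match criterion are our simplifications of the paper's setup, assuming a Hugging Face-style model.

```python
import torch

def still_memorized(model, prompt_ids, target_ids, located, layer_ffn):
    """located: indices of FFN neurons (rows of an nn.Linear) to ablate.
    NOTE: this mutates the weights; in practice, ablate a copy of the model."""
    with torch.no_grad():
        layer_ffn.weight[located] = 0.0            # crude dropout of located neurons
        out = model.generate(prompt_ids,
                             max_new_tokens=target_ids.size(1),
                             do_sample=False)      # greedy decoding
    return torch.equal(out[:, -target_ids.size(1):], target_ids)
```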

GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language models

  • paper_url: http://arxiv.org/abs/2311.09048
  • repo_url: None
  • paper_authors: Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, Elia Bruni
  • for: Evaluating the language grounding and physical understanding capabilities of video-based multimodal language models.
  • methods: A two-tier evaluation built on Unity simulations, covering language grounding and intuitive physics.
  • results: Current multimodal language models show significant shortcomings in language grounding and intuitive physics; the GRASP benchmark can help monitor the progress of future models.
    Abstract This paper presents GRASP, a novel benchmark to evaluate the language grounding and physical understanding capabilities of video-based multimodal large language models (LLMs). This evaluation is accomplished via a two-tier approach leveraging Unity simulations. The initial level tests for language grounding by assessing a model's ability to relate simple textual descriptions with visual information. The second level evaluates the model's understanding of 'Intuitive Physics' principles, such as object permanence and continuity. In addition to releasing the benchmark, we use it to evaluate several state-of-the-art multimodal LLMs. Our evaluation reveals significant shortcomings in current models' language grounding and intuitive physics. These identified limitations underline the importance of benchmarks like GRASP to monitor the progress of future models in developing these competencies.

Exploring the Potential of Large Language Models in Computational Argumentation

  • paper_url: http://arxiv.org/abs/2311.09022
  • repo_url: https://github.com/damo-nlp-sg/llm-argumentation
  • paper_authors: Guizhen Chen, Liying Cheng, Luu Anh Tuan, Lidong Bing
  • for: To assess the performance of large language models (LLMs) on computational argumentation tasks in zero-shot and few-shot settings.
  • methods: Organizes tasks spanning argument mining and argument generation, standardizes 14 open-sourced datasets, and contributes a new counter-speech generation benchmark for end-to-end evaluation.
  • results: Experiments show that LLMs perform commendably on most tasks, demonstrating their capabilities in computational argumentation, while also exposing limitations of current evaluation and suggesting future research directions.
    Abstract Computational argumentation has become an essential tool in various fields, including artificial intelligence, law, and public policy. It is an emerging research field in natural language processing (NLP) that attracts increasing attention. Research on computational argumentation mainly involves two types of tasks: argument mining and argument generation. As large language models (LLMs) have demonstrated strong abilities in understanding context and generating natural language, it is worthwhile to evaluate the performance of LLMs on various computational argumentation tasks. This work aims to embark on an assessment of LLMs, such as ChatGPT, Flan models and LLaMA2 models, under zero-shot and few-shot settings within the realm of computational argumentation. We organize existing tasks into 6 main classes and standardise the format of 14 open-sourced datasets. In addition, we present a new benchmark dataset on counter speech generation, that aims to holistically evaluate the end-to-end performance of LLMs on argument mining and argument generation. Extensive experiments show that LLMs exhibit commendable performance across most of these datasets, demonstrating their capabilities in the field of argumentation. We also highlight the limitations in evaluating computational argumentation and provide suggestions for future research directions in this field.

End-to-end Task-oriented Dialogue: A Survey of Tasks, Methods, and Future Directions

  • paper_url: http://arxiv.org/abs/2311.09008
  • repo_url: None
  • paper_authors: Libo Qin, Wenbo Pan, Qiguang Chen, Lizi Liao, Zhou Yu, Yue Zhang, Wanxiang Che, Min Li
  • for: Provides a systematic survey of end-to-end task-oriented dialogue (EToD) research, covering existing methods and recent trends.
  • methods: Reviews progress driven by deep neural network models, especially large pre-trained models, under a unified taxonomy.
  • results: Summarizes new trends and frontier areas in EToD research and provides a public website (https://etods.net/) giving researchers direct access to recent advances.
    Abstract End-to-end task-oriented dialogue (EToD) can directly generate responses in an end-to-end fashion without modular training, which attracts escalating popularity. The advancement of deep neural networks, especially the successful use of large pre-trained models, has further led to significant progress in EToD research in recent years. In this paper, we present a thorough review and provide a unified perspective to summarize existing approaches as well as recent trends to advance the development of EToD research. The contributions of this paper can be summarized as follows: (1) First survey: to our knowledge, we take the first step to present a thorough survey of this research field; (2) New taxonomy: we introduce a unified perspective on EToD, covering (i) Modularly EToD and (ii) Fully EToD; (3) New frontiers: we discuss some potential frontier areas as well as the corresponding challenges, hoping to spur breakthrough research in the EToD field; (4) Abundant resources: we build a public website (https://etods.net/) that collects related papers, baseline projects, and leaderboards, where EToD researchers can directly access recent progress. We hope this work can serve as a thorough reference for the EToD research community.

Data Similarity is Not Enough to Explain Language Model Performance

  • paper_url: http://arxiv.org/abs/2311.09006
  • repo_url: https://github.com/gyauney/data-similarity-is-not-enough
  • paper_authors: Gregory Yauney, Emily Reif, David Mimno
  • for: Investigates what explains language models' high performance on some downstream tasks but not others.
  • methods: Measures several distributional and example-specific similarity metrics (embedding-, token-, and model-based) between pretraining corpora and downstream benchmarks.
  • results: Similarity correlates with performance on multilingual datasets, but on other benchmarks the similarity metrics correlate neither with accuracy nor with each other, suggesting the relationship between pretraining data and downstream tasks is more complex than often assumed.
    Abstract Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.

Factcheck-GPT: End-to-End Fine-Grained Document-Level Fact-Checking and Correction of LLM Output

  • paper_url: http://arxiv.org/abs/2311.09000
  • repo_url: https://github.com/yuxiaw/factcheck-gpt
  • paper_authors: Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov
  • for: To provide an annotation scheme covering all stages of verifying the factuality of responses generated by large language models (LLMs).
  • methods: A multi-stage annotation scheme that lets raters assign fine-grained labels for the verifiability and factual inconsistencies of LLM outputs, together with an annotation tool that speeds up labelling and can flexibly incorporate automatic results such as retrieved evidence.
  • results: Preliminary experiments show that tools such as FacTool, FactScore, and Perplexity.ai struggle to identify false claims, with a best F1 of 0.53; the paper releases an open-domain document-level factuality benchmark along with the annotation tool and code.
    Abstract The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We design and build an annotation tool to speed up the labelling procedure and ease the workload of raters. It allows flexible incorporation of automatic results in any stage, e.g. automatically-retrieved evidence. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims with the best F1=0.53. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

SentAlign: Accurate and Scalable Sentence Alignment

  • paper_url: http://arxiv.org/abs/2311.08982
  • repo_url: https://github.com/steinst/sentalign
  • paper_authors: Steinþór Steingrímsson, Hrafn Loftsson, Andy Way
  • for: Presents a high-accuracy sentence alignment tool designed to handle very large parallel document pairs.
  • methods: Given user-defined parameters, the algorithm evaluates all possible alignment paths and uses a divide-and-conquer approach to align documents with tens of thousands of sentences, scoring candidates with LaBSE bilingual sentence representations.
  • results: SentAlign outperforms five other aligners on German-French and English-Icelandic evaluation sets and on a downstream machine translation task.
    Abstract We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.
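The core of evaluating alignment paths is a dynamic program over the two sentence sequences. The sketch below supports only 1-1, 1-0, and 0-1 links and takes an assumed similarity callable (e.g., LaBSE cosine similarity); SentAlign's full scoring function and divide-and-conquer step are omitted.

```python
def align(src, tgt, sim, skip_penalty=-0.2):
    """DP alignment returning the list of 1-1 sentence pairs (i, j)."""
    n, m = len(src), len(tgt)
    score = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == float("-inf"):
                continue
            moves = [
                (1, 1, sim(src[i], tgt[j]) if i < n and j < m else None),  # match
                (1, 0, skip_penalty if i < n else None),                   # skip src
                (0, 1, skip_penalty if j < m else None),                   # skip tgt
            ]
            for di, dj, s in moves:
                if s is not None and score[i][j] + s > score[i + di][j + dj]:
                    score[i + di][j + dj] = score[i][j] + s
                    back[i + di][j + dj] = (i, j)
    pairs, (i, j) = [], (n, m)
    while back[i][j] is not None:               # backtrack from the end state
        pi, pj = back[i][j]
        if i - pi == 1 and j - pj == 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```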

Speculative Contrastive Decoding

  • paper_url: http://arxiv.org/abs/2311.08981
  • repo_url: None
  • paper_authors: Hongyi Yuan, Keming Lu, Fei Huang, Zheng Yuan, Chang Zhou
  • for: Improving both the inference quality and the speed of large language models (LLMs).
  • methods: Uses amateur models to draft the expert model's generations and exploits the natural contrast between expert and amateur predictions to improve outputs.
  • results: Experiments on four benchmarks show that Speculative Contrastive Decoding (SCD) achieves acceleration factors similar to speculative decoding while further improving generation quality.
    Abstract Large language models (LLMs) have shown extraordinary performance in various language tasks, but high computational requirements hinder their widespread deployment. Speculative decoding, which uses amateur models to predict the generation of expert models, has been proposed as a way to accelerate LLM inference. However, speculative decoding focuses on acceleration instead of making the best use of the token distribution from amateur models. We proposed Speculative Contrastive Decoding (SCD), an accelerated decoding method leveraging the natural contrast between expert and amateur models in speculative decoding. Comprehensive evaluations on four benchmarks show that SCD can achieve similar acceleration factors as speculative decoding while further improving the generation quality as the contrastive decoding. The analysis of token probabilities further demonstrates the compatibility between speculative and contrastive decoding. Overall, SCD provides an effective approach to enhance the decoding quality of LLMs while saving computational resources.
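The contrastive half of SCD can be sketched as follows: restrict attention to tokens the expert finds plausible, then score by the difference between expert and amateur logits. The `alpha` cutoff is the standard adaptive plausibility constraint from contrastive decoding; how these scores interact with speculative draft-and-verify is simplified away here.

```python
import torch

def contrastive_logits(expert_logits, amateur_logits, alpha=0.1):
    """Contrast expert and amateur next-token logits over plausible tokens."""
    probs = expert_logits.softmax(-1)
    implausible = probs < alpha * probs.max(-1, keepdim=True).values
    scores = expert_logits - amateur_logits       # penalize amateur-favored tokens
    return scores.masked_fill(implausible, float("-inf"))
```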

Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

  • paper_url: http://arxiv.org/abs/2311.08966
  • repo_url: None
  • paper_authors: Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
  • for: Improving the recognition of rare words and contextual entities in streaming automatic speech recognition (ASR).
  • methods: Combines phoneme and textual information of rare words to distinguish words with similar pronunciation or spelling, and additionally trains with text-only data containing more rare words.
  • results: On the LibriSpeech corpus, the proposed method achieves state-of-the-art rare word error rates across different scales and levels of bias lists.
    Abstract Deep biasing for the Transducer can improve the recognition performance of rare words or contextual entities, which is essential in practical applications, especially for streaming Automatic Speech Recognition (ASR). However, deep biasing with large-scale rare words remains challenging, as the performance drops significantly when more distractors exist and there are words with similar grapheme sequences in the bias list. In this paper, we combine the phoneme and textual information of rare words in Transducers to distinguish words with similar pronunciation or spelling. Moreover, the introduction of training with text-only data containing more rare words benefits large-scale deep biasing. The experiments on the LibriSpeech corpus demonstrate that the proposed method achieves state-of-the-art performance on rare word error rate for different scales and levels of bias lists.

Self-Improving for Zero-Shot Named Entity Recognition with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08921
  • repo_url: None
  • paper_authors: Tingyu Xie, Qi Li, Yan Zhang, Zuozhu Liu, Hongwei Wang
  • for: investigate the possibilities of pushing the boundary of zero-shot NER with LLM via a training-free self-improving strategy.
  • methods: utilize an unlabeled corpus to stimulate the self-learning ability of LLMs on NER, and explore various strategies to select reliable samples from the self-annotated dataset as demonstrations.
  • results: Achieves a clear performance improvement, with room left for further gains via more advanced strategies for reliable entity selection.
    Abstract Exploring the application of powerful large language models (LLMs) on the fundamental named entity recognition (NER) task has drawn much attention recently. This work aims to investigate the possibilities of pushing the boundary of zero-shot NER with LLM via a training-free self-improving strategy. We propose a self-improving framework, which utilize an unlabeled corpus to stimulate the self-learning ability of LLMs on NER. First, we use LLM to make predictions on the unlabeled corpus and obtain the self-annotated data. Second, we explore various strategies to select reliable samples from the self-annotated dataset as demonstrations, considering the similarity, diversity and reliability of demonstrations. Finally, we conduct inference for the test query via in-context learning with the selected self-annotated demonstrations. Through comprehensive experimental analysis, our study yielded the following findings: (1) The self-improving framework further pushes the boundary of zero-shot NER with LLMs, and achieves an obvious performance improvement; (2) Iterative self-improving or naively increasing the size of unlabeled corpus does not guarantee improvements; (3) There might still be space for improvement via more advanced strategy for reliable entity selection.
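One reliability heuristic for the self-annotated demonstrations is agreement across repeated sampled annotations: an entity is kept only if the LLM predicts it consistently. This is one plausible instance of the "reliable sample selection" the abstract mentions, not the paper's full set of strategies.

```python
from collections import Counter

def reliable_entities(sampled_annotations, min_agreement=0.8):
    """sampled_annotations: list of sets of (span, type) tuples from repeated
    LLM annotation runs on the same document."""
    counts = Counter(e for ann in sampled_annotations for e in ann)
    n = len(sampled_annotations)
    return {e for e, c in counts.items() if c / n >= min_agreement}
```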

HELLaMA: LLaMA-based Table to Text Generation by Highlighting the Important Evidence

  • paper_url: http://arxiv.org/abs/2311.08896
  • repo_url: None
  • paper_authors: Junyi Bian, Xiaolei Qin, Wuhe Zou, Mengzuo Huang, Weidong Zhang
  • for: Proposes a large-language-model-based table-to-text method that improves performance on table-to-text generation tasks.
  • methods: The model consists of two modules: a table reasoner that identifies relevant row evidence and a table summarizer that generates text from the highlighted table; a search strategy is also proposed to construct reasoning labels for training the reasoner.
  • results: The approach achieves state-of-the-art results on the FetaQA and QTSumm datasets, and highlighting input tables significantly improves performance while providing valuable interpretability.
    Abstract Large models have demonstrated significant progress across various domains, particularly in tasks related to text generation. In the domain of Table to Text, many Large Language Model (LLM)-based methods currently resort to modifying prompts to invoke public APIs, incurring potential costs and information leaks. With the advent of open-source large models, fine-tuning LLMs has become feasible. In this study, we conducted parameter-efficient fine-tuning on the LLaMA2 model. Distinguishing itself from previous fine-tuning-based table-to-text methods, our approach involves injecting reasoning information into the input by emphasizing table-specific row data. Our model consists of two modules: 1) a table reasoner that identifies relevant row evidence, and 2) a table summarizer that generates sentences based on the highlighted table. To facilitate this, we propose a search strategy to construct reasoning labels for training the table reasoner. On both the FetaQA and QTSumm datasets, our approach achieved state-of-the-art results. Additionally, we observed that highlighting input tables significantly enhances the model's performance and provides valuable interpretability.

  • paper_url: http://arxiv.org/abs/2311.08890
  • repo_url: None
  • paper_authors: Thanmay Jayakumar, Fauzan Farooqui, Luqman Farooqui
  • for: To evaluate how general-purpose language models perform in the legal domain compared with models built for it.
  • methods: Evaluates three general-purpose LLMs (ChatGPT-20b, LLaMA-2-70b, and Falcon-180b) zero-shot on the LEDGAR subset of the LexGLUE benchmark for contract provision classification.
  • results: The general LLMs classify the topic correctly in most cases, but their mic-F1/mac-F1 is up to 19.2/26.8% lower than smaller models fine-tuned on the legal domain, underscoring the need for stronger legal-domain language models.
    Abstract Realizing the recent advances in Natural Language Processing (NLP) to the legal sector poses challenging problems such as extremely long sequence lengths, specialized vocabulary that is usually only understood by legal professionals, and high amounts of data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain due to their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has displayed extremely promising results on various tasks. In this study, we aim to quantify how general LLMs perform in comparison to legal-domain models (be it an LLM or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-20b, LLaMA-2-70b, and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme correctly in most cases. However, we find that their mic-F1/mac-F1 performance is up to 19.2/26.8\% lesser than smaller models fine-tuned on the legal domain, thus underscoring the need for more powerful legal-domain LLMs.

CLIMB: Curriculum Learning for Infant-inspired Model Building

  • paper_url: http://arxiv.org/abs/2311.08886
  • repo_url: None
  • paper_authors: Richard Diehl Martinez, Zebulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, Paula Buttery, Lisa Beinborn
  • for: To improve language model performance under the BabyLM data constraints and investigate the effect of cognitively motivated curriculum learning.
  • methods: Three cognitively motivated curricula: a vocabulary curriculum, a data curriculum, and an objective curriculum.
  • results: The curricula yield limited and inconsistent improvements across linguistic benchmarks, with marginal gains on select tasks; careful choice of model architecture and training hyper-parameters gives substantial improvements over the default baselines.
    Abstract We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabulary curriculum, we analyze methods for constraining the vocabulary in the early stages of training to simulate cognitively more plausible learning curves. In the data curriculum experiments, we vary the order of the training instances based on i) infant-inspired expectations and ii) the learning behavior of the model. In the objective curriculum, we explore different variations of combining the conventional masked language modeling task with a more coarse-grained word class prediction task to reinforce linguistic generalization capabilities. Our results did not yield consistent improvements over our own non-curriculum learning baseline across a range of linguistic benchmarks; however, we do find marginal gains on select tasks. Our analysis highlights key takeaways for specific combinations of tasks and settings which benefit from our proposed curricula. We moreover determine that careful selection of model architecture, and training hyper-parameters yield substantial improvements over the default baselines provided by the BabyLM challenge.

Enabling Large Language Models to Learn from Rules

  • paper_url: http://arxiv.org/abs/2311.08883
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Wenkai Yang, Yankai Lin, Jie Zhou, Jirong Wen
  • for: Explores whether rules, rather than examples alone, can be used to teach large language models (LLMs) new tasks or knowledge.
  • methods: Proposes rule distillation: first use the LLM's strong in-context abilities to extract knowledge from textual rules, then explicitly encode that knowledge into the LLM's parameters by learning from the in-context signals produced inside the model.
  • results: Experiments show that learning from rules in this way is far more efficient than example-based learning in both sample size and generalization ability.
    Abstract Large language models (LLMs) have shown incredible performance in completing various real-world tasks. The current knowledge-learning paradigm of LLMs is mainly based on learning from examples, in which LLMs learn the internal rule implicitly from a certain number of supervised examples. However, this paradigm may not learn complicated rules well, especially when the training examples are limited. We are inspired by the fact that humans can learn new tasks or knowledge in another way: from rules. That is, humans can grasp new tasks or knowledge quickly and generalize well given only a detailed rule and a few optional examples. Therefore, in this paper, we explore the feasibility of this new learning paradigm, which encodes rule-based knowledge into LLMs. We propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from textual rules, and then explicitly encodes the knowledge into the LLMs' parameters by learning from the in-context signals produced inside the model. Our experiments show that making LLMs learn from rules with our method is much more efficient than example-based learning in both sample size and generalization ability.
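The first step of the recipe can be sketched as below; `llm_generate` is a hypothetical text-generation callable, and we simplify the paper's internal in-context signals down to plain input-output pairs for illustration.

```python
from typing import Callable, Iterable, List, Tuple

def build_distillation_set(
    llm_generate: Callable[[str], str],   # hypothetical LLM generation call
    rule: str,
    inputs: Iterable[str],
) -> List[Tuple[str, str]]:
    # Step 1: elicit outputs WITH the rule in context, letting the model's
    # in-context ability "read" the rule for us.
    return [(x, llm_generate(f"Rule: {rule}\nInput: {x}\nOutput:")) for x in inputs]

# Step 2 (not shown): supervised fine-tuning on these (input, output) pairs with
# the rule text removed from the prompt, so the rule is encoded into parameters.
```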

Llamas Know What GPTs Don’t Show: Surrogate Models for Confidence Estimation

  • paper_url: http://arxiv.org/abs/2311.08877
  • repo_url: None
  • paper_authors: Vaishnavi Shrivastava, Percy Liang, Ananya Kumar
  • for: Improving the reliability of LLMs by having them express calibrated confidence in their question-answering responses.
  • methods: Elicits linguistic confidences from the model itself and additionally uses a surrogate confidence model, whose probabilities are used to estimate the original model's confidence.
  • results: Composing linguistic confidences with surrogate model probabilities yields state-of-the-art confidence estimates (84.6% average AUC on GPT-4 across 12 datasets).
    Abstract To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach of estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not provide access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- using a model where we do have probabilities to evaluate the original model's confidence in a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).
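The composition of the two signals can be as simple as a weighted average, used for selective answering. The mixing weight and abstention threshold below are illustrative, not the paper's tuned values.

```python
def composed_confidence(linguistic: float, surrogate_prob: float, w: float = 0.5) -> float:
    """linguistic: self-reported confidence in [0, 1];
    surrogate_prob: surrogate model's probability of the main model's answer."""
    return w * linguistic + (1 - w) * surrogate_prob

def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.7):
    return answer if confidence >= threshold else None   # abstain below threshold
```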

OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

  • paper_url: http://arxiv.org/abs/2311.08849
  • repo_url: None
  • paper_authors: Yihong Liu, Peiqin Lin, Mingyang Wang, Hinrich Schütze
  • for: Proposes an efficient method for adapting pretrained language models to many languages, improving the efficiency and feasibility of multilingual continued pretraining.
  • methods: The OFA framework wisely initializes the embeddings of unseen target-language subwords by injecting alignment knowledge from external well-aligned multilingual word embeddings, and applies matrix factorization to replace the high-dimensional embedding matrix with two lower-dimensional matrices, reducing the number of parameters.
  • results: Extensive experiments show that models initialized by OFA adapt efficiently, outperform several baselines on downstream tasks, accelerate the convergence of continued pretraining, and improve zero-shot cross-lingual transfer.
    Abstract Pretraining multilingual language models from scratch requires considerable computational resources and substantial training data. Therefore, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the language model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords from target languages and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual word embeddings and injects the alignment knowledge into the new embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which significantly reduces the number of parameters while not sacrificing the performance. Through extensive experiments, we show models initialized by OFA are efficient and outperform several baselines. OFA not only accelerates the convergence of continued pretraining, which is friendly to a limited computation budget, but also improves the zero-shot cross-lingual transfer on a wide range of downstream tasks. We make our code and models publicly available.
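The two ingredients named above, similarity-weighted initialization from external aligned word vectors and low-rank factorization of the embedding matrix, can be sketched in NumPy. The softmax weighting and variable names are our illustration (the external vectors are assumed unit-normalized), not OFA's exact formulation.

```python
import numpy as np

def init_unseen(source_emb, ext_src, ext_tgt, top_k=10):
    """source_emb: (V_src, d) PLM embeddings; ext_src/ext_tgt: external aligned
    word vectors (assumed unit-normalized). Returns (V_tgt, d) initializations."""
    sims = ext_tgt @ ext_src.T                               # (V_tgt, V_src)
    idx = np.argsort(-sims, axis=1)[:, :top_k]               # nearest source words
    w = np.take_along_axis(sims, idx, axis=1)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)     # softmax weights
    return (w[..., None] * source_emb[idx]).sum(axis=1)      # weighted average

def factorize(emb, rank=128):
    """Replace the embedding matrix with two low-rank factors: emb ~ P @ Q."""
    u, s, vt = np.linalg.svd(emb, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]
```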

Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder

  • paper_url: http://arxiv.org/abs/2311.08844
  • repo_url: None
  • paper_authors: Abdelrahman Mohamed, Fakhraddin Alwajih, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed
  • for: This work aims to advance Arabic image captioning, an area that remains underrepresented due to the lack of labeled data and powerful Arabic generative models.
  • methods: The model combines a vision encoder with a Gemini text decoder, allowing fusion between the vision and language components. The authors also introduce a new method for automatically acquiring training data from available English datasets.
  • results: \textit{Violet} performs sizeably better than the baselines on all evaluation datasets, e.g., reaching a CIDEr score of 61.2 on the manually annotated dataset and improving by 13 points on Flickr8k.
    Abstract Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, although the native language of more than 400 million people, remains largely underrepresented in this area. This is due to the lack of labeled data and powerful Arabic generative models. We alleviate this issue by presenting a novel vision-language model dedicated to Arabic, dubbed \textit{Violet}. Our model is based on a vision encoder and a Gemini text decoder that maintains generation fluency while allowing fusion between the vision and language components. To train our model, we introduce a new method for automatically acquiring data from available English datasets. We also manually prepare a new dataset for evaluation. \textit{Violet} performs sizeably better than our baselines on all of our evaluation datasets. For example, it reaches a CIDEr score of $61.2$ on our manually annotated dataset and achieves an improvement of $13$ points on Flickr8k.

Disinformation Capabilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08838
  • repo_url: https://github.com/kinit-sk/disinformation-capabilities
  • paper_authors: Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, Maria Bielikova
  • for: This study examines the capability of current large language models (LLMs) to generate disinformation and the potential consequences of that capability for democratic societies.
  • methods: The study evaluates 10 LLMs against 20 disinformation narratives, measuring how well they generate news articles, how strongly they agree or disagree with the narratives, and how often they produce safety warnings.
  • results: LLMs can generate convincing news articles and often agree with dangerous disinformation narratives; detection models, in turn, can identify these articles as LLM-generated with good accuracy.
    Abstract Automated disinformation generation is often listed as one of the risks of large language models (LLMs). The theoretical ability to flood the information space with disinformation content might have dramatic consequences for democratic societies around the world. This paper presents a comprehensive study of the disinformation capabilities of the current generation of LLMs to generate false news articles in English language. In our study, we evaluated the capabilities of 10 LLMs using 20 disinformation narratives. We evaluated several aspects of the LLMs: how well they are at generating news articles, how strongly they tend to agree or disagree with the disinformation narratives, how often they generate safety warnings, etc. We also evaluated the abilities of detection models to detect these articles as LLM-generated. We conclude that LLMs are able to generate convincing news articles that agree with dangerous disinformation narratives.

StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving

  • paper_url: http://arxiv.org/abs/2311.08803
  • repo_url: None
  • paper_authors: Chang Gao, Haiyun Jiang, Deng Cai, Shuming Shi, Wai Lam
  • for: To improve the generalizability and task-level consistency of chain-of-thought (CoT) prompting, which existing methods lack because they rely on instance-specific solutions.
  • methods: The paper proposes StrategyLLM, a comprehensive framework with four LLM-based agents (strategy generator, executor, optimizer, and evaluator) that automatically generate, evaluate, and select promising strategies for a given task.
  • results: Without human involvement, StrategyLLM outperforms the baseline CoT-SC on 13 datasets across 4 challenging tasks, improving math reasoning from 39.2% to 43.3%, commonsense reasoning from 70.3% to 72.5%, algorithmic reasoning from 51.7% to 62.0%, and symbolic reasoning from 30.0% to 79.2%.
    Abstract Most existing chain-of-thought (CoT) prompting methods suffer from the issues of generalizability and consistency, as they often rely on instance-specific solutions that may not be applicable to other cases and lack task-level consistency in their reasoning steps. To address these limitations, we propose a comprehensive framework, StrategyLLM, harnessing the capabilities of LLMs to tackle various tasks. The framework improves generalizability by formulating general problem-solving strategies and enhances consistency by producing consistent solutions using these strategies. StrategyLLM employs four LLM-based agents: strategy generator, executor, optimizer, and evaluator, working together to generate, evaluate, and select promising strategies for a given task automatically. The experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (39.2% $\rightarrow$ 43.3%), commonsense reasoning (70.3% $\rightarrow$ 72.5%), algorithmic reasoning (51.7% $\rightarrow$ 62.0%), and symbolic reasoning (30.0% $\rightarrow$ 79.2%).
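
The four-agent loop can be sketched schematically. In the toy code below, `llm` is a stand-in for any chat-completion call, and every prompt string is an illustrative assumption rather than one of the paper's templates:

```python
def llm(prompt: str) -> str:
    return "42"  # toy stand-in; replace with a real LLM client

def strategy_llm(task_desc, examples, n_strategies=3, threshold=0.8):
    # strategy generator: propose several general problem-solving strategies
    strategies = [llm(f"Propose strategy #{i} for: {task_desc}")
                  for i in range(n_strategies)]
    scored = []
    for strat in strategies:
        # strategy executor: apply each strategy to validation examples
        preds = [llm(f"Strategy:\n{strat}\nSolve:\n{ex['q']}") for ex in examples]
        acc = sum(p.strip() == ex["a"] for p, ex in zip(preds, examples)) / len(examples)
        if acc < threshold:
            # strategy optimizer: revise strategies that score poorly
            strat = llm(f"Improve this strategy:\n{strat}")
        scored.append((acc, strat))
    # strategy evaluator: keep the most promising strategy for inference
    return max(scored, key=lambda t: t[0])

print(strategy_llm("grade-school math", [{"q": "40 + 2 = ?", "a": "42"}]))
```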

German FinBERT: A German Pre-trained Language Model

  • paper_url: http://arxiv.org/abs/2311.08793
  • repo_url: None
  • paper_authors: Moritz Scherrmann
  • for: This study develops German FinBERT, a German language model tailored to financial textual data.
  • methods: The model undergoes a comprehensive pre-training process on a substantial corpus of financial reports, ad-hoc announcements, and news related to German companies.
  • results: German FinBERT performs strongly on downstream tasks, especially on finance-specific data, indicating that it captures domain-specific nuances.
    Abstract This study presents German FinBERT, a novel pre-trained German language model tailored for financial textual data. The model is trained through a comprehensive pre-training process, leveraging a substantial corpus comprising financial reports, ad-hoc announcements and news related to German companies. The corpus size is comparable to the data sets commonly used for training standard BERT models. I evaluate the performance of German FinBERT on downstream tasks, specifically sentiment prediction, topic recognition and question answering against generic German language models. My results demonstrate improved performance on finance-specific data, indicating the efficacy of German FinBERT in capturing domain-specific nuances. The presented findings suggest that German FinBERT holds promise as a valuable tool for financial text analysis, potentially benefiting various applications in the financial domain.

Accelerating Toeplitz Neural Network with Constant-time Inference Complexity

  • paper_url: http://arxiv.org/abs/2311.08756
  • repo_url: https://github.com/opennlplab/etsc-exact-toeplitz-to-ssm-conversion
  • paper_authors: Zhen Qin, Yiran Zhong
  • for: This paper aims to convert Toeplitz neural networks (TNNs) into state space models (SSMs) so that inference runs with constant complexity.
  • methods: The conversion is formulated as an optimization problem with a closed-form solution: the target equation is transformed into a Vandermonde linear system that can be solved efficiently with the Discrete Fourier Transform (DFT). No training is required.
  • results: Extensive experiments on language modeling tasks across various settings demonstrate the method's effectiveness; it maintains numerical stability and is more numerically stable than gradient-descent-based alternatives.
    Abstract Toeplitz Neural Networks (TNNs) have exhibited outstanding performance in various sequence modeling tasks. They outperform commonly used Transformer-based models while benefiting from log-linear space-time complexities. On the other hand, State Space Models (SSMs) achieve lower performance than TNNs in language modeling but offer the advantage of constant inference complexity. In this paper, we aim to combine the strengths of TNNs and SSMs by converting TNNs to SSMs during inference, thereby enabling TNNs to achieve the same constant inference complexities as SSMs. To accomplish this, we formulate the conversion process as an optimization problem and provide a closed-form solution. We demonstrate how to transform the target equation into a Vandermonde linear system problem, which can be efficiently solved using the Discrete Fourier Transform (DFT). Notably, our method requires no training and maintains numerical stability. It can be also applied to any LongConv-based model. To assess its effectiveness, we conduct extensive experiments on language modeling tasks across various settings. Additionally, we compare our method to other gradient-descent solutions, highlighting the superior numerical stability of our approach. The source code is available at https://github.com/OpenNLPLab/ETSC-Exact-Toeplitz-to-SSM-Conversion.
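
The computational core is easy to illustrate: a Vandermonde system whose nodes are the N-th roots of unity is exactly a DFT matrix, so the solve collapses to a single inverse FFT. The toy below demonstrates that structural fact; it is not the paper's full TNN-to-SSM conversion:

```python
import numpy as np

N = 8
k = np.arange(N)
nodes = np.exp(-2j * np.pi * k / N)          # N-th roots of unity
V = np.vander(nodes, N, increasing=True).T   # V[j, m] = nodes[m]**j == DFT matrix
b = np.random.default_rng(0).normal(size=N)

x_direct = np.linalg.solve(V, b)             # generic O(N^3) solve
x_fft = np.fft.ifft(b)                       # O(N log N), identical answer
print(np.allclose(x_direct, x_fft))          # True
```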

Thread of Thought Unraveling Chaotic Contexts

  • paper_url: http://arxiv.org/abs/2311.08734
  • repo_url: None
  • paper_authors: Yucheng Zhou, Xiubo Geng, Tao Shen, Chongyang Tao, Guodong Long, Jian-Guang Lou, Jianbing Shen
  • for: The paper aims to improve the reasoning performance of large language models (LLMs) in chaotic contexts by introducing a new "Thread of Thought" (ThoT) strategy.
  • methods: The ThoT strategy segments and analyzes extended contexts, selecting pertinent information to improve the reasoning performance of LLMs. The strategy is versatile and can be integrated with various LLMs and prompting techniques.
  • results: The paper demonstrates the effectiveness of ThoT using three datasets (PopQA, EntityQ, and MTCR) and shows that ThoT significantly improves reasoning performance compared to other prompting techniques.
    Abstract Large Language Models (LLMs) have ushered in a transformative era in the field of natural language processing, excelling in tasks related to text comprehension and generation. Nevertheless, they encounter difficulties when confronted with chaotic contexts (e.g., distractors rather than long irrelevant context), leading to the inadvertent omission of certain details within the chaotic context. In response to these challenges, we introduce the "Thread of Thought" (ThoT) strategy, which draws inspiration from human cognitive processes. ThoT systematically segments and analyzes extended contexts while adeptly selecting pertinent information. This strategy serves as a versatile "plug-and-play" module, seamlessly integrating with various LLMs and prompting techniques. In the experiments, we utilize the PopQA and EntityQ datasets, as well as a Multi-Turn Conversation Response dataset (MTCR) we collected, to illustrate that ThoT significantly improves reasoning performance compared to other prompting techniques.

Enhancing Emergency Decision-making with Knowledge Graphs and Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08732
  • repo_url: None
  • paper_authors: Minze Chen, Zhenxiang Tao, Weitong Tang, Tingxin Qin, Rui Yang, Chunli Zhu
  • for: To provide reliable decision support in emergencies.
  • methods: The system combines a structured emergency knowledge graph with large language models, guiding the LLM to reason over the graph via a prompt chain.
  • results: Across different emergency scenarios, the system scores 9.06, 9.09, 9.03, and 9.09 (comprehensibility, accuracy, conciseness, and instructiveness), a significant improvement over baseline models.
    Abstract Emergency management urgently requires comprehensive knowledge while having a high possibility to go beyond individuals' cognitive scope. Therefore, artificial intelligence(AI) supported decision-making under that circumstance is of vital importance. Recent emerging large language models (LLM) provide a new direction for enhancing targeted machine intelligence. However, the utilization of LLM directly would inevitably introduce unreliable output for its inherent issue of hallucination and poor reasoning skills. In this work, we develop a system called Enhancing Emergency decision-making with Knowledge Graph and LLM (E-KELL), which provides evidence-based decision-making in various emergency stages. The study constructs a structured emergency knowledge graph and guides LLMs to reason over it via a prompt chain. In real-world evaluations, E-KELL receives scores of 9.06, 9.09, 9.03, and 9.09 in comprehensibility, accuracy, conciseness, and instructiveness from a group of emergency commanders and firefighters, demonstrating a significant improvement across various situations compared to baseline models. This work introduces a novel approach to providing reliable emergency decision support.

Uncertainty Estimation on Sequential Labeling via Uncertainty Transmission

  • paper_url: http://arxiv.org/abs/2311.08726
  • repo_url: None
  • paper_authors: Jianfeng He, Linlin Yu, Shuo Lei, Chang-Tien Lu, Feng Chen
  • for: This work aims to improve uncertainty estimation on Named Entity Recognition predictions (UE-NER).
  • methods: The paper proposes a Sequential Labeling Posterior Network (SLPN) that estimates uncertainty scores for extracted entities, accounting for the connections between entities (one entity embedding is learned based on the others) and handling wrong-span cases with a dedicated evaluation strategy.
  • results: SLPN achieves significant improvements on two datasets, e.g., a 5.54-point AUPR gain on the MIT-Restaurant dataset.
    Abstract Sequential labeling is a task predicting labels for each token in a sequence, such as Named Entity Recognition (NER). NER tasks aim to extract entities and predict their labels given a text, which is important in information extraction. Although previous works have shown great progress in improving NER performance, uncertainty estimation on NER (UE-NER) is still underexplored but essential. This work focuses on UE-NER, which aims to estimate uncertainty scores for the NER predictions. Previous uncertainty estimation models often overlook two unique characteristics of NER: the connection between entities (i.e., one entity embedding is learned based on the other ones) and wrong span cases in the entity extraction subtask. Therefore, we propose a Sequential Labeling Posterior Network (SLPN) to estimate uncertainty scores for the extracted entities, considering uncertainty transmitted from other tokens. Moreover, we have defined an evaluation strategy to address the specificity of wrong-span cases. Our SLPN has achieved significant improvements on two datasets, such as a 5.54-point improvement in AUPR on the MIT-Restaurant dataset.

Method for Text Entity Linking in Power Distribution Scheduling Oriented to Power Distribution Network Knowledge Graph

  • paper_url: http://arxiv.org/abs/2311.08724
  • repo_url: None
  • paper_authors: Xiang Li, Che Wang, Bing Li, Hao Chen, Sizhe Li
  • for: This work proposes a method for linking entities in power distribution dispatch texts to a power distribution network knowledge graph.
  • methods: The method exploits the semantic, phonetic, and syntactic features of entities in both the knowledge graph and the dispatch texts, and uses an enhanced model, the lexical semantic feature-based skip convolutional neural network (LSF-SCNN), for entity matching.
  • results: Cross-validated experiments against a control model show that LSF-SCNN links a variety of entity types in English dispatch texts with high overall entity-linking accuracy.
    Abstract The proposed method for linking entities in power distribution dispatch texts to a power distribution network knowledge graph is based on a deep understanding of these networks. This method leverages the unique features of entities in both the power distribution network's knowledge graph and the dispatch texts, focusing on their semantic, phonetic, and syntactic characteristics. An enhanced model, the Lexical Semantic Feature-based Skip Convolutional Neural Network (LSF-SCNN), is utilized for effectively matching dispatch text entities with those in the knowledge graph. The efficacy of this model, compared to a control model, is evaluated through cross-validation methods in real-world power distribution dispatch scenarios. The results indicate that the LSF-SCNN model excels in accurately linking a variety of entity types, demonstrating high overall accuracy in entity linking when the process is conducted in English.

Token Prediction as Implicit Classification to Identify LLM-Generated Text

  • paper_url: http://arxiv.org/abs/2311.08723
  • repo_url: https://github.com/markchenyutian/t5-sentinel-public
  • paper_authors: Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, Bhiksha Raj
  • for: This work proposes a new approach for identifying which large language models (LLMs) may have been involved in generating a given text.
  • methods: The classification task is reframed as a next-token prediction task: the base LM is fine-tuned directly rather than extended with a classification layer, with the Text-to-Text Transfer Transformer (T5) as the experimental backbone.
  • results: The method performs exceptionally well on the text classification task, highlighting its simplicity and efficiency. Interpretability studies show that the extracted features differentiate the writing styles of various LLMs even without an explicit classifier. The authors also collect OpenLLMText, a dataset of roughly 340k text samples from humans and LLMs including GPT3.5, PaLM, LLaMA, and GPT2.
    Abstract This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation. Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task and directly fine-tune the base LM to perform it. We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments. We compared our approach to the more direct approach of utilizing hidden states for classification. Evaluation shows the exceptional performance of our method in the text classification task, highlighting its simplicity and efficiency. Furthermore, interpretability studies on the features extracted by our model reveal its ability to differentiate distinctive writing styles among various LLMs even in the absence of an explicit classifier. We also collected a dataset named OpenLLMText, containing approximately 340k text samples from human and LLMs, including GPT3.5, PaLM, LLaMA, and GPT2.
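
The reframing is simple to sketch: the source label is emitted as the model's next tokens, so the standard seq2seq loss doubles as the classification objective. The label strings and example text below are illustrative, not the paper's exact setup:

```python
# Sketch: fine-tune T5 so its *generated tokens* spell out the source label,
# instead of attaching a classification head.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "Sample passage whose origin we want to identify."
labels = ["human", "gpt2", "llama", "palm", "gpt3.5"]  # illustrative label set

enc = tok(text, return_tensors="pt")
tgt = tok("gpt2", return_tensors="pt").input_ids   # label rendered as target tokens
loss = model(**enc, labels=tgt).loss               # ordinary seq2seq training loss
print(float(loss))
```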

Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory

  • paper_url: http://arxiv.org/abs/2311.08719
  • repo_url: None
  • paper_authors: Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, Guannan Zhang
  • for: To improve the performance of large language models in long-term human-machine interaction and reduce biased, inconsistent reasoning.
  • methods: The paper proposes a new memory mechanism, TiM (Think-in-Memory), that lets an LLM maintain an evolving memory of thoughts along the conversation stream, dynamically updated via insert, forget, and merge operations, with Locality-Sensitive Hashing for efficient retrieval in long-term conversations.
  • results: On real-world and simulated dialogues covering a wide range of topics, equipping LLMs with TiM significantly improves their responses and reduces inconsistent reasoning.
    Abstract Memory-augmented Large Language Models (LLMs) have demonstrated remarkable performance in long-term human-machine interactions, which basically relies on iterative recalling and reasoning of history to generate high-quality responses. However, such repeated recall-reason steps easily produce biased thoughts, \textit{i.e.}, inconsistent reasoning results when recalling the same history for different questions. On the contrary, humans can keep thoughts in the memory and recall them without repeated reasoning. Motivated by this human capability, we propose a novel memory mechanism called TiM (Think-in-Memory) that enables LLMs to maintain an evolved memory for storing historical thoughts along the conversation stream. The TiM framework consists of two crucial stages: (1) before generating a response, a LLM agent recalls relevant thoughts from memory, and (2) after generating a response, the LLM agent post-thinks and incorporates both historical and new thoughts to update the memory. Thus, TiM can eliminate the issue of repeated reasoning by saving the post-thinking thoughts as the history. Besides, we formulate the basic principles to organize the thoughts in memory based on the well-established operations, (\textit{i.e.}, insert, forget, and merge operations), allowing for dynamic updates and evolution of the thoughts. Furthermore, we introduce Locality-Sensitive Hashing into TiM to achieve efficient retrieval for the long-term conversations. We conduct qualitative and quantitative experiments on real-world and simulated dialogues covering a wide range of topics, demonstrating that equipping existing LLMs with TiM significantly enhances their performance in generating responses for long-term interactions.
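
A minimal sketch of such a memory, assuming random-projection LSH buckets and a cosine-similarity merge rule; both are illustrative choices, and the paper's exact memory organization may differ:

```python
import numpy as np

class ThoughtMemory:
    """Toy thought store with insert / forget / merge and LSH retrieval."""

    def __init__(self, dim, n_bits=8, seed=0):
        self.planes = np.random.default_rng(seed).normal(size=(n_bits, dim))
        self.buckets = {}                       # hash key -> list of (vec, text)

    def _hash(self, v):
        return tuple((self.planes @ v > 0).astype(int))  # sign pattern

    def insert(self, vec, text):
        self.buckets.setdefault(self._hash(vec), []).append((vec, text))

    def forget(self, text):
        for key in list(self.buckets):
            self.buckets[key] = [(v, t) for v, t in self.buckets[key] if t != text]

    def merge(self, vec, text):
        """Replace near-duplicate thoughts in the same bucket with a merged one."""
        key = self._hash(vec)
        kept = [(v, t) for v, t in self.buckets.get(key, [])
                if np.dot(v, vec) / (np.linalg.norm(v) * np.linalg.norm(vec)) < 0.9]
        self.buckets[key] = kept + [(vec, text)]

    def recall(self, vec, k=3):
        cands = self.buckets.get(self._hash(vec), [])
        cands.sort(key=lambda p: -np.dot(p[0], vec))
        return [t for _, t in cands[:k]]
```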

Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling

  • paper_url: http://arxiv.org/abs/2311.08718
  • repo_url: None
  • paper_authors: Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, Yang Zhang
  • for: This paper aims to improve the reliability, trustworthiness, and interpretability of large language models (LLMs) by developing an uncertainty decomposition framework.
  • methods: The proposed framework, called input clarifications ensemble, generates a set of clarifications for the input and feeds them into the fixed LLMs to ensure accurate and reliable uncertainty quantification.
  • results: Empirical evaluations demonstrate that the proposed framework provides accurate and reliable uncertainty quantification on various tasks, and the code will be made publicly available at https://github.com/UCSB-NLP-Chang/llm_uncertainty.
    Abstract Uncertainty decomposition refers to the task of decomposing the total uncertainty of a model into data (aleatoric) uncertainty, resulting from the inherent complexity or ambiguity of the data, and model (epistemic) uncertainty, resulting from the lack of knowledge in the model. Performing uncertainty decomposition for large language models (LLMs) is an important step toward improving the reliability, trustworthiness, and interpretability of LLMs, but this research task is very challenging and remains unresolved. The existing canonical method, Bayesian Neural Network (BNN), cannot be applied to LLMs, because BNN requires training and ensembling multiple variants of models, which is infeasible or prohibitively expensive for LLMs. In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarifications ensemble, which bypasses the need to train new models. Rather than ensembling models with different parameters, our approach generates a set of clarifications for the input, feeds them into the fixed LLMs, and ensembles the corresponding predictions. We show that our framework shares a symmetric decomposition structure with BNN. Empirical evaluations demonstrate that the proposed framework provides accurate and reliable uncertainty quantification on various tasks. Code will be made publicly available at https://github.com/UCSB-NLP-Chang/llm_uncertainty .
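
The decomposition itself follows the usual information-theoretic split, with clarifications playing the role a BNN's ensemble members normally play: total predictive entropy = expected per-clarification entropy (aleatoric) + mutual information (epistemic). A toy illustration with made-up answer distributions:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

# Answer distribution from the same fixed LLM under each clarification
# of an ambiguous question (values are illustrative).
p_per_clarification = np.array([
    [0.9, 0.1],    # clarification 1 -> confident "A"
    [0.1, 0.9],    # clarification 2 -> confident "B"
    [0.5, 0.5],    # clarification 3 -> genuinely unsure
])

total = entropy(p_per_clarification.mean(axis=0))              # predictive entropy
aleatoric = np.mean([entropy(p) for p in p_per_clarification]) # expected entropy
epistemic = total - aleatoric                                  # mutual information
print(f"total={total:.3f} aleatoric={aleatoric:.3f} epistemic={epistemic:.3f}")
```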

PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.08711
  • repo_url: https://github.com/ytyz1307zzh/plug
  • paper_authors: Zhihan Zhang, Dong-Ho Lee, Yuwei Fang, Wenhao Yu, Mengzhao Jia, Meng Jiang, Francesco Barbieri
  • for: To improve large language models' ability to understand and respond to diverse human instructions, particularly in lower-resource languages.
  • methods: The approach uses a high-resource language (primarily English) as a pivot: the model first processes the instruction in the pivot language and then produces its response in the target language.
  • results: Compared with responding directly in the target language alone, the approach improves the instruction-following abilities of LLMs by 29% on average.
    Abstract Instruction tuning has remarkably advanced large language models (LLMs) in understanding and responding to diverse human instructions. Despite the success in high-resource languages, its application in lower-resource ones faces challenges due to the imbalanced foundational abilities of LLMs across different languages, stemming from the uneven language distribution in their pre-training data. To tackle this issue, we propose pivot language guided generation (PLUG), an approach that utilizes a high-resource language, primarily English, as the pivot to enhance instruction tuning in lower-resource languages. It trains the model to first process instructions in the pivot language, and then produce responses in the target language. To evaluate our approach, we introduce a benchmark, X-AlpacaEval, of instructions in 4 languages (Chinese, Korean, Italian, and Spanish), each annotated by professional translators. Our approach demonstrates a significant improvement in the instruction-following abilities of LLMs by 29% on average, compared to directly responding in the target language alone. Further experiments validate the versatility of our approach by employing alternative pivot languages beyond English to assist languages where LLMs exhibit lower proficiency.

Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations

  • paper_url: http://arxiv.org/abs/2311.08705
  • repo_url: None
  • paper_authors: Ankita Gupta, Chulaka Gunasekara, Hui Wan, Jatin Ganhotra, Sachindra Joshi, Marina Danilevsky
  • for: This study investigates the robustness of dialogue summarization models to naturally occurring language variations and noise.
  • methods: Using publicly available datasets, the authors systematically evaluate state-of-the-art dialogue summarization models under two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges.
  • results: Both fine-tuned and instruction-tuned models are affected by input variations, with instruction-tuned models more susceptible, particularly to dialogue-level perturbations; the findings are validated by human evaluation. Training on a fraction of perturbed data proves insufficient to resolve the robustness challenges.
    Abstract Dialogue summarization task involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations) and existing dialogue summarization models suffer from performance drop on such conversations. In this study, we systematically investigate the impact of such variations on state-of-the-art dialogue summarization models using publicly available datasets. To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We conduct our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which capture different aspects of the summarization model's performance. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate if the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data and observe that this approach is insufficient to address robustness challenges with current models and thus warrants a more thorough investigation to identify better solutions. Overall, our work highlights robustness challenges in dialogue summarization and provides insights for future research.

Attribute Diversity Determines the Systematicity Gap in VQA

  • paper_url: http://arxiv.org/abs/2311.08695
  • repo_url: None
  • paper_authors: Ian Berlot-Attwell, A. Michael Carrell, Kumar Krishna Agrawal, Yash Sharma, Naomi Saphra
  • for: To study whether neural networks can generalize to new combinations of familiar concepts.
  • methods: The paper introduces a new diagnostic dataset, CLEVR-HOPE, to measure the systematicity gap in visual question answering.
  • results: Increasing the quantity of training data does not reduce the systematicity gap, but increasing the diversity of attribute combinations seen during training does: the more distinct attribute-type combinations seen in training, the more systematic the resulting model.
    Abstract The degree to which neural networks can generalize to new combinations of familiar concepts, and the conditions under which they are able to do so, has long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that while increased quantity of training data does not reduce the systematicity gap, increased training data diversity of the attributes in the unseen combination does. In all, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.

Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08692
  • repo_url: None
  • paper_authors: Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou
  • for: To improve ensembles of large language models (LLMs) by mining each model's latent expertise across domains and tasks.
  • methods: The paper proposes Zooter, a reward-guided routing method that distills rewards on training queries into a routing function that precisely dispatches each query to the LLM with the relevant expertise; a tag-based label enhancement mitigates noise from using rewards as silver supervision.
  • results: On a comprehensive benchmark collection with 26 subsets, Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward-model ranking methods.
    Abstract The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate it by mining latent expertise with off-the-shelf reward models. We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. Zooter shows computation efficiency in inference as it introduces only a minor computation overhead of a routing function compared with reward model ranking methods. We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks. Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.
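
A sketch of the distillation idea, assuming the per-expert rewards are softened into a target distribution and a small router is fit with a KL objective; the encoder, reward values, and temperature below are placeholders, not the paper's configuration:

```python
import torch
import torch.nn.functional as F

n_experts, d = 4, 64
router = torch.nn.Linear(d, n_experts)      # routing function over query embeddings
opt = torch.optim.Adam(router.parameters(), lr=1e-3)

def step(query_emb, rewards, tau=0.5):
    target = F.softmax(rewards / tau, dim=-1)           # rewards as silver supervision
    logp = F.log_softmax(router(query_emb), dim=-1)
    loss = F.kl_div(logp, target, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

emb = torch.randn(8, d)                     # batch of query embeddings (placeholder)
rew = torch.randn(8, n_experts)             # off-the-shelf reward model scores
print(step(emb, rew))

# At inference, each query is routed to its argmax expert:
expert = router(emb).argmax(dim=-1)
```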

Understanding Calibration for Multilingual Question Answering Models

  • paper_url: http://arxiv.org/abs/2311.08669
  • repo_url: None
  • paper_authors: Yahan Yang, Soham Dan, Dan Roth, Insup Lee
  • for: This paper studies the calibration of multilingual pre-trained language models on question-answering tasks.
  • methods: The authors run extensive experiments spanning extractive and generative QA model designs and diverse languages, both high-resource and low-resource, studying calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings.
  • results: Automatically translated data augmentation proves highly effective at improving model calibration; ablation experiments examine the effect of model size on calibration and compare multilingual models with their monolingual counterparts across diverse tasks and languages.
    Abstract Multilingual pre-trained language models are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well they are calibrated. In this paper, we study the calibration properties of several pre-trained multilingual large language models (LLMs) on a variety of question-answering tasks. We perform extensive experiments, spanning both extractive and generative QA model designs and diverse languages, spanning both high-resource and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. We demonstrate automatically translated data augmentation as a highly effective technique to improve model calibration. We also conduct a number of ablation experiments to study the effect of model size on calibration and how multilingual models compare with their monolingual counterparts for diverse tasks and languages.
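
A study like this leans on standard calibration metrics; for concreteness, here is a routine expected-calibration-error (ECE) computation of the kind such evaluations rely on (not code from the paper):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average the |accuracy - confidence|
    gap, weighted by the fraction of examples in each bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap
    return total

print(ece([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```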

It Takes Two to Negotiate: Modeling Social Exchange in Online Multiplayer Games

  • paper_url: http://arxiv.org/abs/2311.08666
  • repo_url: https://github.com/kj2013/claff-diplomacy
  • paper_authors: Kokil Jaidka, Hansin Ahuja, Lynnette Ng
  • for: To study player interactions in the online turn-based strategy game Diplomacy and understand how players negotiate their way to victory.
  • methods: The authors annotate a dataset of over 10,000 chat messages for different negotiation strategies and empirically examine their importance in predicting short- and long-term game outcomes.
  • results: Negotiation strategies can be predicted reasonably accurately through linguistic modeling of the chat messages, but more is needed for short-term outcomes such as trustworthiness; the strategies are essential in graph-aware reinforcement learning approaches that predict long-term outcomes, such as a player's success, from prior negotiation history.
    Abstract Online games are dynamic environments where players interact with each other, which offers a rich setting for understanding how players negotiate their way through the game to an ultimate victory. This work studies online player interactions during the turn-based strategy game, Diplomacy. We annotated a dataset of over 10,000 chat messages for different negotiation strategies and empirically examined their importance in predicting long- and short-term game outcomes. Although negotiation strategies can be predicted reasonably accurately through the linguistic modeling of the chat messages, more is needed for predicting short-term outcomes such as trustworthiness. On the other hand, they are essential in graph-aware reinforcement learning approaches to predict long-term outcomes, such as a player's success, based on their prior negotiation history. We close with a discussion of the implications and impact of our work. The dataset is available at https://github.com/kj2013/claff-diplomacy.

Multistage Collaborative Knowledge Distillation from Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08640
  • repo_url: None
  • paper_authors: Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, Benjamin Rozonoyer, Md Arafat Sultan, Jay-Yoon Lee, Mohit Iyyer, Andrew McCallum
  • for: This paper targets semi-supervised sequence prediction tasks where labeled data are too scarce to finetune a model effectively and few-shot prompting of a large language model (LLM) performs suboptimally.
  • methods: The paper proposes multistage collaborative knowledge distillation from an LLM (MCKD): few-shot in-context learning first produces pseudolabels for unlabeled data; then, at each distillation stage, a pair of students is trained on disjoint partitions of the pseudolabeled data, each producing new and improved pseudolabels to supervise the next stage's students.
  • results: On two constituency parsing tasks, MCKD improves model performance; on CRAFT biomedical parsing, 3-stage MCKD with 50 labeled examples matches supervised finetuning with 500 examples and outperforms the prompted LLM and vanilla KD by 7.5 and 3.7 parsing F1 points, respectively.
    Abstract We study semi-supervised sequence prediction tasks where labeled data are too scarce to effectively finetune a model and at the same time few-shot prompting of a large language model (LLM) has suboptimal performance. This happens when a task, such as parsing, is expensive to annotate and also unfamiliar to a pretrained LLM. In this paper, we present a discovery that student models distilled from a prompted LLM can often generalize better than their teacher on such tasks. Leveraging this finding, we propose a new distillation method, multistage collaborative knowledge distillation from an LLM (MCKD), for such tasks. MCKD first prompts an LLM using few-shot in-context learning to produce pseudolabels for unlabeled data. Then, at each stage of distillation, a pair of students are trained on disjoint partitions of the pseudolabeled data. Each student subsequently produces new and improved pseudolabels for the unseen partition to supervise the next round of student(s) with. We show the benefit of multistage cross-partition labeling on two constituency parsing tasks. On CRAFT biomedical parsing, 3-stage MCKD with 50 labeled examples matches the performance of supervised finetuning with 500 examples and outperforms the prompted LLM and vanilla KD by 7.5% and 3.7% parsing F1, respectively.
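
The cross-partition loop can be written down compactly. In the sketch below, `llm_pseudolabel`, `train`, and `predict` are placeholder callables standing in for few-shot prompting, student finetuning, and student inference:

```python
def mckd(unlabeled, llm_pseudolabel, train, predict, n_stages=3):
    part_a, part_b = unlabeled[::2], unlabeled[1::2]   # disjoint partitions
    labels_a = [llm_pseudolabel(x) for x in part_a]    # stage 0: few-shot LLM
    labels_b = [llm_pseudolabel(x) for x in part_b]
    student_a = student_b = None
    for _ in range(n_stages):
        student_a = train(part_a, labels_a)            # student trained on A ...
        student_b = train(part_b, labels_b)
        labels_b = [predict(student_a, x) for x in part_b]  # ... relabels B
        labels_a = [predict(student_b, x) for x in part_a]  # and vice versa
    return student_a, student_b

# Toy demo with stand-ins (a real setup would finetune a parser here):
demo = mckd(list(range(10)),
            llm_pseudolabel=lambda x: x % 2,
            train=lambda xs, ys: dict(zip(xs, ys)),
            predict=lambda model, x: model.get(x, 0))
```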

Formal Proofs as Structured Explanations: Proposing Several Tasks on Explainable Natural Language Inference

  • paper_url: http://arxiv.org/abs/2311.08637
  • repo_url: None
  • paper_authors: Lasha Abzianidze
  • for: To propose explainable natural language inference (NLI) tasks built on formal proofs.
  • methods: Formal proofs are produced by a reliable, high-performing logic-based NLI system, and the in-depth information available in the generated proofs is used to define NLI tasks with structured explanations.
  • results: The proposed tasks can be ordered by difficulty in terms of the granularity of explanations, and the authors argue they will suffer from substantially fewer shortcomings than existing explainable NLI tasks.
    Abstract In this position paper, we propose a way of exploiting formal proofs to put forward several explainable natural language inference (NLI) tasks. The formal proofs will be produced by a reliable and high-performing logic-based NLI system. Taking advantage of the in-depth information available in the generated formal proofs, we show how it can be used to define NLI tasks with structured explanations. The proposed tasks can be ordered according to difficulty defined in terms of the granularity of explanations. We argue that the tasks will suffer with substantially fewer shortcomings than the existing explainable NLI tasks (or datasets).

DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models

  • paper_url: http://arxiv.org/abs/2311.08623
  • repo_url: None
  • paper_authors: Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha
  • for: To reduce the inference latency of encoder-decoder transformer models.
  • methods: The paper proposes Dynamic Early Exit on Decoder (DEED): a multi-exit encoder-decoder model trained with deep supervision so each decoder layer can generate plausible predictions, combined with simple yet practical techniques such as a shared generation head and adaptation modules to preserve accuracy when exiting at shallow layers.
  • results: Evaluated with two state-of-the-art encoder-decoder transformer models on various vision-language tasks, the approach reduces overall inference latency by 30%-60% with comparable or even higher accuracy than the baselines.
    Abstract Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including shared generation head and adaptation modules, to keep accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence of the current layer at each individual decoding step. Considering different number of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks. We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
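
A toy version of the step-level exit rule, assuming a confidence threshold on a shared generation head; the threshold and the linear "layers" are illustrative stand-ins for a real multi-exit decoder:

```python
import torch

def decode_step(hidden, layers, head, threshold=0.9):
    """Run decoder layers one at a time; exit at the first layer whose
    max token probability clears the threshold."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        probs = torch.softmax(head(hidden), dim=-1)
        conf, tok = probs.max(dim=-1)
        if conf.item() >= threshold:           # confident enough: exit early
            return tok.item(), i + 1           # token, layers actually used
    return tok.item(), len(layers)             # fell through: used all layers

d, vocab = 32, 100
layers = [torch.nn.Linear(d, d) for _ in range(4)]  # stand-in decoder layers
head = torch.nn.Linear(d, vocab)                    # shared generation head
print(decode_step(torch.randn(1, d), layers, head))
```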

Multiple-Question Multiple-Answer Text-VQA

  • paper_url: http://arxiv.org/abs/2311.08622
  • repo_url: https://github.com/jha1990/VQA-Multimodal-AI
  • paper_authors: Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay Mahadevan
  • for: Multiple-Question Multiple-Answer (MQMA) is a novel approach to text-VQA in encoder-decoder transformer models, a task that requires understanding both text (typically from OCR) and an associated image.
  • methods: MQMA takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner, using several novel architectural modifications to standard encoder-decoder transformers and a novel MQMA denoising pre-training task that teaches the model to align multiple questions and content with their answers.
  • results: The MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets with strong baselines: absolute improvements of +2.5% on OCR-VQA, +1.4% on TextVQA, +0.6% on ST-VQA, and +1.1% on DocVQA over previous state-of-the-art approaches.
    Abstract We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers. MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines. Specifically, on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%), DocVQA (+1.1%) absolute improvements over the previous state-of-the-art approaches.

Toucan: Token-Aware Character Level Language Modeling

  • paper_url: http://arxiv.org/abs/2311.08620
  • repo_url: None
  • paper_authors: William Fleshman, Benjamin Van Durme
  • for: This paper aims to make character-level language models more efficient at generating text.
  • methods: The paper proposes Toucan, a "token-aware" augmentation to character-level models: the model learns to combine character representations into tokens dynamically, without a separately trained tokenizer.
  • results: Compared with prior work, the method yields significant speed-ups in character generation without a loss in language modeling performance, and tokenizes a greater number of longer sequences as single items than fixed-vocabulary schemes such as Byte-Pair Encoding and WordPiece. Code and project page: https://nlp.jhu.edu/nuggets/.
    Abstract Character-level language models obviate the need for separately trained tokenizers, but efficiency suffers from longer sequence lengths. Learning to combine character representations into tokens has made training these models more efficient, but they still require decoding characters individually. We propose Toucan, an augmentation to character-level models to make them "token-aware". Comparing our method to prior work, we demonstrate significant speed-ups in character generation without a loss in language modeling performance. We then explore differences between our learned dynamic tokenization of character sequences with popular fixed vocabulary solutions such as Byte-Pair Encoding and WordPiece, finding our approach leads to a greater amount of longer sequences tokenized as single items. Our project and code are available at https://nlp.jhu.edu/nuggets/.

Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech

  • paper_url: http://arxiv.org/abs/2311.08607
  • repo_url: https://github.com/spaghettisystems/emotion_whisper
  • paper_authors: Mohamed Osman, Tamer Nadeem, Ghada Khoriba
  • for: This work aims to advance human-machine interaction by recognizing emotions in spoken communication.
  • methods: The study amalgamates 16 diverse datasets, totaling 375 hours of data across languages including English, Chinese, and Japanese; a soft labeling system captures gradational emotional intensity, and the model uses the Whisper encoder with data augmentation inspired by contrastive learning, emphasizing the temporal dynamics of emotions.
  • results: Validation on four multilingual datasets shows notable zero-shot generalization; the authors release open-source model weights and report initial promising results after fine-tuning on Hume-Prosody.
    Abstract Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion detection methodologies often display biases when applied cross-corpus. To address this, our study amalgamates 16 diverse datasets, resulting in 375 hours of data across languages like English, Chinese, and Japanese. We propose a soft labeling system to capture gradational emotional intensities. Using the Whisper encoder and data augmentation methods inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Our validation on four multilingual datasets demonstrates notable zero-shot generalization. We publish our open source model weights and initial promising results after fine-tuning on Hume-Prosody.
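
The soft-labeling objective is straightforward to sketch: targets are probability vectors over emotions rather than one-hot classes, optimized with a KL/cross-entropy loss. The label values below are illustrative, not drawn from the paper:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 4, requires_grad=True)        # batch of 2, 4 emotion classes
soft_targets = torch.tensor([[0.7, 0.2, 0.1, 0.0],    # mostly angry, a little sad
                             [0.1, 0.1, 0.4, 0.4]])   # mixed neutral/happy
loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets,
                reduction="batchmean")                # soft-label objective
loss.backward()
print(loss.item())
```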