cs.CL - 2023-11-16

Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study

  • paper_url: http://arxiv.org/abs/2311.10236
  • repo_url: https://github.com/maikezuefle/latent-feature-splits
  • paper_authors: Maike Züfle, Verna Dankers, Ivan Titov
  • for: To improve the robustness of hate speech detection systems on social media platforms and keep models from overfitting to specific targets and keywords.
  • methods: New train-test splits based on clustering of models' hidden representations, in two variants (Subset-Sum-Split and Closest-Split), applied to two datasets with four pretrained models.
  • results: Models fail catastrophically when confronted with these shifted data distributions, and no clear surface-level property of the splits explains the drop, showing that task difficulty is not always humanly interpretable; the authors recommend incorporating latent feature-based splits in model development and evaluation.
    Abstract With the ever-growing presence of social media platforms comes the increased spread of harmful content and the need for robust hate speech detection systems. Such systems easily overfit to specific targets and keywords, and evaluating them without considering distribution shifts that might occur between train and test data overestimates their benefit. We challenge hate speech models via new train-test splits of existing datasets that rely on the clustering of models' hidden representations. We present two split variants (Subset-Sum-Split and Closest-Split) that, when applied to two datasets using four pretrained models, reveal how models catastrophically fail on blind spots in the latent space. This result generalises when developing a split with one model and evaluating it on another. Our analysis suggests that there is no clear surface-level property of the data split that correlates with the decreased performance, which underscores that task difficulty is not always humanly interpretable. We recommend incorporating latent feature-based splits in model development and release two splits via the GenBench benchmark.
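A minimal sketch of how a latent feature-based split could be constructed, assuming hidden representations have already been extracted from a pretrained model, is shown below. The clustering setup and the hold-out rule are illustrative placeholders rather than the paper's exact Subset-Sum-Split or Closest-Split procedures (see the linked repository for those).

```python
# Illustrative sketch: build a train/test split from clusters of hidden representations.
# `embeddings` is a placeholder for real hidden states (n_examples x dim); the clustering
# and hold-out choices are simplified stand-ins for the paper's split construction.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))        # placeholder for extracted hidden states

n_clusters = 10
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

# Hold out the cluster whose centroid is farthest from all others (a "blind spot" heuristic).
centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in range(n_clusters)])
dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1).sum(axis=1)
test_cluster = int(dists.argmax())

test_idx = np.where(labels == test_cluster)[0]
train_idx = np.where(labels != test_cluster)[0]
print(f"train={len(train_idx)} test={len(test_idx)} (held-out cluster {test_cluster})")
```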

The Impact of Familiarity on Naming Variation: A Study on Object Naming in Mandarin Chinese

  • paper_url: http://arxiv.org/abs/2311.10181
  • repo_url: None
  • paper_authors: Yunke He, Xixian Liao, Jialing Liang, Gemma Boleda
  • for: To investigate why different speakers produce different names for the same object or entity.
  • methods: A Language and Vision dataset for Mandarin Chinese with an average of 20 names for each of 1319 naturalistic images, used to study how familiarity with a kind of object relates to the degree of naming variation across subjects.
  • results: Familiarity influences naming variation in two competing ways: it can expand vocabulary, increasing variation, or promote convergence on conventional names, reducing variation, and evidence is found for both; the study illustrates how computational resources can address questions in Cognitive Science.
    Abstract Different speakers often produce different names for the same object or entity (e.g., "woman" vs. "tourist" for a female tourist). The reasons behind variation in naming are not well understood. We create a Language and Vision dataset for Mandarin Chinese that provides an average of 20 names for 1319 naturalistic images, and investigate how familiarity with a given kind of object relates to the degree of naming variation it triggers across subjects. We propose that familiarity influences naming variation in two competing ways: increasing familiarity can either expand vocabulary, leading to higher variation, or promote convergence on conventional names, thereby reducing variation. We find evidence for both factors being at play. Our study illustrates how computational resources can be used to address research questions in Cognitive Science.
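As a rough illustration of how naming variation can be quantified, the snippet below computes the Shannon entropy of each image's name distribution and correlates it with a familiarity score. The annotations and familiarity ratings are placeholders, not the released dataset, and the entropy measure is a generic choice rather than necessarily the paper's.

```python
# Toy sketch: naming variation per image as the entropy of its name distribution,
# correlated with a (placeholder) familiarity rating for the depicted object kind.
from collections import Counter
from math import log2
from scipy.stats import spearmanr

names_per_image = {                      # placeholder annotations (~20 names per image in the paper)
    "img_001": ["woman", "tourist", "woman", "lady", "tourist", "woman"],
    "img_002": ["dog", "dog", "dog", "puppy", "dog", "dog"],
    "img_003": ["cup", "mug", "cup", "glass", "teacup", "mug"],
}
familiarity = {"img_001": 3.2, "img_002": 4.8, "img_003": 4.1}   # placeholder ratings

def name_entropy(names):
    counts = Counter(names)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

images = sorted(names_per_image)
variation = [name_entropy(names_per_image[i]) for i in images]
fam = [familiarity[i] for i in images]
rho, p = spearmanr(fam, variation)
print(f"Spearman rho between familiarity and naming variation: {rho:.2f}")
```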

JWSign: A Highly Multilingual Corpus of Bible Translations for more Diversity in Sign Language Processing

  • paper_url: http://arxiv.org/abs/2311.10174
  • repo_url: https://github.com/shesterg/jwsign-machine-translation
  • paper_authors: Shester Gueuwou, Sophie Siake, Colin Leong, Mathias Müller
  • for: To provide a large, highly multilingual sign language dataset that supports progress in sign language recognition, translation, and production.
  • methods: Neural machine translation experiments on the JWSign dataset, including bilingual baselines and multilingual systems, some of which account for the typological relatedness of the signed or spoken languages.
  • results: Multilingual systems outperform bilingual baselines, and in higher-resource scenarios, clustering typologically related language pairs improves translation quality.
    Abstract Advancements in sign language processing have been hindered by a lack of sufficient data, impeding progress in recognition, translation, and production tasks. The absence of comprehensive sign language datasets across the world's sign languages has widened the gap in this field, resulting in a few sign languages being studied more than others, making this research area extremely skewed mostly towards sign languages from high-income countries. In this work we introduce a new large and highly multilingual dataset for sign language translation: JWSign. The dataset consists of 2,530 hours of Bible translations in 98 sign languages, featuring more than 1,500 individual signers. On this dataset, we report neural machine translation experiments. Apart from bilingual baseline systems, we also train multilingual systems, including some that take into account the typological relatedness of signed or spoken languages. Our experiments highlight that multilingual systems are superior to bilingual baselines, and that in higher-resource scenarios, clustering language pairs that are related improves translation quality.
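The abstract does not say how related language pairs are clustered; one hedged possibility, assuming a typological feature vector per language is available (for example from a typological database such as URIEL, which is an assumption here and not the paper's stated source), is plain hierarchical clustering:

```python
# Hedged sketch: group languages by typological similarity before multilingual training.
# The feature vectors are random placeholders standing in for real typological features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
languages = ["ase", "bfi", "gsg", "fsl", "csl"]      # illustrative ISO codes for sign languages
features = rng.random((len(languages), 32))           # placeholder typological feature vectors

dist = pdist(features, metric="cosine")
clusters = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
for lang, c in zip(languages, clusters):
    print(lang, "-> cluster", c)
```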

A Computationally Efficient Sparsified Online Newton Method

  • paper_url: http://arxiv.org/abs/2311.10085
  • repo_url: None
  • paper_authors: Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon
  • for: To propose a scalable second-order method that makes deep neural network training faster and more efficient.
  • methods: The Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner; it arises from a novel use of the LogDet matrix divergence combined with sparsity constraints to minimize regret in the online convex optimization framework.
  • results: Up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss compared to memory-efficient optimizers, including first-order methods. Imposing structured sparsity patterns such as tridiagonal and banded structure adds little to no overhead, making SONew as efficient and parallelizable as first-order methods and easy to scale to large experiments.
    Abstract Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. Powering the method is a surprising fact -- imposing structured sparsity patterns, like tridiagonal and banded structure, requires little to no overhead, making it as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about 3% slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew
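The claim that tridiagonal and banded sparsity add little overhead comes down to the preconditioned step being a banded matrix-vector product. A minimal sketch of applying a tridiagonal preconditioner to a gradient follows; the preconditioner entries are placeholders, since the LogDet-based update that actually produces them is the subject of the paper and its released code.

```python
# Sketch of the cheap part of a SONew-style step: applying a tridiagonal preconditioner X
# to a gradient g, i.e. the step direction X @ g. The entries of X are placeholders;
# computing them from the LogDet divergence with sparsity constraints is the paper's contribution.
import numpy as np
from scipy.sparse import diags

dim = 1_000_000                                   # parameter count for one flattened tensor
rng = np.random.default_rng(0)
g = rng.normal(size=dim)                          # gradient

main = np.ones(dim)                               # placeholder main diagonal
off = 0.1 * np.ones(dim - 1)                      # placeholder off-diagonal band
X = diags([off, main, off], offsets=[-1, 0, 1])   # tridiagonal preconditioner, O(dim) memory

lr = 1e-3
step = lr * (X @ g)                               # banded matvec: O(dim) time, like first-order methods
print(step[:5])
```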

Characterizing Tradeoffs in Language Model Decoding with Informational Interpretations

  • paper_url: http://arxiv.org/abs/2311.10083
  • repo_url: None
  • paper_authors: Chung-Ching Chang, William W. Cohen, Yun-Hsuan Sung
  • for: To propose a theoretical framework for formulating language model decoder algorithms.
  • methods: Dynamic programming and information theory: the design of decoder algorithms is lifted from the logit space to the action-state value function space, and each component of that space is given an information-theoretic interpretation.
  • results: The lifting and interpretation make explicit what each decoding algorithm is optimized for, facilitating the arbitration of tradeoffs in sensibleness, diversity, and attribution.
    Abstract We propose a theoretical framework for formulating language model decoder algorithms with dynamic programming and information theory. With dynamic programming, we lift the design of decoder algorithms from the logit space to the action-state value function space, and show that the decoding algorithms are consequences of optimizing the action-state value functions. Each component in the action-state value function space has an information theoretical interpretation. With the lifting and interpretation, it becomes evident what the decoder algorithm is optimized for, and hence facilitating the arbitration of the tradeoffs in sensibleness, diversity, and attribution.

DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

  • paper_url: http://arxiv.org/abs/2311.10081
  • repo_url: None
  • paper_authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran
  • for: The paper aims to improve the performance of large vision language models (LVLMs) by incorporating natural language feedback (NLF) to enhance their alignment and interactions.
  • methods: The paper proposes a novel categorization of NLF into two types: critique and refinement, and uses conditional reinforcement learning to train the LVLMs to incorporate feedback in multi-turn interactions.
  • results: The paper shows that the proposed method, called DRESS, can generate more helpful, honest, and harmless responses, and more effectively learn from feedback during multi-turn interactions compared to state-of-the-art LVLMs.
    Abstract We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback, they are still prone to generate unhelpful, hallucinated, or harmful responses. Second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these, we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs-- which focuses on LVLMs' ability to refine responses by incorporating feedback in multi-turn interactions. To address the non-differentiable nature of NLF, we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to SOTA LVMLs.

Unambiguity and Fewness for Nonuniform Families of Polynomial-Size Nondeterministic Finite Automata

  • paper_url: http://arxiv.org/abs/2311.09979
  • repo_url: None
  • paper_authors: Tomoyuki Yamakami
  • for: To solve nonuniform families of promise decision problems with families of finite automata.
  • methods: Nonuniform families of polynomial-size finite automata (automata with polynomially many inner states), focusing on nondeterministic variants that are unambiguous or have few accepting computation paths, or unambiguous/few computation paths leading to each fixed configuration.
  • results: When the machines are restricted to one-way head moves, some of these variants provably differ in computational power without unproven hardness assumptions; for two-way machines on instances of polynomially bounded length, families of two-way polynomial-size nondeterministic finite automata are equivalent in power to families of polynomial-size unambiguous finite automata.
    Abstract Nonuniform families of polynomial-size finite automata, which are series of indexed finite automata having polynomially many inner states, are used in the past literature to solve nonuniform families of promise decision problems. Among such nonuniform families of finite automata, we focus our attention, in particular, on the variants of nondeterministic finite automata, which have at most "one" (unambiguous), "polynomially many" (few) accepting computation paths, or unambiguous/few computation paths leading to each fixed configuration. When such machines are limited to make only one-way head moves, we can prove with no unproven hardness assumptions that some of these variants are different in computational power from each other. As for two-way machines restricted to instances of polynomially-bounded length, families of two-way polynomial-size nondeterministic finite automata are equivalent in power to families of polynomial-size unambiguous finite automata.
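For readers unfamiliar with the terminology, the unambiguity and fewness restrictions mentioned in the abstract can be stated as follows; this is a textbook-style restatement consistent with the abstract, not a quotation from the paper.

```latex
% Path-counting restrictions on NFAs (textbook-style definitions).
% \#\mathrm{acc}_N(x) denotes the number of accepting computation paths of an NFA N on input x.
N \text{ is \emph{unambiguous}} \iff \#\mathrm{acc}_N(x) \le 1 \text{ for every input } x.
\{N_n\}_{n\in\mathbb{N}} \text{ is a \emph{few} family} \iff
  \exists \text{ polynomial } p :\ \#\mathrm{acc}_{N_n}(x) \le p(|x|)
  \text{ for all } n \text{ and all inputs } x.
```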

Hijacking Large Language Models via Adversarial In-Context Learning

  • paper_url: http://arxiv.org/abs/2311.09948
  • repo_url: None
  • paper_authors: Yao Qiang, Xiangyu Zhou, Dongxiao Zhu
  • for: To study the robustness of in-context learning (ICL), where labeled examples serve as demonstrations to adapt LLMs to specific tasks.
  • methods: A gradient-based prompt search method that learns and appends imperceptible adversarial suffixes to the in-context demonstrations in order to hijack the LLM's generation.
  • results: The attack causes LLMs to produce targeted unwanted outputs by distracting their attention toward the adversarial tokens, as shown across various tasks and datasets.
    Abstract In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific tasks by utilizing labeled examples as demonstrations in the precondition prompts. Despite its promising performance, ICL suffers from instability with the choice and arrangement of examples. Additionally, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable attack for ICL, aiming to hijack LLMs to generate the targeted response. The proposed LLM hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demonstrations. Extensive experimental results on various tasks and datasets demonstrate the effectiveness of our LLM hijacking attack, resulting in a distracted attention towards adversarial tokens, consequently leading to the targeted unwanted outputs.

An Attention-Based Denoising Framework for Personality Detection in Social Media Texts

  • paper_url: http://arxiv.org/abs/2311.09945
  • repo_url: https://github.com/once2gain/personalitydetection
  • paper_authors: Qirui Tang, Wenkang Jiang, Yihua Du, Lei Lin
  • for: To improve the accuracy of personality detection from noisy social media texts.
  • methods: An attention-based information extraction mechanism (AIEM) for long texts and an attention-based denoising framework (ADF) for personality detection.
  • results: State-of-the-art performance on two commonly used datasets, including an average accuracy improvement of 10.2% on the gold-standard Twitter-Myers-Briggs Type Indicator (Twitter-MBTI) dataset.
    Abstract In social media networks, users produce a large amount of text content anytime, providing researchers with a valuable approach to digging for personality-related information. Personality detection based on user-generated texts is a universal method that can be used to build user portraits. The presence of noise in social media texts hinders personality detection. However, previous studies have not fully addressed this challenge. Inspired by the scanning reading technique, we propose an attention-based information extraction mechanism (AIEM) for long texts, which is applied to quickly locate valuable pieces of information, and focus more attention on the deep semantics of key pieces. Then, we provide a novel attention-based denoising framework (ADF) for personality detection tasks and achieve state-of-the-art performance on two commonly used datasets. Notably, we obtain an average accuracy improvement of 10.2% on the gold standard Twitter-Myers-Briggs Type Indicator (Twitter-MBTI) dataset. We made our code publicly available on GitHub. We shed light on how AIEM works to magnify personality-related signals.

Language Generation from Human Brain Activities

  • paper_url: http://arxiv.org/abs/2311.09889
  • repo_url: None
  • paper_authors: Ziyi Ye, Qingyao Ai, Yiqun Liu, Min Zhang, Christina Lioma, Tuukka Ruotsalo
  • for: To develop a language generation system based on non-invasive brain-computer interfaces (BCIs).
  • methods: A large language model (LLM) combined with a semantic brain decoder to generate language directly from functional magnetic resonance imaging (fMRI) input.
  • results: The model generates coherent language sequences aligned with the semantic content of perceived visual or auditory language stimuli, without relying on pre-generated candidates, and outperforms a random control, a pre-generated language selection approach, and a standard LLM that generates text solely from next-word likelihood.
    Abstract Generating human language through non-invasive brain-computer interfaces (BCIs) has the potential to unlock many applications, such as serving disabled patients and improving communication. Currently, however, generating language via BCIs has been previously successful only within a classification setup for selecting pre-generated sentence continuation candidates with the most likely cortical semantic representation. Inspired by recent research that revealed associations between the brain and the large computational language models, we propose a generative language BCI that utilizes the capacity of a large language model (LLM) jointly with a semantic brain decoder to directly generate language from functional magnetic resonance imaging (fMRI) input. The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli perceived, without prior knowledge of any pre-generated candidates. We compare the language generated from the presented model with a random control, pre-generated language selection approach, and a standard LLM, which generates common coherent text solely based on the next word likelihood according to statistical language training data. The proposed model is found to generate language that is more aligned with semantic stimulus in response to which brain input is sampled. Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.

Which Modality should I use – Text, Motif, or Image? : Understanding Graphs with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09862
  • repo_url: None
  • paper_authors: Debarati Das, Ishaan Gupta, Jaideep Srivastava, Dongyeop Kang
  • for: To explore how to better integrate graph data with large language models (LLMs) and improve their effectiveness on complex graph structures.
  • methods: Different encoding modalities (text, image, and motif) and various prompting methods that approximate the global connectivity of a graph; GraphTMI, a new benchmark for evaluating LLMs on graph structure analysis, is also introduced.
  • results: The image modality, supported by advanced vision-language models such as GPT-4V, manages token limits while retaining critical information more effectively than text; the study also examines how factors such as homophily, motif presence, and graph difficulty affect each encoding modality.
    Abstract Large language models (LLMs) are revolutionizing various fields by leveraging large text corpora for context-aware intelligence. Due to the context size, however, encoding an entire graph with LLMs is fundamentally limited. This paper explores how to better integrate graph data with LLMs and presents a novel approach using various encoding modalities (e.g., text, image, and motif) and approximation of global connectivity of a graph using different prompting methods to enhance LLMs' effectiveness in handling complex graph structures. The study also introduces GraphTMI, a new benchmark for evaluating LLMs in graph structure analysis, focusing on factors such as homophily, motif presence, and graph difficulty. Key findings reveal that image modality, supported by advanced vision-language models like GPT-4V, is more effective than text in managing token limits while retaining critical information. The research also examines the influence of different factors on each encoding modality's performance. This study highlights the current limitations and charts future directions for LLMs in graph understanding and reasoning tasks.
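As a concrete example of what the text encoding modality might look like, the snippet below serializes a small graph into an edge-list prompt. The prompt template is only a plausible stand-in; the paper's exact text, motif, and image encodings are not given in the abstract.

```python
# Hedged sketch of a text-modality graph encoding for an LLM prompt.
# The template is illustrative, not the paper's exact encoding.
import networkx as nx

G = nx.karate_club_graph()                       # small example graph

edge_list = ", ".join(f"({u}, {v})" for u, v in G.edges())
prompt = (
    "You are given an undirected graph with "
    f"{G.number_of_nodes()} nodes and {G.number_of_edges()} edges.\n"
    f"Edges: {edge_list}\n"
    "Question: does the graph contain a triangle involving node 0? Answer yes or no."
)
print(prompt[:300])
```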

GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets

  • paper_url: http://arxiv.org/abs/2311.09860
  • repo_url: None
  • paper_authors: Wolfgang Otto, Matthäus Zloch, Lu Gan, Saurav Karmakar, Stefan Dietze
  • for: To provide fine-grained entity recognition for mentions of machine learning models and datasets in scholarly papers.
  • methods: A corpus of 100 manually annotated full-text scientific publications and a first BERT-based baseline model.
  • results: Annotations cover 10 entity types centered around ML models and datasets, including informal mentions such as "our BERT-based model" or "an image CNN"; the manually annotated full-text corpus is released for further research and applications.
    Abstract Named Entity Recognition (NER) models play a crucial role in various NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning models and datasets are fundamental components of various computer science publications and necessitate accurate models for identification. Despite the advancements in NER, existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types, and consequently, baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. In order to provide a nuanced understanding of how ML models and datasets are mentioned and utilized, our dataset also contains annotations for informal mentions like "our BERT-based model" or "an image CNN". You can find the ground truth dataset and code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
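The baseline is described only as BERT-based; a generic token-classification setup of the kind such a baseline typically uses is sketched below. The model name is a publicly available general-purpose NER checkpoint used as a placeholder, not the GSAP-NER model, and it does not predict the paper's 10 scholarly entity types.

```python
# Generic token-classification sketch in the spirit of a BERT-based NER baseline.
# "dslim/bert-base-NER" is a public general-purpose placeholder model, not the paper's checkpoint.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "We fine-tune our BERT-based model on the SQuAD dataset and compare it with an image CNN."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(ent["score"], 2))
```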

Overview of the HASOC Subtrack at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span Detection

  • paper_url: http://arxiv.org/abs/2311.09834
  • repo_url: None
  • paper_authors: Sarah Masud, Mohammad Aflah Khan, Md. Shad Akhtar, Tanmoy Chakraborty
  • for: To develop computational methods that mitigate hate speech on the web.
  • methods: The HateNorm shared task at HASOC-FIRE 2023, which targets the detection of explicit hate spans in English tweets; span information supports both reactive flagging by black-box models and proactive rephrasing suggestions.
  • results: Twelve teams participated in the explicit span detection task, with the highest macro-F1 reaching 0.58.
    Abstract As hate speech continues to proliferate on the web, it is becoming increasingly important to develop computational methods to mitigate it. Reactively, using black-box models to identify hateful content can perplex users as to why their posts were automatically flagged as hateful. On the other hand, proactive mitigation can be achieved by suggesting rephrasing before a post is made public. However, both mitigation techniques require information about which part of a post contains the hateful aspect, i.e., what spans within a text are responsible for conveying hate. Better detection of such spans can significantly reduce explicitly hateful content on the web. To further contribute to this research area, we organized HateNorm at HASOC-FIRE 2023, focusing on explicit span detection in English Tweets. A total of 12 teams participated in the competition, with the highest macro-F1 observed at 0.58.

X-Mark: Towards Lossless Watermarking Through Lexical Redundancy

  • paper_url: http://arxiv.org/abs/2311.09832
  • repo_url: None
  • paper_authors: Liang Chen, Yatao Bian, Yang Deng, Shuaiyi Li, Bingzhe Wu, Peilin Zhao, Kam-fai Wong
  • for: This paper focuses on text watermarking, which is important for detecting machine-generated text.
  • methods: The authors introduce XMark, a novel approach that leverages lexical redundancy (synonyms) within the vocabulary to improve text generation fluency while maintaining watermark detectability.
  • results: Theoretical analyses and empirical evidence show that XMark outperforms existing methods in retaining the emergent abilities of large language models, including zero-shot and few-shot knowledge recall, logical reasoning, and instruction following.
    Abstract Text watermarking has emerged as an important technique for detecting machine-generated text. However, existing methods can severely degrade text quality due to arbitrary vocabulary partitioning, which disrupts the language model's expressiveness and impedes textual coherence. To mitigate this, we introduce XMark, a novel approach that capitalizes on text redundancy within the lexical space. Specifically, XMark incorporates a mutually exclusive rule for synonyms during the language model decoding process, thereby integrating prior knowledge into vocabulary partitioning and preserving the capabilities of language generation. We present theoretical analyses and empirical evidence demonstrating that XMark substantially enhances text generation fluency while maintaining watermark detectability. Furthermore, we investigate watermarking's impact on the emergent abilities of large language models, including zero-shot and few-shot knowledge recall, logical reasoning, and instruction following. Our comprehensive experiments confirm that XMark consistently outperforms existing methods in retaining these crucial capabilities of LLMs.

FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09829
  • repo_url: None
  • paper_authors: Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng Wang, Deyi Xiong
  • for: Assessing the instruction-following ability of large language models (LLMs) is of paramount importance: a model that cannot follow human instructions may fail to provide reliable and helpful responses.
  • methods: The FollowEval benchmark, a multilingual instruction-following benchmark with English and Chinese test examples, all crafted by human experts, covering string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints.
  • results: Various LLMs evaluated with FollowEval perform far below humans, highlighting substantial room for improvement in their instruction-following ability.
    Abstract The effective assessment of the instruction-following ability of large language models (LLMs) is of paramount importance. A model that cannot adhere to human instructions might be not able to provide reliable and helpful responses. In pursuit of this goal, various benchmarks have been constructed to evaluate the instruction-following capacity of these models. However, these benchmarks are limited to a single language and are constructed using automated approaches, which restricts their applicability and the quality of the test examples they contain. To bridge this gap, we introduce the FollowEval benchmark in this paper. This benchmark is composed of instances in both English and Chinese, and all test examples are crafted by human experts. Furthermore, the FollowEval benchmark is designed to assess LLMs across five critical dimensions of instruction following: string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. To enhance the complexity and present a sufficient challenge, each test example is designed to evaluate more than one dimension. We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans. This highlights the considerable room for improvement in the instruction-following ability of these models.

AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages

  • paper_url: http://arxiv.org/abs/2311.09828
  • repo_url: https://github.com/Unbabel/COMET
  • paper_authors: Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Marek Masiak, Xuanli He, Sofia Bourhim, Andiswa Bukula, Muhidin Mohamed, Temitayo Olatoye, Hamam Mokayede, Christine Mwase, Wangui Kimotho, Foutse Yuehgoh, Anuoluwapo Aremu, Jessica Ojo, Shamsuddeen Hassan Muhammad, Salomey Osei, Abdul-Hakeem Omotayo, Chiamaka Chukwuneke, Perez Ogayo, Oumaima Hourrane, Salma El Anigri, Lolwethu Ndolela, Thabiso Mangwana, Shafie Abdi Mohamed, Ayinde Hassan, Oluwabusayo Olufunke Awoyomi, Lama Alkhaled, Sana Al-Azzawi, Naome A. Etori, Millicent Ochieng, Clemencia Siro, Samuel Njoroge, Eric Muchiri, Wangari Kimotho, Lyse Naomi Wamba Momo, Daud Abolade, Simbiat Ajao, Tosin Adewumi, Iyanuoluwa Shode, Ricky Macharm, Ruqayya Nasir Iro, Saheed S. Abdullahi, Stephen E. Moore, Bernard Opoku, Zainab Akinjobi, Abeeb Afolabi, Nnaemeka Obiefuna, Onyekachi Raphael Ogbu, Sam Brian, Verrah Akinyi Otiende, Chinedu Emmanuel Mbonu, Sakayo Toadoum Sari, Pontus Stenetorp
  • for: To improve machine translation evaluation for under-resourced African languages so that translation quality can be assessed more accurately.
  • methods: High-quality human evaluation data created with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages, plus AfriCOMET, a COMET-based metric built from DA training data of high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta).
  • results: The new metric achieves state-of-the-art correlation with human judgments for African-language MT (Spearman-rank correlation of +0.406).
    Abstract Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to measure accurately the progress we have made on these languages because evaluation is often performed on n-gram matching metrics like BLEU that often have worse correlation with human judgments. Embedding-based metrics such as COMET correlate better; however, lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages by leveraging DA training data from high-resource languages and African-centric multilingual encoder (AfroXLM-Roberta) to create the state-of-the-art evaluation metric for African languages MT with respect to Spearman-rank correlation with human judgments (+0.406).
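Since AfriCOMET builds on the COMET framework (the linked repository), scoring translations would presumably follow the standard COMET API. The sketch below uses a stock public checkpoint as a placeholder because the AfriCOMET checkpoint name is not given in the abstract.

```python
# Sketch of segment-level scoring with the COMET framework (pip install unbabel-comet).
# "Unbabel/wmt22-comet-da" is a generic public checkpoint used as a placeholder;
# substitute the released AfriCOMET checkpoint to score African-language MT.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "source sentence in an African language (placeholder)",
    "mt":  "machine translation output (placeholder)",
    "ref": "human reference translation (placeholder)",
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores, output.system_score)
```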

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

  • paper_url: http://arxiv.org/abs/2311.09827
  • repo_url: None
  • paper_authors: Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen
  • for: To investigate how the cognitive structure and processes of LLMs can be attacked via jailbreaks, and how such attacks can be defended against.
  • methods: A new category of black-box jailbreak attacks targeting LLMs' cognitive structure and processes: multilingual cognitive overload, veiled expression, and effect-to-cause reasoning, evaluated on AdvBench and MasterKey.
  • results: All three forms of cognitive overload successfully jailbreak every LLM studied, including the open-source Llama 2 and the proprietary ChatGPT, while existing defense strategies can hardly mitigate the resulting malicious uses effectively.
    Abstract While large language models (LLMs) have demonstrated increasing power, they have also given rise to a wide range of harmful behaviors. As representatives, jailbreak attacks can provoke harmful or unethical responses from LLMs, even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. Empirical studies show that our cognitive overload from three perspectives can jailbreak all studied LLMs successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively.

Human Still Wins over LLM: An Empirical Study of Active Learning on Domain-Specific Annotation Tasks

  • paper_url: http://arxiv.org/abs/2311.09825
  • repo_url: None
  • paper_authors: Yuxuan Lu, Bingsheng Yao, Shao Zhang, Yun Wang, Peng Zhang, Tun Lu, Toby Jia-Jun Li, Dakuo Wang
  • for: To examine whether LLMs can surpass compact models trained with expert annotations on domain-specific tasks.
  • methods: An empirical comparison using active learning (AL) on four datasets from three different domains.
  • results: With only a few hundred labeled examples, small models can outperform GPT-3.5 and achieve higher or similar performance to GPT-4 despite being hundreds of times smaller; LLM predictions can therefore serve as a warm-up in real-world applications, while human experts remain indispensable for annotation tasks driven by domain-specific knowledge.
    Abstract Large Language Models (LLMs) have demonstrated considerable advances, and several claims have been made about their exceeding human performance. However, in real-world tasks, domain knowledge is often required. Low-resource learning methods like Active Learning (AL) have been proposed to tackle the cost of domain expert annotation, raising this question: Can LLMs surpass compact models trained with expert annotations in domain-specific tasks? In this work, we conduct an empirical experiment on four datasets from three different domains comparing SOTA LLMs with small models trained on expert annotations with AL. We found that small models can outperform GPT-3.5 with a few hundreds of labeled data, and they achieve higher or similar performance with GPT-4 despite that they are hundreds time smaller. Based on these findings, we posit that LLM predictions can be used as a warmup method in real-world applications and human experts remain indispensable in tasks involving data annotation driven by domain-specific knowledge.
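The active learning setup is not spelled out in the abstract; a minimal pool-based uncertainty-sampling loop of the kind commonly paired with small models is sketched below, with synthetic data standing in for the expert-annotated domain datasets.

```python
# Minimal pool-based active learning loop with uncertainty (entropy) sampling.
# Synthetic data stands in for the expert-annotated domain-specific datasets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))    # small seed set
pool = [i for i in range(len(X)) if i not in set(labeled)]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):                                           # 10 rounds x 20 queries = 200 "expert" labels
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)    # prediction uncertainty
    query = [pool[i] for i in np.argsort(entropy)[-20:]]      # query the most uncertain examples
    labeled.extend(query)
    pool = [i for i in pool if i not in set(query)]

clf.fit(X[labeled], y[labeled])
print("labeled examples:", len(labeled), "accuracy:", clf.score(X, y))
```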

Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.09821
  • repo_url: None
  • paper_authors: Qingyu Tan, Hwee Tou Ng, Lidong Bing
  • for: To improve the temporal knowledge reasoning of large language models (LLMs), particularly multi-answer and multi-hop temporal reasoning.
  • methods: Complex-TR, a complex temporal question-answering (QA) dataset focused on multi-answer and multi-hop temporal reasoning, together with a novel data augmentation strategy to improve the complex temporal reasoning capability and robustness of LLMs.
  • results: Experiments on multiple temporal QA datasets show that the method improves LLMs' performance on temporal QA benchmarks by significant margins over baselines.
    Abstract Knowledge in the real world is being updated constantly. However, it is costly to frequently update large language models (LLMs). Therefore, it is crucial for LLMs to understand the concept of temporal knowledge. However, prior works on temporal question answering did not emphasize multi-answer and multi-hop types of temporal reasoning. In this paper, we propose a complex temporal question-answering (QA) dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning. Besides, we also propose a novel data augmentation strategy to improve the complex temporal reasoning capability and robustness of LLMs. We conducted experiments on multiple temporal QA datasets. Experimental results show that our method is able to improve LLMs' performance on temporal QA benchmarks by significant margins.

SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09818
  • repo_url: https://github.com/stanford-oval/suql
  • paper_authors: Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina J. Semnani, Chen Jie Yu, Gui Dávid, Monica S. Lam
  • for: To build conversational interfaces over knowledge sources that mix structured data (such as relational databases) and unstructured free text.
  • methods: SUQL (Structured and Unstructured Query Language), an executable formal representation that augments SQL with free-text primitives to naturally cover compositions of structured and unstructured data queries, together with a conversational search agent based on large language models.
  • results: On a crowdsourced dataset of questions and conversations about real restaurants, over 51% of which require both structured and unstructured data, the few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 89.3% of the time, versus 65.0% for a strong, commonly used baseline.
    Abstract Many knowledge sources consist of both structured information such as relational databases as well as unstructured free text. Building a conversational interface to such data sources is challenging. This paper introduces SUQL, Structured and Unstructured Query Language, the first formal executable representation that naturally covers compositions of structured and unstructured data queries. Specifically, it augments SQL with several free-text primitives to form a precise, succinct, and expressive representation. This paper also presents a conversational search agent based on large language models, including a few-shot contextual semantic parser for SUQL. To validate our approach, we introduce a dataset consisting of crowdsourced questions and conversations about real restaurants. Over 51% of the questions in the dataset require both structured and unstructured data, suggesting that it is a common phenomenon. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 89.3% of the time, compared to just 65.0% for a strong and commonly used baseline.

Large Language Models for Propaganda Span Annotation

  • paper_url: http://arxiv.org/abs/2311.09812
  • repo_url: None
  • paper_authors: Maram Hasanain, Fatema Ahmed, Firoj Alam
  • for: To detect propaganda techniques in online communication and test whether a large language model (GPT-4) can perform the task of an annotator for propagandistic text spans.
  • methods: An in-house dataset with span annotations from multiple annotators, used to prompt GPT-4 to annotate propagandistic spans under prompts that provide varying amounts of information.
  • results: Providing more information to the model as prompts improves annotation agreement and performance relative to human annotations; the authors plan to release the annotations from multiple annotators, including GPT-4, to the community.
    Abstract The use of propagandistic techniques in online communication has increased in recent years, aiming to manipulate online audiences. Efforts to automatically detect and debunk such content have been made, addressing various modeling scenarios. These include determining whether the content (text, image, or multimodal) (i) is propagandistic, (ii) employs one or more techniques, and (iii) includes techniques with identifiable spans. Significant research efforts have been devoted to the first two scenarios compared to the latter. Therefore, in this study, we focus on the task of detecting propagandistic textual spans. We investigate whether large language models such as GPT-4 can be utilized to perform the task of an annotator. For the experiments, we used an in-house developed dataset consisting of annotations from multiple annotators. Our results suggest that providing more information to the model as prompts improves the annotation agreement and performance compared to human annotations. We plan to make the annotated labels from multiple annotators, including GPT-4, available for the community.

PixT3: Pixel-based Table To Text generation

  • paper_url: http://arxiv.org/abs/2311.09808
  • repo_url: None
  • paper_authors: Iñigo Alonso, Eneko Agirre, Mirella Lapata
  • for: To propose a multimodal table-to-text model that improves the efficiency and accuracy of table-to-text generation.
  • methods: An image representation of tables instead of the usual textual linearization, which is more space-efficient, combined with a new intermediate training curriculum that reinforces table structural awareness.
  • results: PixT3 outperforms the state of the art on the ToTTo benchmark in the pure table-to-text setting, remains competitive in the controlled setting, and generalizes better to unseen datasets, surpassing the ToTTo state of the art in all generation settings.
    Abstract Table-to-Text has been traditionally approached as a linear language to text problem. However, visually represented tables are rich in visual information and serve as a concise, effective form of representing data and its relationships. When using text-based approaches, after the linearization process, this information is either lost or represented in a space inefficient manner. This inefficiency has remained a constant challenge for text-based approaches making them struggle with large tables. In this paper, we demonstrate that image representation of tables are more space-efficient than the typical textual linearizations, and multi-modal approaches are competitive in Table-to-Text tasks. We present PixT3, a multimodal table-to-text model that outperforms the state-of-the-art (SotA) in the ToTTo benchmark in a pure Table-to-Text setting while remaining competitive in controlled Table-to-Text scenarios. It also generalizes better in unseen datasets, outperforming ToTTo SotA in all generation settings. Additionally, we introduce a new intermediate training curriculum to reinforce table structural awareness, leading to improved generation and overall faithfulness of the models.
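The abstract contrasts textual linearization with image representations of tables. One simple way to produce such an image input, shown purely to illustrate the idea rather than the paper's rendering pipeline, is to rasterize the table with matplotlib:

```python
# Illustration of the image-modality idea: rasterize a small table to a PNG that a
# vision encoder could consume. The rendering details are not the paper's pipeline.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

header = ["Team", "Wins", "Losses"]
rows = [["Lions", "10", "2"], ["Tigers", "7", "5"], ["Bears", "3", "9"]]

fig, ax = plt.subplots(figsize=(3, 1.5))
ax.axis("off")
table = ax.table(cellText=rows, colLabels=header, loc="center")
table.scale(1, 1.4)
fig.savefig("table.png", dpi=200, bbox_inches="tight")
print("wrote table.png")
```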

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

  • paper_url: http://arxiv.org/abs/2311.09807
  • repo_url: None
  • paper_authors: Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel
  • for: To investigate the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly common way to augment the limited supply of human-generated training data.
  • methods: A set of novel metrics targeting lexical, syntactic, and semantic diversity, applied in recursive fine-tuning experiments across various natural language generation tasks.
  • results: The diversity of the models' outputs decreases markedly over successive iterations, underscoring the potential risks this training approach poses to the linguistic richness of LLMs.
    Abstract This study investigates the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly prevalent practice aimed at addressing the limited supply of human-generated training data. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we developed a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive fine-tuning experiments across various natural language generation tasks. Our findings reveal a marked decrease in the diversity of the models' outputs through successive iterations. This trend underscores the potential risks of training LLMs on predecessor-generated text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of LLMs.
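The paper's exact diversity metrics are not given in the abstract; as a simple illustration of the lexical end of the spectrum, the snippet below computes a type-token ratio and distinct-n scores over a set of generated texts. These are common generic metrics, assumed here rather than taken from the paper.

```python
# Simple lexical-diversity illustration: type-token ratio and distinct-n over generations.
from itertools import chain

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(texts, n):
    all_ngrams = list(chain.from_iterable(ngrams(t.split(), n) for t in texts))
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

generations = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "a dog slept under the table",
]
tokens = list(chain.from_iterable(t.split() for t in generations))
print("type-token ratio:", len(set(tokens)) / len(tokens))
print("distinct-1:", distinct_n(generations, 1), "distinct-2:", distinct_n(generations, 2))
```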

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

  • paper_url: http://arxiv.org/abs/2311.09805
  • repo_url: None
  • paper_authors: Yilun Zhao, Yitao Long, Hongjun Liu, Linyong Nan, Lyuhao Chen, Ryo Kamoi, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan
  • for: This paper aims to evaluate the numerical reasoning and problem-solving capabilities of large language models (LLMs) in the context of understanding and analyzing financial documents.
  • methods: The paper introduces DocMath-Eval, a comprehensive benchmark that incorporates different prompting strategies to assess the capabilities and limitations of existing LLMs in understanding financial documents.
  • results: The current best-performing system (GPT-4) can perform well on simple problems, but significantly lags behind human experts in more complex problems grounded in longer contexts. The paper concludes that DocMath-Eval can be used as a valuable benchmark to evaluate LLMs’ capabilities to solve challenging numerical reasoning problems in expert domains.
    Abstract Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 19 LLMs, including those specialized in coding and finance. We also incorporate different prompting strategies (i.e., Chain-of-Thoughts and Program-of-Thoughts) to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. We found that, although the current best-performing system (i.e., GPT-4), can perform well on simple problems such as calculating the rate of increase in a financial metric within a short document context, it significantly lags behind human experts in more complex problems grounded in longer contexts. We believe DocMath-Eval can be used as a valuable benchmark to evaluate LLMs' capabilities to solve challenging numerical reasoning problems in expert domains. We will release the benchmark and code at https://github.com/yale-nlp/DocMath-Eval.

$\textit{Dial BeInfo for Faithfulness}$: Improving Factuality of Information-Seeking Dialogue via Behavioural Fine-Tuning

  • paper_url: http://arxiv.org/abs/2311.09800
  • repo_url: None
  • paper_authors: Evgeniia Razumovskaia, Ivan Vulić, Pavle Marković, Tomasz Cichy, Qian Zheng, Tsung-Hsien Wen, Paweł Budzianowski
  • for: 提高信息寻找对话系统的准确性和可靠性,使其响应用户的询问能够提供有用和适合知识源的回答。
  • methods: 使用行为调整来改善信息寻找对话系统的准确性和可靠性,以避免现象抽象和假象。
  • results: 在三个标准数据集和多个领域上进行了调优;模型在零样本情况下应用于未见领域时同样表现出色,并且在真实生产对话上表现更佳,超过GPT-4。
    Abstract Factuality is a crucial requirement in information seeking dialogue: the system should respond to the user's queries so that the responses are meaningful and aligned with the knowledge provided to the system. However, most modern large language models suffer from hallucinations, that is, they generate responses not supported by or contradicting the knowledge source. To mitigate the issue and increase faithfulness of information-seeking dialogue systems, we introduce BeInfo, a simple yet effective method that applies behavioural tuning to aid information-seeking dialogue. Relying on three standard datasets, we show that models tuned with BeInfo become considerably more faithful to the knowledge source both for datasets and domains seen during BeInfo-tuning, as well as on unseen domains, when applied in a zero-shot manner. In addition, we show that the models with 3B parameters (e.g., Flan-T5) tuned with BeInfo demonstrate strong performance on data from real `production' conversations and outperform GPT4 when tuned on a limited amount of such realistic in-domain dialogues.
    摘要 事实性是信息寻求对话中的关键要求:系统应针对用户的询问给出有意义且与提供给系统的知识一致的回答。然而,大多数现代大语言模型存在幻觉问题,即生成的回答得不到知识源支持,甚至与之相矛盾。为缓解这一问题并提高信息寻求对话系统的忠实度,我们提出了BeInfo,一种简单而有效的方法,通过行为调优来辅助信息寻求对话。基于三个标准数据集的实验表明,经BeInfo调优的模型无论是在调优时见过的数据集和领域上,还是以零样本方式应用于未见领域时,都显著更加忠实于知识源。此外,参数量为3B的模型(如Flan-T5)经BeInfo调优后,在真实"生产"对话数据上表现强劲,并且仅在少量此类真实域内对话上调优后即可超过GPT-4。

How Far Can We Extract Diverse Perspectives from Large Language Models? Criteria-Based Diversity Prompting!

  • paper_url: http://arxiv.org/abs/2311.09799
  • repo_url: https://github.com/minnesotanlp/diversity-extraction-from-llms
  • paper_authors: Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, Dongyeop Kang
  • for: 这个研究旨在检验LLMs是否可以生成多元观点和理由,以及可以不同程度检验LLMs的多元观点生成能力。
  • methods: 研究使用了一种基于标准的提问技术来评估LLMs的多元观点生成能力,并使用了句子嵌入和距离度量来衡量 semantics 多元性。
  • results: 研究发现,基于准则的提问技术能够有效引出并衡量LLMs的多元观点;在仇恨言语标注和故事续写等任务上的实验进一步表明,LLMs能够依据任务的主观程度生成多样的意见。
    Abstract Collecting diverse human data on subjective NLP topics is costly and challenging. As Large Language Models (LLMs) have developed human-like capabilities, there is a recent trend in collaborative efforts between humans and LLMs for generating diverse data, offering potential scalable and efficient solutions. However, the extent of LLMs' capability to generate diverse perspectives on subjective topics remains an unexplored question. In this study, we investigate LLMs' capacity for generating diverse perspectives and rationales on subjective topics, such as social norms and argumentative texts. We formulate this problem as diversity extraction in LLMs and propose a criteria-based prompting technique to ground diverse opinions and measure perspective diversity from the generated criteria words. Our results show that measuring semantic diversity through sentence embeddings and distance metrics is not enough to measure perspective diversity. To see how far we can extract diverse perspectives from LLMs, or called diversity coverage, we employ a step-by-step recall prompting for generating more outputs from the model in an iterative manner. As we apply our prompting method to other tasks (hate speech labeling and story continuation), indeed we find that LLMs are able to generate diverse opinions according to the degree of task subjectivity.
    摘要 收集关于主观NLP话题的多样化人类数据成本高昂且具有挑战性。随着大语言模型(LLMs)展现出类人能力,近来出现了由人类与LLMs协作生成多样数据的趋势,有望提供可扩展且高效的解决方案。然而,LLMs在主观话题上生成多样视角的能力究竟如何,仍是一个未被探索的问题。在本研究中,我们考察LLMs在社会规范、论证性文本等主观话题上生成多样视角及其理由的能力。我们将该问题形式化为LLMs中的多样性提取,并提出一种基于准则的提示技术,用以锚定多样意见,并依据生成的准则词衡量视角多样性。结果表明,仅通过句子嵌入和距离度量来衡量语义多样性,并不足以衡量视角多样性。为了考察能从LLMs中提取多少多样视角(即多样性覆盖率),我们采用逐步召回提示,迭代地让模型生成更多输出。将该提示方法应用于其他任务(仇恨言语标注和故事续写)时,我们发现LLMs确实能够依据任务的主观程度生成多样的意见。
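The abstract notes that sentence embeddings plus a distance metric are one (insufficient) baseline for quantifying diversity. A minimal sketch of that baseline measure follows, with bag-of-words vectors standing in for real sentence embeddings.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a sentence encoder: a simple bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_distance(opinions):
    """Average cosine distance over all pairs; higher = more semantically spread out."""
    vecs = [embed(o) for o in opinions]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(1 - cosine(vecs[i], vecs[j]) for i, j in pairs) / max(len(pairs), 1)

opinions = ["Tipping should be optional and based on service quality.",
            "Tipping is essential because workers depend on it.",
            "Tipping should be optional, depending on the service."]
print(round(mean_pairwise_distance(opinions), 3))
```

The paper's point is that a score like this can stay high even when the underlying stances are few, which motivates the criteria-word-based measure instead.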

KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains

  • paper_url: http://arxiv.org/abs/2311.09797
  • repo_url: None
  • paper_authors: Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, Arman Cohan
  • for: 评估语言模型在金融领域的应用能力,特别是解决复杂数学问题。
  • methods: 使用 KnowledgeMath benchmark,包括 1,259 个问题,具有混合文本和表格内容,并提供专家注解的详细解决方案。
  • results: 评估了 14 种不同的语言模型,其中最高级别的系统 (GPT-4 with Program-of-Thoughts) 的准确率只有 45.4%,而知识扩展的 LLMs 可以提高性能 (如 GPT-3.5 从 23.9% 提高到 32.0%),但仍然远低于人类专家的估计性能 (94%)。
    Abstract We introduce KnowledgeMath, a novel benchmark designed to evaluate LLMs' capabilities in applying financial knowledge to solve complex math word problems. Compared to prior works, this study features three core advancements. First, KnowledgeMath includes 1,259 problems with a hybrid of textual and tabular content which require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. Finally, we evaluate a wide spectrum of 14 LLMs with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. The current best-performing system (i.e., GPT-4 with Program-of-Thoughts) achieves only 45.4% accuracy, leaving substantial room for improvement. While knowledge-augmented LLMs can improve the performance (e.g., from 23.9% to 32.0% for GPT-3.5), it is still significantly lower than the estimated human expert performance of 94%. We believe that KnowledgeMath can facilitate future research on domain-specific knowledge retrieval and augmentation into the math word problem-solving process. We will release the benchmark and code at https://github.com/yale-nlp/KnowledgeMath.
    摘要 我们介绍KnowledgeMath,一个用于评估LLM运用金融知识求解复杂数学应用题能力的新基准。与以往工作相比,本研究有三项核心进展:第一,KnowledgeMath包含1,259个混合文本与表格内容的问题,需要大学水平的金融领域知识才能有效求解;第二,我们提供了专家标注、以Python程序形式给出的详细参考解答,确保该基准可用于高质量的LLM评估;第三,我们评估了14个LLM,并采用链式思维、程序式思维等不同提示策略。当前表现最好的系统(即采用程序式思维的GPT-4)准确率仅为45.4%,仍有很大提升空间。知识增强虽然能够提高性能(例如GPT-3.5从23.9%提升至32.0%),但仍远低于约94%的人类专家水平。我们相信KnowledgeMath能够促进未来在数学应用题求解过程中引入领域知识检索与增强的研究。我们将在GitHub上发布该基准和代码。

More Samples or More Prompt Inputs? Exploring Effective In-Context Sampling for LLM Few-Shot Prompt Engineering

  • paper_url: http://arxiv.org/abs/2311.09782
  • repo_url: None
  • paper_authors: Bingsheng Yao, Guiming Chen, Ruishi Zou, Yuxuan Lu, Jiachen Li, Shao Zhang, Sijia Liu, James Hendler, Dakuo Wang
  • for: 提高LLM性能和信心度
  • methods: 利用多个ICL提示输入构建多个ICS提示输入,以提高LLM的预测性能和信心度
  • results: 实验结果表明,ICS可以一直提高LLM的预测性能和信心度,而且可以采用多样性基于的策略进一步提高LLM的性能。
    Abstract While most existing works on LLM prompt-engineering focus only on how to select a better set of data samples inside one single prompt input (In-Context Learning or ICL), why can't we design and leverage multiple prompt inputs together to further improve the LLM performance? In this work, we propose In-Context Sampling (ICS), a low-resource LLM prompt-engineering technique to produce the most confident prediction results by optimizing the construction of multiple ICL prompt inputs. Extensive experiments with two SOTA LLMs (FlanT5-XL and Mistral-7B) on three NLI datasets (e-SNLI, Multi-NLI, and ANLI) illustrate that ICS can consistently enhance LLM's prediction performance and confidence. An ablation study suggests that a diversity-based ICS strategy may further improve LLM's performance, which sheds light on a new yet promising future research direction.
    摘要 现有的LLM提示工程研究大多只关注如何在单个提示输入内部选择更好的示例集合(即上下文学习,ICL),那么能否同时设计并利用多个提示输入来进一步提升LLM的表现?在本工作中,我们提出了上下文内采样(In-Context Sampling,ICS),一种低资源的LLM提示工程技术,通过优化多个ICL提示输入的构建来得到最有信心的预测结果。在e-SNLI、Multi-NLI和ANLI三个NLI数据集上,对FlanT5-XL和Mistral-7B两个SOTA LLM的大量实验表明,ICS能够持续提升LLM的预测性能和置信度。消融实验进一步表明,基于多样性的ICS策略可能带来更大的提升,这为未来研究指出了一个新的、有前景的方向。
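A minimal sketch of the ICS idea, construct several ICL prompt inputs from sampled demonstrations and aggregate the predictions, is given below; the `classify` stub stands in for an LLM call, and plain random sampling replaces the paper's diversity-based strategies.

```python
import random
from collections import Counter

def classify(prompt: str) -> str:
    """Hypothetical LLM call; here it just guesses so the sketch stays runnable."""
    return random.choice(["entailment", "neutral", "contradiction"])

def build_prompt(demos, premise, hypothesis):
    shots = "\n".join(f"Premise: {p}\nHypothesis: {h}\nLabel: {y}" for p, h, y in demos)
    return f"{shots}\nPremise: {premise}\nHypothesis: {hypothesis}\nLabel:"

def in_context_sampling(pool, premise, hypothesis, n_prompts=5, shots_per_prompt=3):
    votes = Counter()
    for _ in range(n_prompts):
        demos = random.sample(pool, shots_per_prompt)   # one sampled ICL prompt input
        votes[classify(build_prompt(demos, premise, hypothesis))] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_prompts                     # prediction + agreement as a confidence proxy

pool = [("A man plays guitar.", "A person makes music.", "entailment"),
        ("A dog sleeps.", "A cat runs.", "neutral"),
        ("It is raining.", "The ground is dry.", "contradiction"),
        ("Kids play soccer.", "Children are outside.", "entailment")]
print(in_context_sampling(pool, "A woman reads a book.", "Someone is reading."))
```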

To be or not to be? an exploration of continuously controllable prompt engineering

  • paper_url: http://arxiv.org/abs/2311.09773
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Yuhan Sun, Mukai Li, Yixin Cao, Kun Wang, Wenxiao Wang, Xingyu Zeng, Rui Zhao
  • for: 这篇论文旨在提供一种能够精确控制 Language Model 的问题提示(Prompt)影响,以便更好地自定义模型和处理其输出。
  • methods: 本文使用的方法包括 LoRA(Low-Rank Adaptation)和特定的问题提示蒸馏(Prompt distillation),以实现问题提示的精确控制。
  • results: 本文的实验结果显示,ControlPE 可以实现精确控制不同类型的问题提示(包括短回答问题、拒绝问题和推理链问题),并且能够在不同的任务上灵活地应用。
    Abstract As the use of large language models becomes more widespread, techniques like parameter-efficient fine-tuning and other methods for controlled generation are gaining traction for customizing models and managing their outputs. However, the challenge of precisely controlling how prompts influence these models is an area ripe for further investigation. In response, we introduce ControlPE (Continuously Controllable Prompt Engineering). ControlPE enables finer adjustments to prompt effects, complementing existing prompt engineering, and effectively controls continuous targets. This approach harnesses the power of LoRA (Low-Rank Adaptation) to create an effect akin to prompt weighting, enabling fine-tuned adjustments to the impact of prompts. Our methodology involves generating specialized datasets for prompt distillation, incorporating these prompts into the LoRA model, and carefully adjusting LoRA merging weight to regulate the influence of prompts. This provides a dynamic and adaptable tool for prompt control. Through our experiments, we have validated the practicality and efficacy of ControlPE. It proves to be a promising solution for control a variety of prompts, ranging from generating short responses prompts, refusal prompts to chain-of-thought prompts.
    摘要 随着大语言模型的使用日益普及,参数高效微调等受控生成技术正被越来越多地用于定制模型并管理其输出。然而,如何精确控制提示对这些模型的影响,仍是一个值得深入研究的问题。为此,我们提出ControlPE(连续可控提示工程)。ControlPE能够对提示效果进行更精细的调节,是对现有提示工程的补充,并可有效控制连续目标。该方法借助LoRA(低秩适配)实现类似提示加权的效果,从而对提示的影响进行细致调整。我们的做法是:为提示蒸馏生成专门的数据集,将这些提示蒸馏进LoRA模型,并精心调整LoRA的合并权重以调节提示的影响力,由此得到一个动态、可适配的提示控制工具。实验验证了ControlPE的实用性与有效性,它在控制各类提示(从生成简短回复的提示、拒绝提示到思维链提示)方面都是一个有前景的方案。
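The continuous control in ControlPE comes from scaling the merging weight of a LoRA adapter distilled from a prompt. A toy numpy sketch of that merging step follows; the dimensions and the alpha values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

W = rng.normal(size=(d_out, d_in))            # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.1       # LoRA down-projection (distilled from a prompt)
B = rng.normal(size=(d_out, rank)) * 0.1      # LoRA up-projection

def merge(alpha: float) -> np.ndarray:
    """alpha continuously scales how strongly the distilled prompt affects the model."""
    return W + alpha * (B @ A)

x = rng.normal(size=(d_in,))
for alpha in (0.0, 0.5, 1.0):                  # 0 = prompt effect off, 1 = full prompt effect
    print(f"alpha={alpha}: first logits {np.round(merge(alpha) @ x, 3)[:3]}")
```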

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

  • paper_url: http://arxiv.org/abs/2311.09766
  • repo_url: None
  • paper_authors: Yiqi Liu, Nafise Sadat Moosavi, Chenghua Lin
  • for: This paper aims to investigate the potential bias of language model-driven evaluation metrics in the context of summarization tasks.
  • methods: The paper uses three popular language models (BART, T5, and GPT) to evaluate the quality of summaries generated by these models.
  • results: The paper finds that the evaluation metrics demonstrate a bias towards the underlying language models, particularly when used in a reference-free manner without gold summaries.
    Abstract Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics--namely, BARTScore, T5Score, and GPTScore--demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries. These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality, highlighting the necessity of developing more dependable evaluation protocols in the future.
    摘要 自动评估生成文本内容是NLP领域一个长期存在的挑战。鉴于现代语言模型(LM)在多种NLP任务上的出色能力,越来越多的工作开始利用这些模型构建新的评估指标,用于自动评估生成任务。本文探讨一个关键问题:基于语言模型的评估指标是否天然偏向由同一底层语言模型生成的文本?具体而言,我们在摘要任务中考察了BARTScore、T5Score和GPTScore这三种主流的基于LM的评估指标是否偏向各自的底层模型。我们的发现揭示了一种潜在偏差,在不借助黄金摘要、以无参考方式使用这些指标时尤为明显。这些结果表明,生成式评估模型给出的评估可能受到文本内在质量之外的因素影响,凸显了未来开发更可靠评估协议的必要性。

Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations

  • paper_url: http://arxiv.org/abs/2311.09763
  • repo_url: None
  • paper_authors: Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Chaowei Xiao, Muhao Chen
  • for: This paper focuses on defending against backdoor attacks in large language models (LLMs) during the testing phase, which has been overlooked in previous studies that primarily focus on training-time defenses.
  • methods: The proposed method, called defensive demonstrations, involves identifying the task and retrieving task-relevant demonstrations from an uncontaminated pool. These demonstrations are combined with user queries and presented to the model during testing, without requiring any modifications or tuning to the black-box model.
  • results: The paper shows that defensive demonstrations are effective in defending against both instance-level and instruction-level backdoor attacks, not only rectifying the behavior of poisoned models but also surpassing existing baselines in most scenarios.
    Abstract Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. This gap becomes particularly pronounced in the context of Large Language Models (LLMs) deployed as Web Services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, our work introduces defensive demonstrations, an innovative backdoor defense strategy for blackbox large language models. Our method involves identifying the task and retrieving task-relevant demonstrations from an uncontaminated pool. These demonstrations are then combined with user queries and presented to the model during testing, without requiring any modifications/tuning to the black-box model or insights into its internal mechanisms. Defensive demonstrations are designed to counteract the adverse effects of triggers, aiming to recalibrate and correct the behavior of poisoned models during test-time evaluations. Extensive experiments show that defensive demonstrations are effective in defending both instance-level and instruction-level backdoor attacks, not only rectifying the behavior of poisoned models but also surpassing existing baselines in most scenarios.
    摘要 现有的后门防御研究大多集中在训练阶段,而忽视了测试阶段防御这一关键环节。对于以Web服务形式部署、通常只提供黑盒访问的大语言模型(LLM)而言,这一缺口尤为突出,因为训练阶段的防御在此场景下并不可行。为弥补这一缺口,我们提出了防御性示例(defensive demonstrations),一种面向黑盒大语言模型的新型后门防御策略。该方法首先识别任务类型,并从未被污染的示例池中检索与任务相关的示例;在测试时,将这些示例与用户查询组合后一并输入模型,无需对黑盒模型做任何修改或调参,也无需了解其内部机制。防御性示例旨在抵消触发器带来的不良影响,在测试阶段重新校正并纠正中毒模型的行为。大量实验表明,防御性示例能够有效防御实例级和指令级后门攻击,不仅能纠正中毒模型的行为,还在多数场景下超过了现有基线。
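Defensive demonstrations only require retrieving task-relevant examples from a clean pool and prepending them to the user query at test time. A minimal retrieval sketch is below; word-overlap similarity stands in for whatever retriever is actually used.

```python
def similarity(a: str, b: str) -> float:
    """Crude word-overlap score standing in for a real retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def defensive_prompt(query: str, clean_pool, k: int = 2) -> str:
    """Prepend the k most relevant uncontaminated demonstrations to the user query."""
    ranked = sorted(clean_pool, key=lambda ex: similarity(query, ex["input"]), reverse=True)
    demos = "\n".join(f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in ranked[:k])
    return f"{demos}\nInput: {query}\nOutput:"

clean_pool = [
    {"input": "The movie was wonderful and moving.", "output": "positive"},
    {"input": "Terrible plot, I want my money back.", "output": "negative"},
    {"input": "The acting was fine but the pacing dragged.", "output": "negative"},
]
print(defensive_prompt("The movie was slow but the acting was wonderful.", clean_pool))
```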

OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2311.09758
  • repo_url: None
  • paper_authors: Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf
  • for: 提高自然语言处理系统的计算效率,使用小语言模型(SLM)作为cost-effective的替代方案。
  • methods: 基于SLM与大语言模型(LLM)在结构化知识抽取任务中优势互补的发现,提出了一种SLM/LLM路由框架:检索与测试实例最接近的示例,并按多数投票决定路由,在提高计算效率的同时提升任务性能。
  • results: 在对话状态追踪任务中,所提出的路由框架显著提升了性能,同时将计算成本降低了50%以上。
    Abstract Large language models (LLMs) have revolutionized the landscape of Natural Language Processing systems, but are computationally expensive. To reduce the cost without sacrificing performance, previous studies have explored various approaches to harness the potential of Small Language Models (SLMs) as cost-effective alternatives to their larger counterparts. Driven by findings that SLMs and LLMs exhibit complementary strengths in a structured knowledge extraction task, this work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance. First, exemplar pools are created to represent the types of contexts where each LM provides a more reliable answer, leveraging a sentence embedding fine-tuned so that context similarity is close to dialogue state similarity. Then, during inference, the k-nearest exemplars to the testing instance are retrieved, and the instance is routed according to majority vote. In dialogue state tracking tasks, the proposed routing framework enhances performance substantially compared to relying solely on LLMs, while reducing the computational costs by over 50%.
    摘要 大语言模型(LLM)彻底改变了自然语言处理系统的格局,但其计算开销高昂。为了在不牺牲性能的前提下降低成本,已有研究探索了将小语言模型(SLM)作为大模型的高性价比替代方案。基于SLM与LLM在结构化知识抽取任务中优势互补这一发现,本文提出了一种新的SLM/LLM路由框架,旨在提升计算效率并提高任务性能。首先,利用经过微调、使上下文相似度贴近对话状态相似度的句子嵌入,构建示例池,以刻画各语言模型更可靠的上下文类型;随后在推理阶段,检索与测试实例最接近的k个示例,并按多数投票决定将实例路由给哪个模型。在对话状态追踪任务中,该路由框架相比仅依赖LLM显著提升了性能,同时将计算成本降低了50%以上。
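The routing step retrieves the k nearest exemplars to a test instance and dispatches it to the SLM or the LLM by majority vote. A toy sketch with random numpy vectors standing in for the fine-tuned sentence embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exemplar pools: embeddings of contexts where each model was the more reliable one.
slm_pool = rng.normal(loc=+1.0, size=(20, 16))
llm_pool = rng.normal(loc=-1.0, size=(20, 16))
exemplars = np.vstack([slm_pool, llm_pool])
labels = np.array(["SLM"] * len(slm_pool) + ["LLM"] * len(llm_pool))

def route(test_embedding: np.ndarray, k: int = 5) -> str:
    """Send the instance to whichever model wins the k-nearest-exemplar vote."""
    dists = np.linalg.norm(exemplars - test_embedding, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return "SLM" if (nearest == "SLM").sum() > k // 2 else "LLM"

print(route(rng.normal(loc=+1.0, size=16)))   # likely routed to the cheap SLM
print(route(rng.normal(loc=-1.0, size=16)))   # likely routed to the LLM
```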

FairytaleCQA: Integrating a Commonsense Knowledge Graph into Children’s Storybook Narratives

  • paper_url: http://arxiv.org/abs/2311.09756
  • repo_url: None
  • paper_authors: Jiaju Chen, Yuxuan Lu, Shao Zhang, Bingsheng Yao, Yuanzhe Dong, Ying Xu, Yunyao Li, Qianwen Wang, Dakuo Wang, Yuling Sun
  • for: 这个论文旨在提供适用于下游儿童教育应用的自定义问答功能,用于补充现有的故事书内容。
  • methods: 由儿童教育专家进行标注,并借助外部常识知识图(ConceptNet)为278篇故事书叙事补充教育上适宜的常识知识,构建包含5,868个QA对的数据集。
  • results: 对比较大的LLM模型(GPT-4),一个较小的T5-large模型在新的问答对组成任务(QAG)中表现出色,表明:1)我们的数据集对现有LLM模型带来了新的挑战,2)人工专家的数据注释仍然是关键,因为它们在儿童教育领域具有丰富的细节知识。
    Abstract AI models (including LLM) often rely on narrative question-answering (QA) datasets to provide customized QA functionalities to support downstream children education applications; however, existing datasets only include QA pairs that are grounded within the given storybook content, but children can learn more when teachers refer the storybook content to real-world knowledge (e.g., commonsense knowledge). We introduce the FairytaleCQA dataset, which is annotated by children education experts, to supplement 278 storybook narratives with educationally appropriate commonsense knowledge. The dataset has 5,868 QA pairs that not only originate from the storybook narrative but also contain the commonsense knowledge grounded by an external knowledge graph (i.e., ConceptNet). A follow-up experiment shows that a smaller model (T5-large) fine-tuned with FairytaleCQA reliably outperforms much larger prompt-engineered LLM (e.g., GPT-4) in this new QA-pair generation task (QAG). This result suggests that: 1) our dataset brings novel challenges to existing LLMs, and 2) human experts' data annotation are still critical as they have much nuanced knowledge that LLMs do not know in the children educational domain.
    摘要 人工智能模型(包括LLM)经常利用叙事问答(QA)数据集来为下游儿童教育应用提供定制的问答功能;然而,现有数据集只包含基于给定故事书内容的QA对,而当教师将故事书内容与现实世界知识(如常识知识)联系起来时,孩子们能够学到更多。我们介绍由儿童教育专家标注的FairytaleCQA数据集,为278篇故事书叙事补充了教育上适宜的常识知识。该数据集包含5,868个QA对,这些问答不仅来源于故事书叙事,还包含由外部知识图(即ConceptNet)支撑的常识知识。后续实验表明,在FairytaleCQA上微调的较小模型(T5-large)在新的问答对生成任务(QAG)中稳定地优于规模大得多、经过提示工程的LLM(如GPT-4)。这一结果表明:1)我们的数据集为现有LLM带来了新的挑战;2)人类专家的数据标注仍然至关重要,因为他们掌握着LLM在儿童教育领域所不具备的大量细致知识。

How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?

  • paper_url: http://arxiv.org/abs/2311.09755
  • repo_url: None
  • paper_authors: Miles Williams, Nikolaos Aletras
  • for: 本文研究校准数据对大语言模型(LLM)训练后剪枝与量化的影响,即这些压缩技术的效果在多大程度上取决于校准数据。
  • methods: 本文试验了多种剪枝和量化方法,并覆盖不同的任务、模型和数据集。
  • results: 研究发现,使用不同的校准数据会导致下游任务性能出现显著差异,这与现有研究所暗示的鲁棒性相矛盾;文中还就如何有效使用校准数据给出了一系列建议。
    Abstract Pruning and quantization form the foundation of model compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated state-of-the-art performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples, to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of pruning and quantization methods, tasks, models, and datasets. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.
    摘要 剪枝与量化是神经网络模型压缩的基础,使大语言模型(LLM)能够高效推理。近来,多种量化和剪枝技术在训练后(post-training)设置中取得了最先进的表现。这些方法依赖校准数据——一小批无标注样本——来生成各层激活。然而,此前没有工作系统地研究校准数据会如何影响模型压缩方法的效果。本文首次就校准数据对LLM性能的影响开展了大规模实证研究,试验了多种剪枝和量化方法、任务、模型与数据集。令人意外的是,我们发现下游任务性能存在显著差异,这与现有工作所暗示的对校准数据较强的鲁棒性相矛盾。最后,我们就如何在LLM量化和剪枝中有效使用校准数据提出了一系列建议。
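Post-training pruning methods of the kind studied here score weights using activations collected on a small calibration set. The toy numpy sketch below uses one common scoring rule (weight magnitude times calibration activation norm); this criterion is an assumption for illustration, not the paper's contribution. Swapping in a different calibration batch changes the activation norms, and hence which weights survive, which is exactly the sensitivity the paper measures.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 6))                 # a linear layer's weight matrix
calib_X = rng.normal(size=(32, 6))          # activations collected on a small calibration set

def prune_with_calibration(W, calib_X, sparsity=0.5):
    """Zero out the lowest-scoring weights; score = |w_ij| * ||x_j|| over calibration inputs."""
    col_norms = np.linalg.norm(calib_X, axis=0)          # per-input-feature activation norm
    scores = np.abs(W) * col_norms                       # importance of each weight
    threshold = np.quantile(scores, sparsity)
    return np.where(scores >= threshold, W, 0.0)

W_pruned = prune_with_calibration(W, calib_X)
print("kept fraction:", (W_pruned != 0).mean())
```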

Translation Aligned Sentence Embeddings for Turkish Language

  • paper_url: http://arxiv.org/abs/2311.09748
  • repo_url: None
  • paper_authors: Eren Unlu, Unver Ciftci
  • for: 提高 sentence embedding 模型在 Turkish 语言上的表现
  • methods: 提出了一种两个阶段的训练方法,其中第一阶段通过对 embedding 空间进行对应的调整,以便在 sentence embedding 设置中使用 pretrained encoder-decoder 模型进行精度的 fine-tuning。
  • results: 通过这种方法,可以在短时间内使用有限的 target 语言数据进行高精度的 fine-tuning,并且可以提高 sentence embedding 模型在 Turkish 语言上的表现。
    Abstract Due to the limited availability of high quality datasets for training sentence embeddings in Turkish, we propose a training methodology and a regimen to develop a sentence embedding model. The central idea is simple but effective: to fine-tune a pretrained encoder-decoder model in two consecutive stages, where the first stage involves aligning the embedding space with translation pairs. Thanks to this alignment, the prowess of the main model can be better projected onto the target language in a sentence embedding setting where it can be fine-tuned with high accuracy in short duration with a limited target language dataset.
    摘要 由于可用于训练土耳其语句子嵌入的高质量数据集有限,我们提出了一种训练方法和流程来构建句子嵌入模型。其核心思想简单而有效:对预训练的编码器-解码器模型进行两个连续阶段的微调,其中第一阶段利用翻译句对对齐嵌入空间。得益于这种对齐,主模型的能力可以更好地投射到目标语言的句子嵌入设置中,从而仅需少量目标语言数据即可在短时间内完成高精度的微调。
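The first training stage aligns the target-language embedding space with translation pairs. One common way to do this, assumed here since the abstract does not spell out the loss, is to minimise the squared distance between the source-side embedding and the target-side embedding of each pair:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_pairs = 8, 100

teacher_en = rng.normal(size=(n_pairs, dim))   # fixed embeddings of source-language sentences
student_tr = teacher_en + rng.normal(size=(n_pairs, dim))  # noisy target-side embeddings

def alignment_loss(student, teacher):
    """Mean squared distance between translation-pair embeddings (stage-1 objective)."""
    return float(np.mean(np.sum((student - teacher) ** 2, axis=1)))

# Gradient steps directly on the embeddings; in practice the encoder's parameters are updated.
lr = 0.2
for step in range(3):
    print(f"step {step}: loss = {alignment_loss(student_tr, teacher_en):.3f}")
    student_tr -= lr * 2 * (student_tr - teacher_en)   # gradient of the per-pair squared distance
print(f"final : loss = {alignment_loss(student_tr, teacher_en):.3f}")
```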

Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks

  • paper_url: http://arxiv.org/abs/2311.09743
  • repo_url: None
  • paper_authors: Negar Mokhberian, Myrl G. Marmarelis, Frederic R. Hopp, Valerio Basile, Fred Morstatter, Kristina Lerman
  • for: 这篇论文的目的是解决对主观分类任务中的多 annotator 问题,因为对于主观任务,可能会有多个真实的标签,导致模型偏向特定的标签。
  • methods: 这篇论文提出了一种新方法——文本的标注者感知表示(Annotator Aware Representations for Texts,AART),通过为每位标注者学习表示并与文本表示结合,分别预测各标注者的标签,从而更好地捕捉标注者的观点。
  • results: 该方法可以提高模型在捕捉 annotators 的看法方面的表现,并且可以避免因为 annotators 的差异而导致的偏向。此外,这个方法还可以学习 annotators 的行为,以便进一步的探索。
    Abstract In most classification models, it has been assumed to have a single ground truth label for each data point. However, subjective tasks like toxicity classification can lead to genuine disagreement among annotators. In these cases aggregating labels will result in biased labeling and, consequently, biased models that can overlook minority opinions. Previous studies have shed light on the pitfalls of label aggregation and have introduced a handful of practical approaches to tackle this issue. Recently proposed multi-annotator models, which predict labels individually per annotator, are vulnerable to under-determination for annotators with small samples. This problem is especially the case in crowd-sourced datasets. In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks. We will show the improvement of our method on metrics that assess the performance on capturing annotators' perspectives. Additionally, our approach involves learning representations for annotators, allowing for an exploration of the captured annotation behaviors.
    摘要 大多数分类模型都假设每个数据点只有一个真实标签。然而,对于毒性分类等主观任务,标注者之间可能存在真实的分歧。在这种情况下,聚合标签会导致有偏的标注,进而得到可能忽视少数意见的有偏模型。已有研究揭示了标签聚合的弊端,并提出了若干实用的应对方法。近期提出的多标注者模型虽然能够针对每位标注者分别预测标签,但对于样本较少的标注者容易欠定,这一问题在众包数据集中尤为突出。在本工作中,我们为主观分类任务提出了文本的标注者感知表示(AART)。我们将展示该方法在衡量捕捉标注者观点能力的指标上的提升。此外,我们的方法还学习标注者的表示,从而可以进一步探索所捕捉到的标注行为。
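AART learns a representation per annotator alongside the text representation so that each annotator's label can be predicted. A toy sketch of such a forward pass follows; adding the annotator embedding to the text embedding before a shared classifier is an assumed combination rule, not necessarily the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_annotators, n_classes = 16, 3, 2

def text_encoder(text: str) -> np.ndarray:
    """Deterministic stand-in for a fine-tuned text encoder."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=dim)

annotator_emb = rng.normal(size=(n_annotators, dim))   # learned per-annotator vectors
W_cls = rng.normal(size=(dim, n_classes))              # shared classifier head

def predict(text: str, annotator_id: int) -> int:
    """Predict how this particular annotator would label the text."""
    h = text_encoder(text) + annotator_emb[annotator_id]   # annotator-aware representation
    logits = h @ W_cls
    return int(np.argmax(logits))

for a in range(n_annotators):
    print(f"annotator {a} ->", predict("this post is borderline offensive", a))
```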

What Constitutes a Faithful Summary? Preserving Author Perspectives in News Summarization

  • paper_url: http://arxiv.org/abs/2311.09741
  • repo_url: https://github.com/lyh6560new/p3sum
  • paper_authors: Yuhan Liu, Shangbin Feng, Xiaochuang Han, Vidhisha Balachandran, Chan Young Park, Sachin Kumar, Yulia Tsvetkov
  • for: 这篇论文的目的是设计一个忠于作者意见和观点的摘要系统。
  • methods: 这篇论文使用了一种叫做P^3Sum的扩散模型基本摘要方法,这个方法使用政治观点分类器控制摘要的政治倾向。
  • results: 实验结果显示,P^3Sum在立场保持成功率上比最先进的摘要系统和大语言模型最高提升11.4%,同时在标准摘要质量指标上保持相当的性能。这些结果表明,即使对最先进的模型而言,在新闻摘要中保持作者的意见与立场仍是一项挑战,而P^3Sum是朝此方向迈出的重要第一步。
    Abstract In this work, we take a first step towards designing summarization systems that are faithful to the author's opinions and perspectives. Focusing on a case study of preserving political perspectives in news summarization, we find that existing approaches alter the political opinions and stances of news articles in more than 50% of summaries, misrepresenting the intent and perspectives of the news authors. We thus propose P^3Sum, a diffusion model-based summarization approach controlled by political perspective classifiers. In P^3Sum, the political leaning of a generated summary is iteratively evaluated at each decoding step, and any drift from the article's original stance incurs a loss back-propagated to the embedding layers, steering the political stance of the summary at inference time. Extensive experiments on three news summarization datasets demonstrate that P^3Sum outperforms state-of-the-art summarization systems and large language models by up to 11.4% in terms of the success rate of stance preservation, with on-par performance on standard summarization utility metrics. These findings highlight the lacunae that even for state-of-the-art models it is still challenging to preserve author perspectives in news summarization, while P^3Sum presents an important first step towards evaluating and developing summarization systems that are faithful to author intent and perspectives.
    摘要 在这项工作中,我们朝着设计忠实于作者意见与立场的摘要系统迈出了第一步。以新闻摘要中政治立场的保持为案例,我们发现现有方法在超过50%的摘要中改变了新闻文章的政治观点和立场,歪曲了新闻作者的意图与视角。为此,我们提出P^3Sum,一种由政治立场分类器控制的、基于扩散模型的摘要方法。在P^3Sum中,生成摘要的政治倾向会在每个解码步骤被迭代评估,一旦偏离原文立场,就会产生损失并反向传播至嵌入层,从而在推理阶段校正摘要的政治立场。在三个新闻摘要数据集上的大量实验表明,P^3Sum在立场保持成功率上比最先进的摘要系统和大语言模型最高提升11.4%,同时在标准摘要质量指标上保持相当的性能。这些发现表明,即使对最先进的模型而言,在新闻摘要中保持作者立场仍然困难,而P^3Sum是朝着评估和开发忠实于作者意图与立场的摘要系统迈出的重要一步。

CARE: Extracting Experimental Findings From Clinical Literature

  • paper_url: http://arxiv.org/abs/2311.09736
  • repo_url: None
  • paper_authors: Aakanksha Naik, Bailey Kuehl, Erin Bransom, Doug Downey, Tom Hope
  • for: 本研究旨在提供一个新的信息抽取 dataset,用于抽取生物医学文献中的临床发现。
  • methods: 本研究设计了一套新的标注规范,将细粒度的发现表示为实体与属性之间的n元关系,涵盖不连续实体片段、嵌套关系和可变元数n元关系等困难现象。
  • results: 研究对来自临床试验和病例报告两个来源的700篇摘要进行了大规模标注,并测试了多种现有IE系统的性能。结果表明,即使是GPT-4等SOTA模型,在该数据集上进行关系抽取也十分困难。
    Abstract Extracting fine-grained experimental findings from literature can provide massive utility for scientific applications. Prior work has focused on developing annotation schemas and datasets for limited aspects of this problem, leading to simpler information extraction datasets which do not capture the real-world complexity and nuance required for this task. Focusing on biomedicine, this work presents CARE (Clinical Aggregation-oriented Result Extraction) -- a new IE dataset for the task of extracting clinical findings. We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes, which includes phenomena challenging for current IE systems such as discontinuous entity spans, nested relations, and variable arity n-ary relations. Using this schema, we collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports. We also benchmark the performance of various state-of-the-art IE systems on our dataset, including extractive models and generative LLMs in fully supervised and limited data settings. Our results demonstrate the difficulty of our dataset -- even SOTA models such as GPT4 struggle, particularly on relation extraction. We release our annotation schema and CARE to encourage further research on extracting and aggregating scientific findings from literature.
    摘要 从文献中抽取细粒度的实验发现能够为科学应用带来巨大价值。以往工作主要针对该问题的有限方面构建标注规范和数据集,得到的信息抽取数据集较为简单,无法体现该任务在真实场景中所需的复杂性与细微差别。本文聚焦生物医学领域,提出CARE(Clinical Aggregation-oriented Result Extraction)——一个面向临床发现抽取任务的新信息抽取数据集。我们设计了新的标注规范,将细粒度的发现表示为实体与属性之间的n元关系,其中包括对现有IE系统颇具挑战的现象,如不连续的实体片段、嵌套关系以及可变元数的n元关系。基于该规范,我们对来自临床试验和病例报告两个来源的700篇摘要进行了大规模标注。我们还在该数据集上评测了多种最先进的IE系统,包括抽取式模型以及在全监督和少量数据设置下的生成式LLM。结果表明我们的数据集难度很高——即便是GPT-4等SOTA模型也表现吃力,在关系抽取上尤为明显。我们公开了标注规范和CARE数据集,以促进从文献中抽取并汇聚科学发现的后续研究。
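The schema treats a clinical finding as an n-ary relation between entities and attributes, with possibly discontinuous spans. A minimal sketch of how such a record could be represented is below; the field names are illustrative, not the released annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Span:
    # A possibly discontinuous mention: a list of (start, end) character offsets.
    fragments: List[Tuple[int, int]]
    text: str

@dataclass
class Finding:
    """One experimental finding as an n-ary relation of variable arity."""
    intervention: Span
    outcome: Span
    attributes: dict = field(default_factory=dict)   # e.g. effect direction, magnitude, population

abstract = "Drug X reduced systolic blood pressure by 12 mmHg (p < 0.01) in adults with hypertension."
finding = Finding(
    intervention=Span([(0, 6)], "Drug X"),
    outcome=Span([(15, 38)], "systolic blood pressure"),
    attributes={"direction": "reduced", "magnitude": "12 mmHg", "p_value": "< 0.01",
                "population": "adults with hypertension"},
)
print(finding.attributes["direction"], finding.outcome.text)
```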

Tracking the Newsworthiness of Public Documents

  • paper_url: http://arxiv.org/abs/2311.09734
  • repo_url: None
  • paper_authors: Alexander Spangher, Emilio Ferrara, Ben Welsh, Nanyun Peng, Serdar Tumgoren, Jonathan May
  • for: This paper focuses on local public policy coverage in the San Francisco Bay Area by the San Francisco Chronicle.
  • methods: The paper uses probabilistic relational modeling to link news articles, public policy documents, and meeting recordings.
  • results: The paper shows that different aspects of public policy discussion yield different newsworthiness signals; their systems identify policies considered newsworthy with 68% F1, and their coverage recommendations are helpful with an 84% win-rate.
    Abstract Journalists must find stories in huge amounts of textual data (e.g. leaks, bills, press releases) as part of their jobs: determining when and why text becomes news can help us understand coverage patterns and help us build assistive tools. Yet, this is challenging because very few labelled links exist, language use between corpora is very different, and text may be covered for a variety of reasons. In this work we focus on news coverage of local public policy in the San Francisco Bay Area by the San Francisco Chronicle. First, we gather news articles, public policy documents and meeting recordings and link them using probabilistic relational modeling, which we show is a low-annotation linking methodology that outperforms other retrieval-based baselines. Second, we define a new task: newsworthiness prediction, to predict if a policy item will get covered. We show that different aspects of public policy discussion yield different newsworthiness signals. Finally we perform human evaluation with expert journalists and show our systems identify policies they consider newsworthy with 68% F1 and our coverage recommendations are helpful with an 84% win-rate.
    摘要 记者的工作需要在海量文本数据(如泄露文件、法案、新闻稿)中寻找报道线索:弄清文本在何时、因何成为新闻,有助于我们理解报道模式并构建辅助工具。然而,这项研究颇具挑战:有标注的关联极少,不同语料之间的语言使用差异很大,而且文本可能出于各种原因被报道。本文聚焦《旧金山纪事报》对旧金山湾区地方公共政策的报道。首先,我们收集新闻文章、公共政策文件和会议记录,并使用概率关系建模将其关联起来;我们证明这是一种低标注成本的关联方法,优于其他基于检索的基线。其次,我们定义了一个新任务——新闻价值预测,用于预测某一政策议题是否会被报道。我们发现公共政策讨论的不同侧面会产生不同的新闻价值信号。最后,我们邀请资深记者进行人工评估,结果显示我们的系统识别记者认为具有新闻价值的政策的F1达68%,而报道建议的有用性胜率达84%。

MOKA: Moral Knowledge Augmentation for Moral Event Extraction

  • paper_url: http://arxiv.org/abs/2311.09733
  • repo_url: https://github.com/launchnlp/MOKA
  • paper_authors: Xinliang Frederick Zhang, Winston Wu, Nick Beauchamp, Lu Wang
  • for: This paper is written for studying the phenomenon of moral language in news media and the dynamics of moral events in shaping news content.
  • methods: The paper uses a new dataset called MORAL EVENTS, which consists of 5,494 structured annotations on 474 news articles from diverse US media outlets. The authors also propose a moral event extraction framework called MOKA, which leverages knowledge derived from moral words and moral scenarios.
  • results: The experimental results show that MOKA outperforms competitive baselines across three moral event understanding tasks. Additionally, the authors find that media outlets of different ideological leanings selectively report moral events, highlighting the significance of event-level morality analysis in news.
    Abstract News media employ moral language to create memorable stories, and readers often engage with the content that align with their values. Moral theories have been applied to news analysis studying moral values in isolation, while the intricate dynamics among participating entities in shaping moral events have been overlooked. This is mainly due to the use of obscure language to conceal evident ideology and values, coupled with the insufficient moral reasoning capability in most existing NLP systems, where LLMs are no exception. To study this phenomenon, we first annotate a new dataset, MORAL EVENTS, consisting of 5,494 structured annotations on 474 news articles by diverse US media across the political spectrum. We further propose MOKA, a moral event extraction framework with MOral Knowledge Augmentation, that leverages knowledge derived from moral words and moral scenarios. Experimental results show that MOKA outperforms competitive baselines across three moral event understanding tasks. Further analyses illuminate the selective reporting of moral events by media outlets of different ideological leanings, suggesting the significance of event-level morality analysis in news. Our datasets and codebase are available at https://github.com/launchnlp/MOKA.
    摘要 新闻媒体使用道德语言来打造令人印象深刻的报道,读者也往往更愿意接触与自身价值观契合的内容。道德理论已被用于在孤立层面分析新闻中的道德价值,但各参与主体在塑造道德事件过程中的复杂互动却被忽视。这主要是因为媒体常用隐晦的语言来掩饰明显的意识形态和价值取向,而现有NLP系统(LLM也不例外)的道德推理能力仍然不足。为研究这一现象,我们首先标注了新数据集MORAL EVENTS,其包含来自美国政治光谱各端的多家媒体的474篇新闻文章上的5,494条结构化标注。我们进一步提出MOKA,一个以道德知识增强为特色的道德事件抽取框架,利用源自道德词汇和道德情景的知识。实验结果表明,MOKA在三个道德事件理解任务上均优于有竞争力的基线。进一步分析揭示了不同意识形态倾向的媒体对道德事件的选择性报道,说明了在新闻中进行事件级道德分析的重要意义。我们的数据集和代码库见 https://github.com/launchnlp/MOKA。

On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering

  • paper_url: http://arxiv.org/abs/2311.09721
  • repo_url: None
  • paper_authors: Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, Arman Cohan
  • for: 这项研究旨在评估大语言模型(LLM)在数据库问答任务中的表现,并研究LLM如何使用多条SQL查询获取数据库中的数据、进行上下文推理,并将其综合成一份完整的分析报告。
  • methods: 本研究使用了一种新的长形数据库问答数据集,并提出了两种互动策略来解决问题。我们还进行了细腻的分析,探讨了不同阶段的互动过程中的瓶颈。
  • results: 我们的研究发现,当前最先进的GPT-4模型在这一任务中存在两大瓶颈:规划能力和生成多条SQL查询的能力。我们还引入了一个多智能体评估框架,以便更准确地评估答案质量。该框架使我们能够更好地理解当前LLM在复杂检索与推理任务中的优势与不足。
    Abstract This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task necessitates LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings highlight that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction. A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
    摘要 本研究提出了一个新的长文本数据库问答数据集,用于评估大语言模型(LLM)与SQL解释器交互的能力。该任务要求LLM有策略地生成多条SQL查询,从数据库中检索足够的数据,对获取的上下文进行推理,并将其综合成一份完整的分析叙述。我们的研究结果表明,即便是目前最先进的GPT-4模型,这一任务也极具挑战性。我们提出并评估了两种交互策略,并对交互过程中的各个阶段进行了细粒度分析。一个关键发现是交互效果受制于两大瓶颈:规划能力以及生成多条SQL查询的能力。为了更准确地评估答案质量,我们引入了一个模拟学术同行评审过程的多智能体评估框架,从而提升评估的精度与可靠性。该框架使我们能够更细致地理解当前LLM在复杂检索与推理任务中的优势与局限。
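The benchmark requires the model to plan several SQL queries, reason over the returned rows, and synthesise an analysis. A stripped-down sketch of that interaction loop over an in-memory SQLite database follows; `plan_queries` is a hypothetical stub, whereas in the benchmark an LLM proposes each query and writes the final narrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(region TEXT, year INTEGER, revenue REAL);
    INSERT INTO sales VALUES ('EMEA', 2022, 120.0), ('EMEA', 2023, 150.0),
                             ('APAC', 2022,  80.0), ('APAC', 2023,  95.0);
""")

def plan_queries(question: str):
    """Hypothetical planner; in the real setting an LLM generates these step by step."""
    return ["SELECT region, SUM(revenue) FROM sales WHERE year = 2022 GROUP BY region",
            "SELECT region, SUM(revenue) FROM sales WHERE year = 2023 GROUP BY region"]

def answer(question: str) -> str:
    evidence = []
    for sql in plan_queries(question):               # multiple SQL calls gather the needed context
        rows = conn.execute(sql).fetchall()
        evidence.append((sql, rows))
    # Reasoning/synthesis step (done by the LLM in the benchmark; hard-coded here).
    y22, y23 = dict(evidence[0][1]), dict(evidence[1][1])
    growth = {r: (y23[r] - y22[r]) / y22[r] for r in y22}
    return "Revenue growth by region: " + ", ".join(f"{r}: {g:.0%}" for r, g in sorted(growth.items()))

print(answer("How did revenue grow from 2022 to 2023 across regions?"))
```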

Regularized Conventions: Equilibrium Computation as a Model of Pragmatic Reasoning

  • paper_url: http://arxiv.org/abs/2311.09712
  • repo_url: None
  • paper_authors: Athul Paul Jacob, Gabriele Farina, Jacob Andreas
  • for: 这篇论文旨在描述一种语言理解模型,即通过搜索信号游戏的 equilibria来生成和理解语言表达。
  • methods: 该模型使用搜索 equilibria来模拟语言交流中的信号游戏,并通过定制化的搜索策略来找到最佳的语言表达。
  • results: 在实验中,该模型的预测能够达到或超过现有的最佳响应模型和理性言语行为模型,并且可以为语言交流中沟通成功率与自然度之间的权衡提供形式化保证。
    Abstract We present a model of pragmatic language understanding, where utterances are produced and understood by searching for regularized equilibria of signaling games. In this model (which we call ReCo, for Regularized Conventions), speakers and listeners search for contextually appropriate utterance--meaning mappings that are both close to game-theoretically optimal conventions and close to a shared, ''default'' semantics. By characterizing pragmatic communication as equilibrium search, we obtain principled sampling algorithms and formal guarantees about the trade-off between communicative success and naturalness. Across several datasets capturing real and idealized human judgments about pragmatic implicatures, ReCo matches or improves upon predictions made by best response and rational speech act models of language understanding.
    摘要 我们提出了一种语用层面的语言理解模型,其中话语的生成与理解都通过搜索信号博弈的正则化均衡来完成。在这个我们称为ReCo(Regularized Conventions,正则化惯例)的模型中,说话者和听话者搜索与语境相适应的"话语—含义"映射,这些映射既接近博弈论意义上最优的惯例,又接近一个共享的"默认"语义。通过将语用交流刻画为均衡搜索,我们获得了有原则的采样算法,以及关于交流成功率与自然度之间权衡的形式化保证。在多个刻画真实与理想化人类语用推断判断的数据集上,ReCo的预测达到或超过了最佳响应模型和理性言语行为模型。

Large Language Model Inference with Lexical Shortlisting

  • paper_url: http://arxiv.org/abs/2311.09709
  • repo_url: None
  • paper_authors: Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch
  • for: 本研究旨在通过将词表筛选(lexical shortlisting)引入大语言模型(LLM)推理,提升其推理速度并降低计算资源消耗。
  • methods: 本研究使用基于Unicode字符集的文字系统过滤和基于语料的选择两种方法来缩减子词表。
  • results: 研究发现,词表筛选可使部分模型的内存占用减少近50%,生成速度的提升上限约为25%。此外,研究还指出了此类词表选择方法的不足,并提出了未来研究方向。
    Abstract Large language model (LLM) inference is computation and memory intensive, so we adapt lexical shortlisting to it hoping to improve both. While lexical shortlisting is well-explored in tasks like machine translation, it requires modifications before being suitable for LLMs as the intended applications vary significantly. Our work studies two heuristics to shortlist sub-vocabulary at LLM inference time: Unicode-based script filtering and corpus-based selection. We explore different LLM families and sizes, and we find that lexical shortlisting can reduce the memory usage of some models by nearly 50\% and has an upper bound of 25\% improvement in generation speed. In this pilot study, we also identify the drawbacks of such vocabulary selection methods and propose avenues for future research.
    摘要 大语言模型(LLM)的推理计算量大、内存占用高,因此我们将词表筛选(lexical shortlisting)引入LLM推理,希望同时改善这两方面。虽然词表筛选在机器翻译等任务中已有充分研究,但由于应用场景差异很大,需要经过改造才能适用于LLM。本文研究了两种在LLM推理时筛选子词表的启发式方法:基于Unicode文字系统的过滤和基于语料的选择。我们考察了不同家族、不同规模的LLM,发现词表筛选可使部分模型的内存占用减少近50%,生成速度的提升上限约为25%。在这项先导研究中,我们还指出了此类词表选择方法的不足,并提出了未来研究方向。
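Lexical shortlisting restricts decoding to a sub-vocabulary chosen by Unicode script or by corpus statistics, masking out the rest of the output layer. A toy sketch of both heuristics is below; the vocabulary, corpus, and script-detection shortcut are all illustrative simplifications.

```python
import unicodedata
from collections import Counter

vocab = ["the", "dog", "chien", "σκύλος", "собака", "犬", "runs", "court"]

def script_of(token: str) -> str:
    """Rough Unicode-script bucket via the character name prefix (a simplification)."""
    first = next((c for c in token if c.isalpha()), "a")
    return unicodedata.name(first).split()[0]        # e.g. 'LATIN', 'GREEK', 'CYRILLIC', 'CJK'

def script_shortlist(vocab, allowed={"LATIN"}):
    """Keep only tokens written in an allowed script."""
    return [t for t in vocab if script_of(t) in allowed]

def corpus_shortlist(vocab, corpus, top_k=4):
    """Keep the tokens most frequent in a small reference corpus."""
    counts = Counter(tok for line in corpus for tok in line.split())
    return [t for t, _ in counts.most_common(top_k) if t in vocab]

corpus = ["the dog runs", "the dog sleeps", "the court rules"]
print("script-filtered :", script_shortlist(vocab))
print("corpus-selected :", corpus_shortlist(vocab, corpus))
# At inference time, output logits for tokens outside the shortlist would be masked to -inf.
```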

A Self-enhancement Multitask Framework for Unsupervised Aspect Category Detection

  • paper_url: http://arxiv.org/abs/2311.09708
  • repo_url: None
  • paper_authors: Thi-Nhung Nguyen, Hoang Ngo, Kiem-Hieu Nguyen, Tuan-Dung Cao
  • for: addresses the problem of unsupervised Aspect Category Detection using a small set of seed words.
  • methods: proposes a simple framework that automatically enhances the quality of initial seed words and selects high-quality sentences for training, and jointly trains Aspect Category Detection with Aspect Term Extraction and Aspect Term Polarity.
  • results: surpasses strong baselines on standard datasets.
    Abstract Our work addresses the problem of unsupervised Aspect Category Detection using a small set of seed words. Recent works have focused on learning embedding spaces for seed words and sentences to establish similarities between sentences and aspects. However, aspect representations are limited by the quality of initial seed words, and model performances are compromised by noise. To mitigate this limitation, we propose a simple framework that automatically enhances the quality of initial seed words and selects high-quality sentences for training instead of using the entire dataset. Our main concepts are to add a number of seed words to the initial set and to treat the task of noise resolution as a task of augmenting data for a low-resource task. In addition, we jointly train Aspect Category Detection with Aspect Term Extraction and Aspect Term Polarity to further enhance performance. This approach facilitates shared representation learning, allowing Aspect Category Detection to benefit from the additional guidance offered by other tasks. Extensive experiments demonstrate that our framework surpasses strong baselines on standard datasets.
    摘要 我们的工作研究仅用一小组种子词的无监督方面类别检测问题。近期工作侧重于为种子词和句子学习嵌入空间,以建立句子与方面之间的相似度。然而,方面表示受限于初始种子词的质量,模型性能也会受到噪声影响。为缓解这一局限,我们提出了一个简单的框架,能够自动提升初始种子词的质量,并挑选高质量句子用于训练,而非使用整个数据集。我们的核心思想是向初始集合中补充若干种子词,并将噪声消解任务视为低资源任务的数据增强任务。此外,我们将方面类别检测与方面词抽取、方面词情感极性联合训练,以进一步提升性能。这种做法促进了共享表示学习,使方面类别检测能够受益于其他任务提供的额外指导。大量实验表明,我们的框架在标准数据集上超过了强基线。

GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding

  • paper_url: http://arxiv.org/abs/2311.09707
  • repo_url: None
  • paper_authors: Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp
  • for: 提高软件开发Productivity
  • methods: 使用大型生成模型进行代码生成和代码完成,使用小型encoder-only模型进行自然语言查询代码搜索
  • results: 提出了一个新的基准数据集GenCodeSearchNet(GeCS),用于评估语言模型对不同编程语言的理解与泛化能力,并引入了一个新的人工整理子集StatCodeSearch,聚焦于R编程语言。
    Abstract Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code search tasks using natural language queries. These capabilities are heavily influenced by the quality and diversity of the available training data. Source code datasets used for training usually focus on the most popular languages and testing is mostly conducted on the same distributions, often overlooking low-resource programming languages. Motivated by the NLP generalization taxonomy proposed by Hupkes et al., we propose a new benchmark dataset called GenCodeSearchNet (GeCS) which builds upon existing natural language code search datasets to systematically evaluate the programming language understanding generalization capabilities of language models. As part of the full dataset, we introduce a new, manually curated subset StatCodeSearch that focuses on R, a popular but so far underrepresented programming language that is often used by researchers outside the field of computer science. For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models in a zero-shot setting.
    摘要 语言模型可以成为帮助软件开发者提高生产力的有力工具。大型生成模型可用于代码生成和代码补全,而较小的仅编码器模型则能够基于自然语言查询完成代码搜索任务。这些能力在很大程度上取决于可用训练数据的质量和多样性。用于训练的源代码数据集通常聚焦于最流行的编程语言,测试也大多在相同的分布上进行,往往忽视了低资源编程语言。受Hupkes等人提出的NLP泛化分类法启发,我们提出了新的基准数据集GenCodeSearchNet(GeCS),它建立在现有的自然语言代码搜索数据集之上,用于系统地评估语言模型在编程语言理解上的泛化能力。作为完整数据集的一部分,我们引入了一个新的人工整理子集StatCodeSearch,聚焦于R语言——一种流行但迄今代表性不足、常被计算机科学领域之外的研究者使用的编程语言。为便于评估和比较,我们在零样本设置下使用微调的BERT类模型和GPT类大语言模型收集了多个基线结果。

Fumbling in Babel: An Investigation into ChatGPT’s Language Identification Ability

  • paper_url: http://arxiv.org/abs/2311.09696
  • repo_url: None
  • paper_authors: Wei-Rui Chen, Ife Adebara, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed
  • for: investigate ChatGPT’s language identification abilities
  • methods: compile Babel-670 benchmark, study ChatGPT’s ability to identify language names and language codes under zero- and few-shot conditions with and without label set
  • results: ChatGPT lags behind smaller finetuned language identification tools, indicating potential for enhancement before serving diverse communities.
    Abstract Recently, ChatGPT has emerged as a powerful NLP tool that can carry out several tasks. However, the range of languages ChatGPT can handle remains largely a mystery. In this work, we investigate ChatGPT's language identification abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource and are spoken in five continents. We then study ChatGPT's (both GPT-3.5 and GPT-4) ability to (i) identify both language names and language codes (ii) under both zero- and few-shot conditions (iii) with and without provision of label set. When compared to smaller finetuned language identification tools, we find that ChatGPT lags behind. Our empirical analysis shows the reality that ChatGPT still resides in a state of potential enhancement before it can sufficiently serve diverse communities.
    摘要 近来,ChatGPT已成为能够完成多种任务的强大NLP工具,但它究竟能处理多少种语言在很大程度上仍是未知数。本研究考察ChatGPT的语言识别能力。为此,我们构建了Babel-670基准,涵盖23个语系、共670种语言。这些语言既包括资源极为丰富的语言,也包括资源极为匮乏的语言,分布于五大洲。我们进而考察ChatGPT(包括GPT-3.5和GPT-4)在零样本和少样本条件下、在提供与不提供标签集合的情况下,识别语言名称和语言代码的能力。与规模更小的微调语言识别工具相比,ChatGPT仍有差距。我们的实证分析表明,ChatGPT在能够充分服务多样化的语言社区之前,仍有提升空间。

Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness

  • paper_url: http://arxiv.org/abs/2311.09694
  • repo_url: None
  • paper_authors: Ashim Gupta, Rishanth Rajendhran, Nathan Stringham, Vivek Srikumar, Ana Marasović
  • for: 本研究旨在检验如今NLP中更大、更强的模型是否已经解决了长期存在的鲁棒性问题。
  • methods: 作者考察了19个不同规模的模型,涵盖不同的架构选择和预训练目标,并使用OOD与挑战测试集、CheckList、对比集和对抗输入进行评估。
  • results: 研究发现,并非所有OOD测试都能为鲁棒性提供更深入的洞见;基于CheckList和对比集的评估显示模型性能仍有显著差距,仅靠扩大模型规模并不足以获得足够的鲁棒性。此外,作者指出当前针对模型鲁棒性的对抗评估方法本身存在问题:它们容易被规避,且现有形式下探查得不够深入。因此,作者认为NLP中的鲁棒性问题仍未解决,甚至一些用于评估鲁棒性的方法也需要重新审视。
    Abstract Are the longstanding robustness issues in NLP resolved by today's larger and more performant models? To address this question, we conduct a thorough investigation using 19 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) OOD and challenge test sets, (b) CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all OOD tests provide further insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
    摘要 如今更大、更强的模型是否已经解决了NLP中长期存在的鲁棒性问题?为回答这一问题,我们对19个不同规模、不同架构选择和预训练目标的模型进行了全面考察。我们使用(a)分布外(OOD)测试集与挑战测试集、(b)CheckList、(c)对比集以及(d)对抗输入进行评估。分析表明,并非所有OOD测试都能为鲁棒性提供更多洞见;基于CheckList和对比集的评估显示模型性能仍存在显著差距——仅仅扩大模型规模并不能使其足够鲁棒。最后,我们指出当前对模型进行对抗评估的方法本身就存在问题:它们很容易被规避,而且就现有形式而言,并不能足够深入地考察模型的鲁棒性。我们的结论是,NLP中的鲁棒性问题不仅尚未解决,甚至某些用于衡量鲁棒性的方法本身也需要重新审视。

Inducing Political Bias Allows Language Models Anticipate Partisan Reactions to Controversies

  • paper_url: http://arxiv.org/abs/2311.09687
  • repo_url: None
  • paper_authors: Zihao He, Siyi Guo, Ashwin Rao, Kristina Lerman
  • for: 本研究旨在使用大型自然语言模型(LLM)更好地理解政治偏见在数字化对话中。
  • methods: 本研究采用了一种新方法,即使用单个经指令微调的LLM来反映政治意识形态的光谱。
  • results: 研究发现模型能够准确地捕捉到情感和道德上的细节,但在姿势检测方面存在一些挑战。
    Abstract Social media platforms are rife with politically charged discussions. Therefore, accurately deciphering and predicting partisan biases using Large Language Models (LLMs) is increasingly critical. In this study, we address the challenge of understanding political bias in digitized discourse using LLMs. While traditional approaches often rely on finetuning separate models for each political faction, our work innovates by employing a singular, instruction-tuned LLM to reflect a spectrum of political ideologies. We present a comprehensive analytical framework, consisting of Partisan Bias Divergence Assessment and Partisan Class Tendency Prediction, to evaluate the model's alignment with real-world political ideologies in terms of stances, emotions, and moral foundations. Our findings reveal the model's effectiveness in capturing emotional and moral nuances, albeit with some challenges in stance detection, highlighting the intricacies and potential for refinement in NLP tools for politically sensitive contexts. This research contributes significantly to the field by demonstrating the feasibility and importance of nuanced political understanding in LLMs, particularly for applications requiring acute awareness of political bias.
    摘要 社交媒体平台上充斥着政治色彩浓厚的讨论,因此利用大语言模型(LLM)准确解读并预测党派偏向变得日益重要。在本研究中,我们利用LLM来理解数字化话语中的政治偏向。传统方法通常为每个政治阵营分别微调一个模型,而我们的工作创新性地使用单个经指令微调的LLM来反映政治意识形态的光谱。我们提出了一个完整的分析框架,包括党派偏向分歧评估和党派类别倾向预测,从立场、情感和道德基础三个方面评估模型与现实政治意识形态的契合程度。结果表明,模型能够较好地捕捉情感和道德上的细微差别,但在立场检测方面仍面临一些挑战,这凸显了NLP工具在政治敏感情境中的复杂性与改进空间。本研究展示了让LLM具备细致政治理解的可行性与重要性,尤其适用于需要敏锐察觉政治偏向的应用,为该领域做出了重要贡献。

R-Tuning: Teaching Large Language Models to Refuse Unknown Questions

  • paper_url: http://arxiv.org/abs/2311.09677
  • repo_url: None
  • paper_authors: Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, Tong Zhang
  • for: 本研究旨在改进语言模型(LLM)的问答能力,特别是避免模型生成非存在的信息(hallucination)。
  • methods: 我们提出了一种新的approach,即Refusal-Aware Instruction Tuning(R-Tuning),通过初步确定知识差距,然后使用知识交叉构建拒绝意识数据,以便训练LLMs可以回答知道的问题而不回答未知的问题。
  • results: 实验结果表明,R-Tuning方法可以有效地提高模型回答知道问题的能力,同时避免回答未知问题。此外,在域外数据集上进行测试,发现模型学习不确定性的能力可以通过训练来提高。
    Abstract Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges. A predominant issue is the propensity for these models to generate non-existent facts, a concern termed hallucination. Our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. When the question is out of the parametric knowledge, it will try to make up something and fail to indicate when it lacks knowledge. In this paper, we present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). This approach is formalized by first identifying the knowledge gap between parametric knowledge and the instruction tuning data. Then, we construct the refusal-aware data based on the knowledge intersection, to tune LLMs to refrain from responding to questions beyond its parametric knowledge. Experimental results demonstrate this new instruction tuning approach effectively improves a model's ability to answer known questions and refrain from answering unknown questions. Furthermore, when tested on out-of-domain datasets, the refusal ability was found to be a meta-skill that could be generalized to other tasks. Further analysis surprisingly finds that learning the uncertainty during training displays a better ability to estimate uncertainty than uncertainty-based testing. Our code will be released at https://github.com/shizhediao/R-Tuning.
    摘要 大型语言模型(LLM)凭借出色的表现革新了众多领域,但仍面临挑战,其中一个突出问题是模型倾向于生成不存在的事实,即所谓的“幻觉”。我们的研究源于这样一个观察:以往的指令微调方法强迫模型补全句子,而不管模型是否掌握相应知识;当问题超出其参数知识范围时,模型会编造答案,且无法表明自己缺乏相关知识。本文提出一种新方法——拒绝感知指令微调(Refusal-Aware Instruction Tuning, R-Tuning)。该方法首先识别参数知识与指令微调数据之间的知识差距,然后基于知识交集构建拒绝感知数据,用于微调 LLM,使其不回答超出参数知识范围的问题。实验结果表明,这种新的指令微调方法能有效提升模型回答已知问题的能力,并拒绝回答未知问题。此外,在域外数据集上的测试显示,这种拒绝能力是一种可以泛化到其他任务的元技能。进一步分析还意外地发现,在训练中学习不确定性比基于不确定性的测试方法能更好地估计不确定性。我们的代码将发布于 https://github.com/shizhediao/R-Tuning。
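
The refusal-aware data construction step lends itself to a compact sketch. The snippet below is a minimal illustration rather than the authors' released pipeline: `model_answer` is a stand-in for querying the pretrained LLM closed-book, and the refusal phrasing and exact-match check are simplifying assumptions.

```python
# Minimal sketch of refusal-aware data construction in the spirit of R-Tuning.
# `model_answer` is a placeholder for a closed-book query to the pretrained LLM.

def model_answer(question: str) -> str:
    toy_parametric_knowledge = {"What is the capital of France?": "Paris"}
    return toy_parametric_knowledge.get(question, "unsure")

def build_refusal_aware_data(qa_pairs):
    """Split instruction-tuning pairs by whether the model already knows the answer.

    Known questions keep their gold answer as the target; unknown ones get a
    refusal target so the tuned model learns to decline instead of guessing.
    """
    examples = []
    for question, gold in qa_pairs:
        knows = model_answer(question).strip().lower() == gold.strip().lower()
        target = gold if knows else "I am not sure I know the answer to this question."
        examples.append({"prompt": question, "target": target})
    return examples

if __name__ == "__main__":
    data = [("What is the capital of France?", "Paris"),
            ("Who won the 2031 World Cup?", "Atlantis FC")]
    for ex in build_refusal_aware_data(data):
        print(ex)
```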

Where Do People Tell Stories Online? Story Detection Across Online Communities

  • paper_url: http://arxiv.org/abs/2311.09675
  • repo_url: https://github.com/maria-antoniak/stories-online-communities
  • paper_authors: Maria Antoniak, Joel Mire, Maarten Sap, Elliott Ash, Andrew Piper
  • for: 这篇论文旨在研究在线社区中的讲故事行为,以便更好地理解社会运动、意义建构实践、说服策略等的动态。
  • methods: 论文提出了一份标注手册(codebook),构建了 Storytelling in Online Communities Corpus 这一专家标注数据集,并在其上训练和评估在线故事检测模型,用以研究在线讲故事的特点及其在不同社会情境中的作用。
  • results: 研究揭示了在线讲故事的独特特征、讲故事在不同社区中的普遍程度,以及围绕讲故事展开的会话模式。
    Abstract People share stories online for a myriad of purposes, whether as a means of self-disclosure, processing difficult personal experiences, providing needed information or entertainment, or persuading others to share their beliefs. Better understanding of online storytelling can illuminate the dynamics of social movements, sensemaking practices, persuasion strategies, and more. However, unlike other media such as books and visual content where the narrative nature of the content is often overtly signaled at the document level, studying storytelling in online communities is challenging due to the mixture of storytelling and non-storytelling behavior, which can be interspersed within documents and across diverse topics and settings. We introduce a codebook and create the Storytelling in Online Communities Corpus, an expert-annotated dataset of 502 English-language posts and comments with labeled story and event spans. Using our corpus, we train and evaluate an online story detection model, which we use to investigate the role storytelling of in different social contexts. We identify distinctive features of online storytelling, the prevalence of storytelling among different communities, and the conversational patterns of storytelling.
    摘要 To address this challenge, we introduce a codebook and create the Storytelling in Online Communities Corpus, an expert-annotated dataset of 502 English-language posts and comments with labeled story and event spans. Using our corpus, we train and evaluate an online story detection model, which we use to investigate the role of storytelling in different social contexts. Our findings reveal distinctive features of online storytelling, the prevalence of storytelling among different communities, and the conversational patterns of storytelling.

Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring

  • paper_url: http://arxiv.org/abs/2311.09668
  • repo_url: None
  • paper_authors: Yuhang Li, Yihan Wang, Zhouxing Shi, Cho-Jui Hsieh
  • for: 在防止大语言模型(LLM)被恶意用户滥用的同时,保持加水印文本的生成质量。
  • methods: 使用Token-level watermarking技术,并提出了三种方法来预测重要性分数。
  • results: 实验结果表明,我们的方法可以生成高质量的文本,同时保持了相同的检测率。
    Abstract The strong general capabilities of Large Language Models (LLMs) bring potential ethical risks if they are unrestrictedly accessible to malicious users. Token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions with a private random number generator seeded by its prefix tokens. However, this watermarking algorithm alters the logits during generation, which can lead to a downgraded text quality if it chooses to promote tokens that are less relevant given the input. In this work, we propose to improve the quality of texts generated by a watermarked language model by Watermarking with Importance Scoring (WIS). At each generation step, we estimate the importance of the token to generate, and prevent it from being impacted by watermarking if it is important for the semantic correctness of the output. We further propose three methods to predict importance scoring, including a perturbation-based method and two model-based methods. Empirical experiments show that our method can generate texts with better quality with comparable level of detection rate.
    摘要 大语言模型(LLM)强大的通用能力意味着,一旦被恶意用户不受限制地使用,可能带来伦理风险。词元级水印通过一个以前缀词元为种子的私有随机数生成器来改变词元概率分布,从而在生成文本中嵌入水印。然而,这种水印算法会在生成过程中修改 logits,当它倾向于推动与输入相关性较低的词元时,可能导致文本质量下降。在本工作中,我们提出基于重要性评分的水印方法(Watermarking with Importance Scoring, WIS),以提升加水印语言模型生成文本的质量。在每个生成步骤中,我们估计待生成词元的重要性;若该词元对输出的语义正确性至关重要,则不对其施加水印。我们进一步提出了三种预测重要性分数的方法,包括一种基于扰动的方法和两种基于模型的方法。实验表明,我们的方法能在保持相当检测率的同时生成质量更高的文本。
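
To make the idea concrete, here is a toy sketch of importance-aware watermarking: a prefix-seeded "green list" biases the logits only when the estimated importance of the next position is low. The hashing scheme, the importance threshold, and greedy decoding are illustrative assumptions, not the paper's exact algorithm.

```python
import hashlib
import numpy as np

def green_list_mask(prefix_ids, vocab_size, gamma=0.5):
    """Pseudo-randomly partition the vocabulary, seeded by the prefix tokens."""
    seed = int(hashlib.sha256(bytes(prefix_ids)).hexdigest(), 16) % (2 ** 32)
    rng = np.random.default_rng(seed)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.choice(vocab_size, size=int(gamma * vocab_size), replace=False)] = True
    return mask

def sample_with_importance_skip(logits, prefix_ids, importance, threshold=0.8, delta=2.0):
    """Bias logits toward the green list unless the next token is deemed important.

    `importance` is a scalar in [0, 1] produced by any scoring method
    (perturbation- or model-based); high scores skip the watermark bias
    to protect semantic correctness.
    """
    logits = np.asarray(logits, dtype=float)
    if importance < threshold:  # only watermark unimportant positions
        logits = logits + delta * green_list_mask(prefix_ids, len(logits))
    return int(np.argmax(logits))  # greedy pick, for the sketch only

if __name__ == "__main__":
    fake_logits = np.random.default_rng(0).normal(size=16)
    print(sample_with_importance_skip(fake_logits, prefix_ids=[3, 7], importance=0.2))
    print(sample_with_importance_skip(fake_logits, prefix_ids=[3, 7], importance=0.95))
```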

Evaluating LLM Agent Group Dynamics against Human Group Dynamics: A Case Study on Wisdom of Partisan Crowds

  • paper_url: http://arxiv.org/abs/2311.09665
  • repo_url: None
  • paper_authors: Yun-Shiuan Chuang, Siddharth Suresh, Nikunj Harlalka, Agam Goyal, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, Timothy T. Rogers
  • for: 本研究探讨大型语言模型(LLM)能否模拟人类群体动态,尤其是在政治敏感的背景下。
  • methods: 我们让 LLM 智能体分别扮演民主党和共和党人设,在类似人类群体实验的结构化互动中进行模拟,并评估智能体的回应如何随社会影响而演变。
  • results: 我们发现,扮演细致人设且不使用思维链(CoT)推理的 LLM 智能体与人类行为高度一致,而引入 CoT 推理反而降低了这种一致性。此外,用人类数据对 LLM 智能体进行微调有望实现类人行为,但也存在过拟合特定行为的风险。这些发现揭示了 LLM 智能体在建模人类群体现象方面的潜力与局限。
    Abstract This study investigates the potential of Large Language Models (LLMs) to simulate human group dynamics, particularly within politically charged contexts. We replicate the Wisdom of Partisan Crowds phenomenon using LLMs to role-play as Democrat and Republican personas, engaging in a structured interaction akin to human group study. Our approach evaluates how agents' responses evolve through social influence. Our key findings indicate that LLM agents role-playing detailed personas and without Chain-of-Thought (CoT) reasoning closely align with human behaviors, while having CoT reasoning hurts the alignment. However, incorporating explicit biases into agent prompts does not necessarily enhance the wisdom of partisan crowds. Moreover, fine-tuning LLMs with human data shows promise in achieving human-like behavior but poses a risk of overfitting certain behaviors. These findings show the potential and limitations of using LLM agents in modeling human group phenomena.
    摘要 本研究考察大型语言模型(LLM)模拟人类群体动态的潜力,尤其是在政治敏感的情境下。我们让 LLM 智能体分别扮演民主党和共和党人设,在类似人类群体研究的结构化互动中复现“党派群体智慧”现象,并评估智能体的回应如何随社会影响而演变。主要发现是:扮演细致人设且不使用思维链(CoT)推理的 LLM 智能体与人类行为高度一致,而引入 CoT 推理会削弱这种一致性;在提示中加入显式偏见并不必然增强党派群体的智慧;用人类数据微调 LLM 有望实现类人行为,但也存在过拟合特定行为的风险。这些发现展示了使用 LLM 智能体建模人类群体现象的潜力与局限。

Evolving Domain Adaptation of Pretrained Language Models for Text Classification

  • paper_url: http://arxiv.org/abs/2311.09661
  • repo_url: None
  • paper_authors: Yun-Shiuan Chuang, Yi Wu, Dhruv Gupta, Rheeya Uppaal, Ananya Kumar, Luhang Sun, Makesh Narsimhan Sreedhar, Sijia Yang, Timothy T. Rogers, Junjie Hu
  • for: 在领域不断演化的应用场景(如立场检测)中保持语言模型的准确性。
  • methods: 本研究系统比较了多种演化领域适应策略,包括自训练、领域对抗训练和领域自适应预训练,并重点考察一种增量自训练方法。
  • results: 在多个数据集上的分析表明,增量自训练方法在应对不断演化的领域偏移时优于传统的领域适应技术。
    Abstract Adapting pre-trained language models (PLMs) for time-series text classification amidst evolving domain shifts (EDS) is critical for maintaining accuracy in applications like stance detection. This study benchmarks the effectiveness of evolving domain adaptation (EDA) strategies, notably self-training, domain-adversarial training, and domain-adaptive pretraining, with a focus on an incremental self-training method. Our analysis across various datasets reveals that this incremental method excels at adapting PLMs to EDS, outperforming traditional domain adaptation techniques. These findings highlight the importance of continually updating PLMs to ensure their effectiveness in real-world applications, paving the way for future research into PLM robustness against the natural temporal evolution of language.
    摘要 在不断演化的领域偏移(EDS)下,将预训练语言模型(PLM)适配到时间序列文本分类任务,对保持立场检测等应用的准确性至关重要。本研究系统评测了多种演化领域适应(EDA)策略的有效性,包括自训练、领域对抗训练和领域自适应预训练,并重点考察一种增量自训练方法。跨多个数据集的分析表明,该增量方法在将 PLM 适配到 EDS 方面表现突出,优于传统的领域适应技术。这些发现凸显了持续更新 PLM 以确保其在真实应用中有效性的重要性,也为未来研究 PLM 应对语言自然时间演化的鲁棒性铺平了道路。

ICXML: An In-Context Learning Framework for Zero-Shot Extreme Multi-Label Classification

  • paper_url: http://arxiv.org/abs/2311.09649
  • repo_url: https://github.com/yaxinzhuars/icxml
  • paper_authors: Yaxin Zhu, Hamed Zamani
  • for: 这篇论文针对极端多标签分类任务(XMC),其目标是从极其庞大的标签空间中为每个实例预测多个标签。
  • methods: 论文提出了一种两阶段方法 In-Context Extreme Multilabel Learning(ICXML):先通过上下文学习生成候选标签以缩小搜索空间,再对候选标签进行重排序。
  • results: 在两个公开基准上的实验结果表明,ICXML 超越了现有最佳方法。
    Abstract This paper focuses on the task of Extreme Multi-Label Classification (XMC) whose goal is to predict multiple labels for each instance from an extremely large label space. While existing research has primarily focused on fully supervised XMC, real-world scenarios often lack complete supervision signals, highlighting the importance of zero-shot settings. Given the large label space, utilizing in-context learning approaches is not trivial. We address this issue by introducing In-Context Extreme Multilabel Learning (ICXML), a two-stage framework that cuts down the search space by generating a set of candidate labels through incontext learning and then reranks them. Extensive experiments suggest that ICXML advances the state of the art on two diverse public benchmarks.
    摘要 本文关注极端多标签分类(XMC)任务,其目标是从极其庞大的标签空间中为每个实例预测多个标签。现有研究主要集中在全监督 XMC 上,而真实场景往往缺乏完整的监督信号,凸显了零样本设定的重要性。由于标签空间巨大,直接套用上下文学习方法并不容易。为此,我们提出了 In-Context Extreme Multilabel Learning(ICXML)两阶段框架:先通过上下文学习生成候选标签以缩小搜索空间,再对其进行重排序。大量实验表明,ICXML 在两个不同的公开基准上推进了当前最佳水平。
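
The generate-then-rerank pipeline can be sketched without any real LLM calls. In the toy below, string similarity stands in for both the in-context generation step and the reranking step; the function names and the product-label example are hypothetical placeholders, not the ICXML implementation.

```python
from difflib import SequenceMatcher

def llm_generate_candidates(instance: str, demos, n: int = 5):
    """Stand-in for prompting an LLM with in-context demos to propose free-form labels."""
    scored = [(SequenceMatcher(None, instance, x).ratio(), y) for x, y in demos]
    return [label for _, label in sorted(scored, reverse=True)[:n]]

def map_to_label_space(candidates, label_space):
    """Snap free-form generations onto the closest labels in the real label space."""
    mapped = [max(label_space, key=lambda lab: SequenceMatcher(None, c, lab).ratio())
              for c in candidates]
    return list(dict.fromkeys(mapped))  # dedupe, keep order

def rerank(instance: str, shortlist):
    """Stage 2: rerank the shortlist; a real system would prompt the LLM again."""
    return sorted(shortlist, reverse=True,
                  key=lambda lab: SequenceMatcher(None, instance, lab).ratio())

if __name__ == "__main__":
    labels = ["wireless earbuds", "laptop stand", "usb-c cable", "yoga mat"]
    demos = [("bluetooth headphones with mic", "wireless earbuds"),
             ("braided charging cord", "usb-c cable")]
    cands = llm_generate_candidates("noise cancelling earbuds", demos)
    print(rerank("noise cancelling earbuds", map_to_label_space(cands, labels)))
```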

Event Causality Is Key to Computational Story Understanding

  • paper_url: http://arxiv.org/abs/2311.09648
  • repo_url: None
  • paper_authors: Yidan Sun, Qin Chao, Boyang Li
  • for: 本研究探讨事件因果关系在人类故事理解中的核心作用,以及如何利用大语言模型识别事件因果关系以服务于计算式故事理解。
  • methods: 我们基于最新的大语言模型(LLM)开发了事件因果关系识别方法,通过设计特定提示从 GPT 中抽取事件间的因果关系。
  • results: 在人工标注的 GLUCOSE 事件因果关系数据集上,我们的方法与监督模型表现相当,且能轻松推广到不同类型和长度的故事。抽取出的因果关系使故事质量评估提升 5.7%,故事视频—文本对齐提升 8.7%。
    Abstract Psychological research suggests the central role of event causality in human story understanding. Further, event causality has been heavily utilized in symbolic story generation. However, few machine learning systems for story understanding employ event causality, partially due to the lack of reliable methods for identifying open-world causal event relations. Leveraging recent progress in large language models (LLMs), we present the first method for event causality identification that leads to material improvements in computational story understanding. We design specific prompts for extracting event causal relations from GPT. Against human-annotated event causal relations in the GLUCOSE dataset, our technique performs on par with supervised models, while being easily generalizable to stories of different types and lengths. The extracted causal relations lead to 5.7\% improvements on story quality evaluation and 8.7\% on story video-text alignment. Our findings indicate enormous untapped potential for event causality in computational story understanding.
    摘要 心理学研究表明,事件因果关系在人类故事理解中居于核心地位,并且在符号化故事生成中被大量使用。然而,很少有面向故事理解的机器学习系统利用事件因果关系,部分原因在于缺乏可靠的方法来识别开放世界中的事件因果关系。借助大语言模型(LLM)的最新进展,我们提出了首个能为计算式故事理解带来实质提升的事件因果关系识别方法:我们为 GPT 设计特定提示来抽取事件间的因果关系。与 GLUCOSE 数据集中人工标注的事件因果关系相比,我们的技术与监督模型表现相当,同时可以轻松推广到不同类型和长度的故事。抽取出的因果关系使故事质量评估提升 5.7%,故事视频—文本对齐提升 8.7%。这些结果表明事件因果关系在计算式故事理解中蕴藏着巨大的尚未开发的潜力。

Evaluating In-Context Learning of Libraries for Code Generation

  • paper_url: http://arxiv.org/abs/2311.09635
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Arkil Patel, Siva Reddy, Dzmitry Bahdanau, Pradeep Dasigi
  • for: 本研究旨在系统评估不同的大型语言模型(LLM)基于上下文中定义的库模块生成代码的能力与局限。
  • methods: 本研究设计了三种反映不同领域专业化程度的场景,对多种 LLM 基于上下文中给定的库进行代码生成的能力进行评估。
  • results: 结果显示,即使是 Llama-2 和 StarCoder 这类较小的开源 LLM,也能很好地依据上下文中给出的规范理解新的代码库。研究还发现,即便只提供函数的自然语言描述或原始代码实现(它们往往比示例更易获得),LLM 也能学会使用新的库模块。总体而言,这些结果为在更具适应性和动态性的编程环境中使用 LLM 铺平了道路。
    Abstract Contemporary Large Language Models (LLMs) exhibit a high degree of code generation and comprehension capability. A particularly promising area is their ability to interpret code modules from unfamiliar libraries for solving user-instructed tasks. Recent work has shown that large proprietary LLMs can learn novel library usage in-context from demonstrations. These results raise several open questions: whether demonstrations of library usage is required, whether smaller (and more open) models also possess such capabilities, etc. In this work, we take a broader approach by systematically evaluating a diverse array of LLMs across three scenarios reflecting varying levels of domain specialization to understand their abilities and limitations in generating code based on libraries defined in-context. Our results show that even smaller open-source LLMs like Llama-2 and StarCoder demonstrate an adept understanding of novel code libraries based on specification presented in-context. Our findings further reveal that LLMs exhibit a surprisingly high proficiency in learning novel library modules even when provided with just natural language descriptions or raw code implementations of the functions, which are often cheaper to obtain than demonstrations. Overall, our results pave the way for harnessing LLMs in more adaptable and dynamic coding environments.
    摘要 现代大型语言模型(LLM)表现出了高度的代码生成和理解能力。特别是在解决用户指令下的代码模块解释方面表现出了极高的能力。最近的研究表明,大型专有LLM可以通过示例学习新的库使用。这些结果提出了多个开放问题:是否需要示例学习,小型(更开放)模型也具备这种能力等。在这项工作中,我们采取了更广泛的方法,系统地评估了多种LLM在不同领域专业化的三个场景中代码生成能力。我们的结果显示,即使使用小型开源LLM like Llama-2和StarCoder,也能够很好地理解新的代码库,基于场景中提供的规范进行解释。我们的发现还表明,LLM在只有自然语言描述或Raw code实现函数时仍然能够学习新的库模块,这些函数经常比示例更容易获得。总的来说,我们的结果为使用LLM在更适应和动态编程环境中做出了重要贡献。

From Scroll to Misbelief: Modeling the Unobservable Susceptibility to Misinformation on Social Media

  • paper_url: http://arxiv.org/abs/2311.09630
  • repo_url: None
  • paper_authors: Yanchen Liu, Mingyu Derek Ma, Wenna Qin, Azure Zhou, Jiaao Chen, Weiyan Shi, Wei Wang, Diyi Yang
  • for: 这个研究的目的是提出一种计算模型,以推测用户受到谣言的程度。
  • methods: 该模型基于用户的活动记录,利用 observable sharing behavior 进行监督,以推算用户的受到谣言程度。
  • results: 评估表示,该模型的估计与人类判断相吻合度很高。此外,该研究还发现了不同社会因素对受到谣言程度的相关性。
    Abstract Susceptibility to misinformation describes the extent to believe unverifiable claims, which is hidden in people's mental process and infeasible to observe. Existing susceptibility studies heavily rely on the self-reported beliefs, making any downstream applications on susceptability hard to scale. To address these limitations, in this work, we propose a computational model to infer users' susceptibility levels given their activities. Since user's susceptibility is a key indicator for their reposting behavior, we utilize the supervision from the observable sharing behavior to infer the underlying susceptibility tendency. The evaluation shows that our model yields estimations that are highly aligned with human judgment on users' susceptibility level comparisons. Building upon such large-scale susceptibility labeling, we further conduct a comprehensive analysis of how different social factors relate to susceptibility. We find that political leanings and psychological factors are associated with susceptibility in varying degrees.
    摘要 人们的信息受感染度描述了他们信任未经验证的说法的程度,这个程度隐藏在人们的思维过程中,无法直接观察。现有的受感染性研究主要基于自我报告的信念,这使得下游应用困难扩大。为解决这些限制,在这项工作中,我们提出了一种计算模型,用于根据用户的活动来推断他们的受感染性水平。由于用户的受感染性是共享行为的关键指标,我们利用共享行为的监督来推断受感染性的倾向。我们的评估结果显示,我们的模型可以提供与人类判断高度一致的用户受感染性水平的估计。基于大规模的受感染性标签,我们进一步进行了社会因素如政治倾向和心理因素与受感染性之间的全面分析。我们发现,政治倾向和心理因素在不同程度上与受感染性相关。

Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning

  • paper_url: http://arxiv.org/abs/2311.09619
  • repo_url: None
  • paper_authors: Kazuma Hashimoto, Karthik Raman, Michael Bendersky
  • for: 本研究旨在分析不同标签策略对目标任务的影响。
  • methods: 本研究以 LLM 在给定真实输出下的输出概率,以及基于 LLM 预测的任务特定奖励,作为评估不同标注策略的效用函数。
  • results: 研究发现,当输出概率的取值分布在整个值域时(如分类任务),概率是有效的效用信号;而在分割和翻译等长输出任务上,提供细粒度的奖励值能使下游指标更加稳健。此外,研究提出了一种新的标注方法——增量效用,用于估计一个示例为 LLM 带来多少增量知识。
    Abstract In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs). Only a few demonstrations enable LLMs to be used as blackbox for new tasks. Previous studies have shown that using LLMs' outputs as labels is effective in training models to select demonstrations. Such a label is expected to estimate utility of a demonstration in ICL; however, it has not been well understood how different labeling strategies affect results on target tasks. This paper presents an analysis on different utility functions by focusing on LLMs' output probability given ground-truth output, and task-specific reward given LLMs' prediction. Unlike the previous work, we introduce a novel labeling method, incremental utility, which estimates how much incremental knowledge is brought into the LLMs by a demonstration. We conduct experiments with instruction-tuned LLMs on binary/multi-class classification, segmentation, and translation across Arabic, English, Finnish, Japanese, and Spanish. Our results show that (1) the probability is effective when the probability values are distributed across the whole value range (on the classification tasks), and (2) the downstream metric is more robust when nuanced reward values are provided with long outputs (on the segmentation and translation tasks). We then show that the proposed incremental utility further helps ICL by contrasting how the LLMs perform with and without the demonstrations.
    摘要 上下文学习(ICL)是大型语言模型(LLM)涌现出的一种能力:只需少量示例,LLM 就能像黑盒一样被用于新任务。已有研究表明,把 LLM 的输出用作标签来训练示例选择模型是有效的,这类标签被期望用来估计示例在 ICL 中的效用;然而,不同的标注策略如何影响目标任务的结果尚不清楚。本文围绕两类效用函数展开分析:LLM 在给定真实输出下的输出概率,以及基于 LLM 预测的任务特定奖励。与以往工作不同,我们引入一种新的标注方法——增量效用,用于估计一个示例为 LLM 带来多少增量知识。我们在指令微调的 LLM 上进行了涵盖阿拉伯语、英语、芬兰语、日语和西班牙语的二分类/多分类、分割和翻译实验。结果表明:(1)当概率取值分布在整个值域时(分类任务),概率是有效的;(2)在分割和翻译任务上,为长输出提供细粒度的奖励值能使下游指标更加稳健。我们进一步通过对比 LLM 在有无示例时的表现,说明所提出的增量效用能进一步帮助 ICL。
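
The incremental-utility idea reduces to a score difference: how much a demonstration improves the model's score of the gold output relative to leaving it out. The sketch below uses a toy word-overlap scorer in place of the LLM's log-probability (or a task-specific reward); the prompt format and helper names are assumptions for illustration only.

```python
import math

def log_prob_of_output(prompt: str, output: str) -> float:
    """Placeholder for the LLM's log P(output | prompt); here a toy overlap score."""
    overlap = len(set(prompt.lower().split()) & set(output.lower().split()))
    return math.log(1 + overlap)

def incremental_utility(demo, test_input: str, gold_output: str) -> float:
    """Utility of one demonstration = score with it minus score without it."""
    without = log_prob_of_output(test_input, gold_output)
    with_demo = log_prob_of_output(f"{demo[0]} -> {demo[1]}\n{test_input}", gold_output)
    return with_demo - without

def rerank_demonstrations(demos, test_input, gold_output):
    """Order candidate demonstrations by how much knowledge each one adds."""
    return sorted(demos, reverse=True,
                  key=lambda d: incremental_utility(d, test_input, gold_output))

if __name__ == "__main__":
    demos = [("translate: chat", "cat"), ("translate: chien", "dog")]
    print(rerank_demonstrations(demos, "translate: chat noir", "black cat"))
```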

Simulating Opinion Dynamics with Networks of LLM-based Agents

  • paper_url: http://arxiv.org/abs/2311.09618
  • repo_url: None
  • paper_authors: Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, Timothy T. Rogers
  • for: 这篇论文旨在模拟人类意见动态,以帮助理解极化、错误信息传播等社会现象。
  • methods: 本文使用由大型语言模型(LLM)构成的智能体网络来模拟意见动态,并通过提示工程诱导确认偏误。
  • results: 研究发现,LLM 智能体对准确信息有强烈的内在偏好,倾向于形成符合科学事实的共识;但这种偏好限制了对气候变化等议题上持抵触观点个体的模拟。通过提示工程诱导确认偏误后,可观察到与已有基于智能体的研究一致的意见分化。
    Abstract Accurately simulating human opinion dynamics is crucial for understanding a variety of societal phenomena, including polarization and the spread of misinformation. However, the agent-based models (ABMs) commonly used for such simulations lack fidelity to human behavior. We propose a new approach to simulating opinion dynamics based on populations of Large Language Models (LLMs). Our findings reveal a strong inherent bias in LLM agents towards accurate information, leading to consensus in line with scientific reality. However, this bias limits the simulation of individuals with resistant views on issues like climate change. After inducing confirmation bias through prompt engineering, we observed opinion fragmentation in line with existing agent-based research. These insights highlight the promise and limitations of LLM agents in this domain and suggest a path forward: refining LLMs with real-world discourse to better simulate the evolution of human beliefs.
    摘要 准确模拟人类意见动态对社会现象的理解具有重要意义,包括分化和信息的快速传播。然而,常用的Agent-based模型(ABM)在模拟人类行为方面缺乏准确性。我们提出一种基于大语言模型(LLM)的新方法来模拟意见动态。我们的发现表明LLM代理具有准确信息的强烈偏好,导致与科学实际相符的共识。然而,这种偏好限制了对抵抗看法的个体模拟,如气候变化。通过引入确认偏见通过提示工程,我们观察到意见分化与现有的ABM研究相符。这些发现表明LLM代理在这个领域的承诺和局限性,并建议通过与现实世界的对话来更好地模拟人类信念的演化。

On Retrieval Augmentation and the Limitations of Language Model Training

  • paper_url: http://arxiv.org/abs/2311.09615
  • repo_url: None
  • paper_authors: Ting-Rui Chiang, Xinyan Velocity Yu, Joshua Robinson, Ollie Liu, Isabelle Lee, Dani Yogatama
  • for: 这篇论文探讨一种语言模型(LM)的改进方法,即仅基于训练数据进行 k 最近邻(kNN)检索增强来降低 LM 的困惑度,并分析其背后的原因。
  • methods: 论文首先排除了“softmax 瓶颈”这一此前提出的解释,进而发现了“MLP 障碍”现象,即 LM 中最后的多层感知机(MLP)层可能在训练早期阻碍优化;并构建了两个新数据集来研究 LM 的记忆与泛化。
  • results: 研究发现,即使是 GPT-3.5-turbo 这样的先进模型,也难以对训练数据中的无关信息进行泛化;而为原始 GPT-2 117M 加入 kNN 检索能在这种设定下稳定地提升性能。
    Abstract Augmenting a language model (LM) with $k$-nearest neighbors (kNN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remains elusive. In this work, we first rule out one previously posited possibility -- the "softmax bottleneck." We further identify the MLP hurdle phenomenon, where the final MLP layer in LMs may impede LM optimization early on. We explore memorization and generalization in language models with two new datasets, where advanced model like GPT-3.5-turbo find generalizing to irrelevant information in the training data challenging. However, incorporating kNN retrieval to vanilla GPT-2 117M can consistently improve performance in this setting.
    摘要 仅在语言模型(LM)自身的训练数据上进行 k 最近邻(kNN)检索增强就能降低其困惑度,但其背后的原因仍不明确。在本工作中,我们首先排除了此前提出的一种解释——“softmax 瓶颈”。我们进一步发现了“MLP 障碍”现象,即 LM 中最后的 MLP 层可能在训练早期阻碍优化。我们利用两个新数据集研究语言模型的记忆与泛化,发现即使是 GPT-3.5-turbo 这样的先进模型,也难以对训练数据中的无关信息进行泛化;而为原始 GPT-2 117M 加入 kNN 检索能在这种设定下持续提升性能。
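
For readers unfamiliar with the retrieval augmentation being analyzed here, the following is a small self-contained sketch of the standard kNN-LM interpolation: retrieve the nearest stored contexts, turn them into a next-token distribution, and mix it with the model's own softmax. Array sizes, the distance temperature, and the interpolation weight are arbitrary toy choices.

```python
import numpy as np

def knn_distribution(query_vec, keys, next_tokens, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a next-token distribution."""
    dists = np.linalg.norm(keys - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    probs = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        probs[next_tokens[idx]] += w
    return probs / probs.sum()

def interpolate(p_lm, p_knn, lam=0.25):
    """kNN-LM: mix the parametric and retrieval-based distributions."""
    return lam * p_knn + (1.0 - lam) * p_lm

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, dim = 10, 8
    keys = rng.normal(size=(100, dim))               # stored context representations
    next_tokens = rng.integers(0, vocab, size=100)   # token that followed each context
    query = rng.normal(size=dim)                     # current context representation
    p_lm = rng.dirichlet(np.ones(vocab))             # stand-in for the LM's softmax output
    p = interpolate(p_lm, knn_distribution(query, keys, next_tokens, vocab))
    print(p.round(3), p.sum())
```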

Efficient End-to-End Visual Document Understanding with Rationale Distillation

  • paper_url: http://arxiv.org/abs/2311.09612
  • repo_url: None
  • paper_authors: Wang Zhu, Alekh Agarwal, Mandar Joshi, Robin Jia, Jesse Thomason, Kristina Toutanova
  • for: Visual document understanding benchmarks
  • methods: 使用小型预训练图像到文本模型进行选择性文本或布局认识和理解,作为末端模型的中间推理步骤。
  • results: Student model based on Pix2Struct achieved consistent improvements on three visual document understanding benchmarks, with improvements of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.
    Abstract Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate ``rationales'' on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4\% absolute over a comparable Pix2Struct model that predicts answers directly.
    摘要 理解视觉情境中的语言需要识别文本与视觉元素,并解读复杂的版面布局。当前最先进的方法通常依赖专门的预处理工具,例如将文档图像映射为文本词元信息的光学字符识别(OCR)系统,有时还借助大语言模型(LLM)在文本词元空间中进行推理。然而,外部工具和 LLM 带来的收益是以更高的计算与工程复杂度为代价的。在本文中,我们探究小型预训练图像到文本模型能否在端到端的像素级视觉语言理解模型中,将选择性的文本或版面识别与推理作为中间推断步骤来学习。我们将 OCR 工具、LLM 以及更大的多模态模型的输出作为训练数据中的中间“依据”(rationale),并训练一个小型学生模型根据这些训练样例同时预测依据和答案。基于 Pix2Struct(2.82 亿参数)的学生模型在信息图、扫描文档和图表三个视觉文档理解基准上均取得一致提升,比直接预测答案的同规模 Pix2Struct 模型绝对提升超过 4%。

GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks

  • paper_url: http://arxiv.org/abs/2311.09606
  • repo_url: None
  • paper_authors: Shivanshu Gupta, Clemens Rosenbaum, Ethan R. Elenberg
  • for: 这paper aimed to improve the in-context learning (ICL) performance of large language models (LLMs) by selecting the best examples from a candidate pool.
  • methods: The authors proposed a novel metric called GistScore, which is based on Example Gisting, a technique for training example retrievers using an attention bottleneck. They also experimented with fine-tuning gist models on each dataset and multi-task training a single model on a large collection of datasets.
  • results: The authors achieved state-of-the-art ICL performance on 21 diverse datasets spanning 9 tasks, with an average absolute gain of 20% over off-the-shelf retrievers and 7% over the best prior methods. Their multi-task model also generalizes well out-of-the-box to new task categories, datasets, and prompt templates, with retrieval speeds that are consistently thousands of times faster than the best prior training-free method.
    Abstract Large language models (LLMs) have the ability to perform in-context learning (ICL) of new tasks by conditioning on prompts comprising a few task examples. This work studies the problem of selecting the best examples given a candidate pool to improve ICL performance on given a test input. Existing approaches either require training with feedback from a much larger LLM or are computationally expensive. We propose a novel metric, GistScore, based on Example Gisting, a novel approach for training example retrievers for ICL using an attention bottleneck via Gisting, a recent technique for compressing task instructions. To tradeoff performance with ease of use, we experiment with both fine-tuning gist models on each dataset and multi-task training a single model on a large collection of datasets. On 21 diverse datasets spanning 9 tasks, we show that our fine-tuned models get state-of-the-art ICL performance with 20% absolute average gain over off-the-shelf retrievers and 7% over the best prior methods. Our multi-task model generalizes well out-of-the-box to new task categories, datasets, and prompt templates with retrieval speeds that are consistently thousands of times faster than the best prior training-free method.
    摘要 大型语言模型(LLM)能够通过由少量任务示例组成的提示,对新任务进行上下文学习(ICL)。本工作研究如何从候选池中为给定测试输入挑选最佳示例,以提升 ICL 性能。现有方法要么需要借助更大的 LLM 提供反馈进行训练,要么计算开销很大。我们提出了一个新的度量 GistScore,它基于 Example Gisting——一种借助 Gisting(一种压缩任务指令的最新技术)构造注意力瓶颈来训练 ICL 示例检索器的新方法。为在性能与易用性之间取得平衡,我们既尝试在每个数据集上微调 gist 模型,也尝试在大量数据集上多任务训练单一模型。在涵盖 9 类任务的 21 个多样化数据集上,我们微调的模型取得了当前最佳的 ICL 性能,比现成检索器平均绝对提升 20%,比此前最佳方法提升 7%。我们的多任务模型能很好地即插即用地泛化到新的任务类别、数据集和提示模板,其检索速度始终比此前最好的免训练方法快数千倍。
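
At inference time, an example selector of this kind reduces to nearest-neighbor search over compact "gist" embeddings. The sketch below assumes an abstract `embed` function (a hashed bag-of-words placeholder standing in for a trained gist encoder) and shows cosine-similarity retrieval of the top-k candidates; it is not the released GistScore implementation.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder gist encoder: a hashed bag-of-words, normalized to unit length."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def select_examples(test_input: str, candidate_pool, k: int = 4):
    """Pick the k candidates whose gist embeddings are most similar to the test input."""
    query = embed(test_input)
    scored = [(float(embed(cand_in) @ query), (cand_in, cand_out))
              for cand_in, cand_out in candidate_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:k]]

if __name__ == "__main__":
    pool = [("book a flight to paris", "flight_booking"),
            ("what is the weather tomorrow", "weather_query"),
            ("reserve a table for two", "restaurant_booking")]
    print(select_examples("please book me a plane ticket", pool, k=2))
```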

Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals

  • paper_url: http://arxiv.org/abs/2311.09605
  • repo_url: None
  • paper_authors: Yanai Elazar, Bhargavi Paranjape, Hao Peng, Sarah Wiegreffe, Khyathi Raghavi, Vivek Srikumar, Sameer Singh, Noah A. Smith
  • for: 本研究旨在检验现有的监督学习与上下文学习模型是否过度依赖训练数据中的虚假相关性,并提出反事实专注度测试(Counterfactual Attentiveness Test, CAT)来度量并改进模型对输入整体的关注程度。
  • methods: 本研究使用 CAT 系统地检验了十个数据集、四类任务(自然语言推理、阅读理解、释义检测、视觉语言推理)上的监督学习与上下文学习模型。CAT 在一定限制下用来自其他样例的对应部分替换输入的一部分,并期望专注的模型据此改变其预测。
  • results: 研究发现,模型对此类相关性的依赖主要取决于数据。此外,GPT-3 在示例数量增加时反而变得不够专注,尽管其在测试数据上的准确率有所提升。结果表明,在训练数据或示例中加入反事实样本能提升模型的专注度;并且 CAT 所度量的专注度与仅度量数据中的相关性会得出不同的结论。
    Abstract The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to correlations between a specific part of the input (e.g., the hypothesis in NLI) and the label; consequently, models trained only on those outperform chance. Are these correlations picked up by models trained on the full input data? To address this question, we propose a new evaluation method, Counterfactual Attentiveness Test (CAT). CAT uses counterfactuals by replacing part of the input with its counterpart from a different example (subject to some restrictions), expecting an attentive model to change its prediction. Using CAT, we systematically investigate established supervised and in-context learning models on ten datasets spanning four tasks: natural language inference, reading comprehension, paraphrase detection, and visual & language reasoning. CAT reveals that reliance on such correlations is mainly data-dependent. Surprisingly, we find that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves. Our results demonstrate that augmenting training or demonstration data with counterfactuals is effective in improving models' attentiveness. We show that models' attentiveness measured by CAT reveals different conclusions from solely measuring correlations in data.
    摘要 训练数据中不可避免出现的虚假相关性会损害 NLP 模型在未见数据上的泛化能力。以往研究发现,成对输入的数据集容易在输入的特定部分(例如 NLI 中的假设)与标签之间产生相关性,因而仅基于该部分训练的模型也能超过随机水平。那么,在完整输入上训练的模型是否也会利用这些相关性?为回答这一问题,我们提出了一种新的评估方法——反事实专注度测试(CAT)。CAT 在一定限制下,用来自另一个样例的对应部分替换输入的一部分,并期望专注的模型据此改变其预测。我们利用 CAT 系统考察了十个数据集、四类任务(自然语言推理、阅读理解、释义检测、视觉与语言推理)上的监督学习与上下文学习模型。CAT 显示,模型对这类相关性的依赖主要取决于数据。令人意外的是,GPT-3 在示例数量增加时变得不够专注,尽管其测试准确率有所提升。我们的结果表明,在训练数据或示例中加入反事实样本能有效提升模型的专注度;并且由 CAT 度量的专注度与仅度量数据中的相关性会得出不同的结论。
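
The counterfactual test itself is simple to express: swap one field of the input with the corresponding field from a different example and check whether the prediction moves. The sketch below assumes a two-field NLI-style input and an abstract `predict` callable; the flip-rate summary is one plausible way to report attentiveness, not the paper's exact metric.

```python
import random

def counterfactual_attentiveness(dataset, predict, swap_field="premise", seed=0):
    """Estimate how often swapping one input field changes the model's prediction.

    `dataset` is a list of dicts with the task's fields (e.g. premise/hypothesis);
    `predict` maps such a dict to a label. A higher flip rate suggests the model
    actually attends to the swapped field rather than relying on the rest alone.
    """
    rng = random.Random(seed)
    flips = 0
    for example in dataset:
        donor = rng.choice(dataset)
        counterfactual = dict(example)
        counterfactual[swap_field] = donor[swap_field]
        if predict(counterfactual) != predict(example):
            flips += 1
    return flips / len(dataset)

if __name__ == "__main__":
    # Toy "model" that ignores the premise entirely -> expected flip rate near 0.
    toy_data = [{"premise": f"p{i}", "hypothesis": f"h{i % 2}"} for i in range(20)]
    hypothesis_only = lambda ex: "entail" if ex["hypothesis"] == "h0" else "contradict"
    print(counterfactual_attentiveness(toy_data, hypothesis_only))
```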

SCORE: A framework for Self-Contradictory Reasoning Evaluation

  • paper_url: http://arxiv.org/abs/2311.09603
  • repo_url: None
  • paper_authors: Ziyi Liu, Isabelle Lee, Yongkang Du, Soumya Sanyal, Jieyu Zhao
  • for: 这 paper 旨在分析大语言模型(LLM)是否真的具备良好的理解能力,以及这种能力是如何影响下游任务的性能。
  • methods: 这 paper 使用了一种名为 \textsc{SCORE} 的框架来分析 LLM 的理解能力。特别是,它关注自相矛盾的理解,即 LLM 在处理含有上下文信息和常识的任务时,可能会出现自相矛盾的行为。
  • results: 研究发现,LLM 在多个视点 Setting 下表现不稳定,甚至对正确预测也可能表现出含糊不清的理解。这些结果指出了 LLM 的理解能力有很大的改进空间,并且需要进一步的研究来确定评价reasoning的最佳实践。
    Abstract Large language models (LLMs) have demonstrated impressive reasoning ability in various language-based tasks. Despite many proposed reasoning methods aimed at enhancing performance in downstream tasks, two fundamental questions persist: Does reasoning genuinely support predictions, and how reliable is the quality of reasoning? In this paper, we propose a framework \textsc{SCORE} to analyze how well LLMs can reason. Specifically, we focus on self-contradictory reasoning, where reasoning does not support the prediction. We find that LLMs often contradict themselves when performing reasoning tasks that involve contextual information and commonsense. The model may miss evidence or use shortcuts, thereby exhibiting self-contradictory behaviors. We also employ the Point-of-View (POV) method, which probes models to generate reasoning from multiple perspectives, as a diagnostic tool for further analysis. We find that though LLMs may appear to perform well in one-perspective settings, they fail to stabilize such behavior in multi-perspectives settings. Even for correct predictions, the reasoning may be messy and incomplete, and LLMs can easily be led astray from good reasoning. \textsc{SCORE}'s results underscore the lack of robustness required for trustworthy reasoning and the urgency for further research to establish best practices for a comprehensive evaluation of reasoning beyond accuracy-based metrics.
    摘要 大型语言模型(LLM)已在各类语言任务中展现出令人瞩目的推理能力。尽管有许多旨在提升下游任务性能的推理方法被提出,仍有两个根本问题悬而未决:推理是否真正支撑了预测?推理的质量又有多可靠?本文提出 SCORE 框架来分析 LLM 的推理能力,重点关注“自相矛盾推理”,即推理并不支持最终预测的情形。我们发现,在涉及上下文信息和常识的推理任务中,LLM 经常自相矛盾:模型可能遗漏证据或走捷径,从而表现出自相矛盾的行为。我们还采用多视角(POV)方法作为诊断工具,促使模型从多个视角生成推理。结果显示,LLM 在单一视角下看似表现良好,却难以在多视角设定下保持稳定;即便预测正确,推理也可能杂乱且不完整,模型也很容易被引导偏离良好的推理。SCORE 的结果凸显了可信推理所需的鲁棒性仍然欠缺,亟需进一步研究,以建立超越准确率指标的全面推理评估最佳实践。

Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion

  • paper_url: http://arxiv.org/abs/2311.09602
  • repo_url: None
  • paper_authors: Smriti Singh, Cornelia Caragea, Junyi Jessy Li
  • for: 这个研究是为了检验大型自然语言模型(LLM)和精度调整模型(Fine-tuned models)是否能够正确地识别情绪诱发因素(emotion triggers)。
  • methods: 该研究使用了一个新的数据集EmoTrigger,该数据集包含900个社交媒体文章,来源于三个不同的数据集,并由专家 manually annotated为情绪诱发因素。
  • results: 研究发现,情绪诱发因素并不是情绪预测模型中考虑的重要特征,而是存在详细的相互作用 между各种特征和情绪检测任务。
    Abstract Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? Prior work in emotion trigger or cause identification focused on training models to recognize events that trigger an emotion. Instead, this work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features for emotion prediction models, instead there is intricate interplay between various features and the task of emotion detection.
    摘要 情境和事件会唤起人类的情绪,但它们在多大程度上能为情绪检测模型的预测提供信息?以往关于情绪触发因素或成因识别的工作侧重于训练模型识别引发情绪的事件;本工作则考察人工标注的情绪触发因素与模型在预测情绪时认为显著的特征之间的相关程度。我们首先构建了新数据集 EmoTrigger,包含来自三个不同数据集的 900 条社交媒体帖子,由专家进行情绪触发因素标注,且标注一致性很高。基于 EmoTrigger,我们评估了大型语言模型(LLM)识别情绪触发因素的能力,并比较了 LLM 与微调模型在这类任务中所看重的特征。分析表明,情绪触发因素在很大程度上并不是情绪预测模型眼中的显著特征;情绪检测任务更依赖于各类特征之间复杂的相互作用。

LifeTox: Unveiling Implicit Toxicity in Life Advice

  • paper_url: http://arxiv.org/abs/2311.09585
  • repo_url: https://github.com/minbeomkim/LifeTox
  • paper_authors: Minbeom Kim, Jahyun Koo, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung
  • for: 这个论文的目的是为了检测生活中的隐式恶意言语。
  • methods: 这个论文使用了RoBERTa模型,并在LifeTox数据集上进行了微调。
  • results: 实验表明,RoBERTa模型在隐式恶意言语分类任务中匹配或超过了现有的大语言模型的零shot性能。
    Abstract As large language models become increasingly integrated into daily life, detecting implicit toxicity across diverse contexts is crucial. To this end, we introduce LifeTox, a dataset designed for identifying implicit toxicity within a broad range of advice-seeking scenarios. Unlike existing safety datasets, LifeTox comprises diverse contexts derived from personal experiences through open-ended questions. Experiments demonstrate that RoBERTa fine-tuned on LifeTox matches or surpasses the zero-shot performance of large language models in toxicity classification tasks. These results underscore the efficacy of LifeTox in addressing the complex challenges inherent in implicit toxicity.
    摘要 Large language models are becoming increasingly integrated into daily life, so detecting implicit toxicity across diverse contexts is crucial. To address this challenge, we introduce LifeTox, a dataset designed for identifying implicit toxicity in a broad range of advice-seeking scenarios. Unlike existing safety datasets, LifeTox includes diverse contexts derived from personal experiences through open-ended questions. Experimental results show that RoBERTa fine-tuned on LifeTox performs equally well or even better than large language models in toxicity classification tasks, demonstrating the effectiveness of LifeTox in addressing the complex challenges of implicit toxicity.

Enhancing Medical Text Evaluation with GPT-4

  • paper_url: http://arxiv.org/abs/2311.09581
  • repo_url: None
  • paper_authors: Yiqing Xie, Sheng Zhang, Hao Cheng, Zelalem Gero, Cliff Wong, Tristan Naumann, Hoifung Poon
  • for: 针对医疗文本生成评估中的准确性评价。
  • methods: 提出基于GPT-4的医疗文本评估方法,包括细致性评估方面和相关医疗领域模型训练。
  • results: 与现有评价 metric 比较,提出的GPT-4基于评价方法在医疗笔记生成和医疗报告摘要任务上显示了substantially higher的一致性。
    Abstract In the evaluation of medical text generation, it is essential to scrutinize each piece of information and ensure the utmost accuracy of the evaluation. Existing evaluation metrics either focus on coarse-level evaluation that assigns one score for the whole generated output or rely on evaluation models trained on general domain, resulting in inaccuracies when adapted to the medical domain. To address these issues, we propose a set of factuality-centric evaluation aspects and design corresponding GPT-4-based metrics for medical text generation. We systematically compare these metrics with existing ones on clinical note generation and medical report summarization tasks, revealing low inter-metric correlation. A comprehensive human evaluation confirms that the proposed GPT-4-based metrics exhibit substantially higher agreement with human judgments than existing evaluation metrics. Our study contributes to the understanding of medical text generation evaluation and offers a more reliable alternative to existing metrics.
    摘要 在医学文本生成评估中,必须仔细检查每个信息并确保评估的准确性。现有的评估指标可能会将整个生成输出的评估授予一个分数,或者基于通用领域的评估模型,导致在医学领域中出现不准确的评估。为解决这些问题,我们提出了一组中心于事实的评估方面和基于GPT-4的评估指标,用于医学文本生成。我们系统比较了这些指标与现有指标的相关性,发现它们在医学报告摘要和医学病历生成任务上显示了低相关性。人工评估表明,我们提出的GPT-4基于的评估指标与人类判断更为一致,与现有指标相比,具有更高的一致性。我们的研究增进了医学文本生成评估的理解,并提供了更可靠的评估方法。

MMOE: Mixture of Multimodal Interaction Experts

  • paper_url: http://arxiv.org/abs/2311.09580
  • repo_url: None
  • paper_authors: Haofei Yu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
  • for: 本研究旨在建模现实世界中更细微的多模态交互,例如在预测讽刺时语言与手势相互冲突的情形。
  • methods: 本研究提出了一种新方法 MMOE(多模态交互专家混合),它自动将数据点按交互类型分类,并为每种交互类型采用专门的专家模型进行处理。
  • results: 实验结果表明,MMOE 在困难的交互类型上可带来超过 10% 的提升,并使讽刺预测等任务的整体性能提高约 2%。总体而言,对交互类型的量化为数据集分析提供了新视角,并以简单的方法取得了当前最佳性能。
    Abstract Multimodal machine learning, which studies the information and interactions across various input modalities, has made significant advancements in understanding the relationship between images and descriptive text. However, this is just a portion of the potential multimodal interactions seen in the real world and does not include new interactions between conflicting utterances and gestures in predicting sarcasm, for example. Notably, the current methods for capturing shared information often do not extend well to these more nuanced interactions, sometimes performing as low as 50% in binary classification. In this paper, we address this problem via a new approach called MMOE, which stands for a mixture of multimodal interaction experts. Our method automatically classifies data points from unlabeled multimodal datasets by their interaction type and employs specialized models for each specific interaction. Based on our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% for tasks like sarcasm prediction. As a result, interaction quantification provides new insights for dataset analysis and yields simple approaches that obtain state-of-the-art performance.
    摘要 多模态机器学习研究不同输入模态之间的信息与交互,在理解图像与描述文本的关系方面取得了显著进展。然而,这只是现实世界中潜在多模态交互的一部分,并未涵盖诸如在预测讽刺时相互冲突的言语与手势之间的新型交互。值得注意的是,当前捕捉共享信息的方法往往难以推广到这些更细微的交互,在二分类上有时仅有 50% 左右的表现。本文通过一种名为 MMOE(多模态交互专家混合)的新方法来解决这一问题:它自动按交互类型对无标注多模态数据点进行分类,并为每种特定交互采用专门的模型。实验表明,该方法在这些困难交互上带来超过 10% 的提升,使讽刺预测等任务的整体性能提高 2%。因此,对交互的量化为数据集分析提供了新见解,并以简单的方法取得了当前最佳性能。
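
The mixture idea can be pictured with a tiny routing sketch: a (hypothetical) interaction-type classifier assigns each data point to one of several expert models, and only that expert produces the prediction. The keyword-style router and the expert functions below are placeholders, not the trained components from the paper.

```python
# Toy sketch of routing inputs to interaction-type experts, MMOE-style.

def classify_interaction(text_cue: str, gesture_cue: str) -> str:
    """Hypothetical interaction-type classifier (a trained model in practice)."""
    return "redundant" if text_cue == gesture_cue else "conflicting"

def redundant_expert(text_cue: str, gesture_cue: str) -> str:
    return text_cue  # modalities agree, so either cue suffices

def conflicting_expert(text_cue: str, gesture_cue: str) -> str:
    # Conflicting cues often signal non-literal meaning such as sarcasm.
    return "sarcastic-" + text_cue

EXPERTS = {"redundant": redundant_expert, "conflicting": conflicting_expert}

def mixture_predict(text_cue: str, gesture_cue: str) -> str:
    interaction = classify_interaction(text_cue, gesture_cue)
    return EXPERTS[interaction](text_cue, gesture_cue)

if __name__ == "__main__":
    print(mixture_predict("positive", "positive"))   # handled by the redundant expert
    print(mixture_predict("positive", "negative"))   # handled by the conflicting expert
```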

Crafting In-context Examples according to LMs’ Parametric Knowledge

  • paper_url: http://arxiv.org/abs/2311.09579
  • repo_url: None
  • paper_authors: Yoonsang Lee, Pranav Atreya, Xi Ye, Eunsol Choi
  • for: 本研究探讨如何构建上下文示例集,以触发语言模型调用其参数知识的行为。
  • methods: 研究根据模型对上下文示例的参数知识,将示例划分为“已知”与“未知”两类,并据此进行分析。
  • results: 实验结果表明,同时包含已知与未知信息的示例集在多种设定下表现最佳;此外,利用模型的参数知识对答案集合进行排序也能提升表现。
    Abstract In-context learning has been applied to knowledge-rich tasks such as question answering. In such scenarios, in-context examples are used to trigger a behaviour in the language model: namely, it should surface information stored in its parametric knowledge. We study the construction of in-context example sets, with a focus on the parametric knowledge of the model regarding in-context examples. We identify 'known' examples, where models can correctly answer from its parametric knowledge, and 'unknown' ones. Our experiments show that prompting with 'unknown' examples decreases the performance, potentially as it encourages hallucination rather than searching its parametric knowledge. Constructing an in-context example set that presents both known and unknown information performs the best across diverse settings. We perform analysis on three multi-answer question answering datasets, which allows us to further study answer set ordering strategies based on the LM's knowledge about each answer. Together, our study sheds lights on how to best construct in-context example sets for knowledge-rich tasks.
    摘要 上下文学习已被应用于问答等知识密集型任务。在这类场景中,上下文示例被用来触发语言模型的一种行为:调用存储在其参数知识中的信息。我们研究上下文示例集的构建,重点考察模型对这些示例本身的参数知识。我们将示例区分为模型能凭参数知识正确回答的“已知”示例和“未知”示例。实验表明,用“未知”示例作提示会降低性能,这可能是因为它助长了幻觉而非促使模型检索参数知识;而同时呈现已知与未知信息的上下文示例集在多种设定下表现最佳。我们在三个多答案问答数据集上进行了分析,并进一步研究了基于模型对各答案知识程度的答案排序策略。总体而言,本研究为知识密集型任务中上下文示例集的构建提供了启示。
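
The "known vs. unknown" split and the mixed example set can be sketched in a few lines. `closed_book_answer` is a placeholder for querying the LM without context, and the even known/unknown mix is one simple instantiation of the finding, not the paper's exact recipe.

```python
def closed_book_answer(question: str) -> str:
    """Placeholder for asking the LM the question with no supporting context."""
    toy_knowledge = {"Who wrote Hamlet?": "Shakespeare"}
    return toy_knowledge.get(question, "")

def split_by_parametric_knowledge(candidates):
    """Label each candidate (question, answer) pair as known or unknown to the model."""
    known, unknown = [], []
    for question, answer in candidates:
        bucket = known if closed_book_answer(question) == answer else unknown
        bucket.append((question, answer))
    return known, unknown

def build_prompt(candidates, test_question, n_known=2, n_unknown=2):
    """Mix known and unknown demonstrations, the combination reported to work best."""
    known, unknown = split_by_parametric_knowledge(candidates)
    demos = known[:n_known] + unknown[:n_unknown]
    lines = [f"Q: {q}\nA: {a}" for q, a in demos]
    return "\n\n".join(lines + [f"Q: {test_question}\nA:"])

if __name__ == "__main__":
    pool = [("Who wrote Hamlet?", "Shakespeare"),
            ("Who painted the ceiling of Atlantis Hall?", "Unknown Painter")]
    print(build_prompt(pool, "Who wrote Macbeth?"))
```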

A Reevaluation of Event Extraction: Past, Present, and Future Challenges

  • paper_url: http://arxiv.org/abs/2311.09562
  • repo_url: https://github.com/ej0cl6/textee
  • paper_authors: Kuan-Hao Huang, I-Hung Hsu, Tanmay Parekh, Zhiyu Xie, Zixuan Zhang, Premkumar Natarajan, Kai-Wei Chang, Nanyun Peng, Heng Ji
  • For: The paper is written for the purpose of proposing a standardized, fair, and reproducible benchmark for event extraction, and to address the evaluation challenges in recent studies.
  • Methods: The paper uses standardized data preprocessing scripts and splits for more than ten datasets across different domains, and aggregates and re-implements over ten event extraction approaches published in recent years.
  • Results: The paper conducts a comprehensive reevaluation of event extraction approaches using the proposed benchmark, and explores the capability of large language models in event extraction. The results are expected to provide a reliable benchmark for future research in the field.
    Abstract Event extraction has attracted much attention in recent years due to its potential for many applications. However, recent studies observe some evaluation challenges, suggesting that reported scores might not reflect the true performance. In this work, we first identify and discuss these evaluation challenges, including the unfair comparisons resulting from different assumptions about data or different data preprocessing steps, the incompleteness of the current evaluation framework leading to potential dataset bias or data split bias, and low reproducibility of prior studies. To address these challenges, we propose TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE contains standardized data preprocessing scripts and splits for more than ten datasets across different domains. In addition, we aggregate and re-implement over ten event extraction approaches published in recent years and conduct a comprehensive reevaluation. Finally, we explore the capability of large language models in event extraction and discuss some future challenges. We expect TextEE will serve as a reliable benchmark for event extraction, facilitating future research in the field.
    摘要 Event extraction 在最近几年内受到了广泛关注,因为它在多个应用领域中具有潜在的潜力。然而,最近的研究发现了评估挑战,表明报告的分数可能不准确反映实际表现。在这项工作中,我们首先标识和讨论了评估挑战,包括数据假设不同或数据预处理步骤不同导致的不公正比较,当前评估框架不完整,导致可能的数据偏见或数据拆分偏见,以及过去研究的低可重现性。为解决这些挑战,我们提出了 TextEE,一个标准化、公平、可重现的事件抽取benchmark。 TextEE包含了标准化的数据预处理脚本和分割,以及多个领域的超过十个数据集。此外,我们对过去十年以来发表的十多个事件抽取方法进行了汇总和重新实现,并进行了全面的重评。最后,我们探讨了大语言模型在事件抽取中的能力,并讨论了未来的挑战。我们期望 TextEE 能成为事件抽取领域的可靠 benchmark,促进未来的研究。

Pachinko: Patching Interpretable QA Models through Natural Language Feedback

  • paper_url: http://arxiv.org/abs/2311.09558
  • repo_url: https://github.com/chaitanyamalaviya/pachinko
  • paper_authors: Chaitanya Malaviya, Subin Lee, Dan Roth, Mark Yatskar
  • for: 本研究旨在通过收集用户的自然语言反馈来改进可解释问答模型的评估与修正。
  • methods: 研究采用分解式问答模型:先依据上下文和问题抽取中间依据(rationale),再仅基于该依据回答问题;并通过两项用户研究考察不同依据格式的影响。
  • results: 研究发现,依据的格式显著影响用户提供反馈的难易程度以及模型执行这些反馈的效果;某些格式还能显著提升用户对模型输出的理解与信任。
    Abstract Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to be corrected from user feedback? Further, what properties do users value to understand and trust responses? We answer these questions by analyzing the effect of rationales generated by QA models to support their answers. We specifically consider decomposed question-answering models that first extract an intermediate rationale based on a context and a question and then use solely this rationale to answer the question. A rationale outlines the approach followed by the model to answer the question. Our work considers various formats of these rationales that vary according to well-defined properties of interest. We sample these rationales from large language models using few-shot prompting for two reading comprehension datasets, and then perform two user studies. In the first one, we present users with incorrect answers and corresponding rationales of various formats and ask them to provide natural language feedback to revise the rationale. We then measure the effectiveness of this feedback in patching these rationales through in-context learning. The second study evaluates how well different rationale formats enable users to understand and trust model answers, when they are correct. We find that rationale formats significantly affect how easy it is (1) for users to give feedback for rationales, and (2) for models to subsequently execute this feedback. In addition to influencing critiquablity, certain formats significantly enhance user reported understanding and trust of model outputs.
    摘要 找到用户对NL理解模型的反馈可以有助于改进模型。然而,如何在给用户显示模型回答以便他们可以更好地修改它?而且,用户关心什么样的特性来信任和理解模型的回答呢?我们通过分析QA模型生成的论证来回答这些问题。我们专门考虑了基于上下文和问题的分解Question answering模型,它们首先从上下文和问题中提取中间论证,然后只使用这个论证回答问题。论证描述模型回答问题的方法。我们使用大型语言模型通过几个提示来采样这些论证,然后对两个阅读理解dataset进行两项用户研究。在第一项研究中,我们给用户显示错误的回答和相应的论证不同格式,并询问他们提供自然语言反馈来修改论证。我们然后测量这些反馈是否可以通过上下文学习来修复论证。第二项研究检验了不同的论证格式对用户理解和信任模型输出的影响。我们发现,不同的论证格式对用户提供反馈的容易度和模型执行这些反馈的能力有很大影响。此外,某些格式可以明显提高用户报告的理解和信任度。

Large Language Models are Few-Shot Training Example Generators: A Case Study in Fallacy Recognition

  • paper_url: http://arxiv.org/abs/2311.09552
  • repo_url: None
  • paper_authors: Tariq Alhindi, Smaranda Muresan, Preslav Nakov
  • for: 提高现有的谬误认识模型,以便更好地处理多种频率不均的谬误类型。
  • methods: 通过引入额外上下文,并利用大语言模型生成合成数据,来增强低频谬误类别的表示。
  • results: 在不同的谬误类型、数据集和生成器上进行了评估,得到了一致的提高。
    Abstract Recognizing fallacies is crucial for ensuring the quality and validity of arguments across various domains. However, computational fallacy recognition faces challenges due to the diverse genres, domains, and types of fallacies found in datasets. This leads to a highly multiclass, and even multi-label, setup with substantial class imbalance. In this study, we aim to enhance existing models for fallacy recognition by incorporating additional context and by leveraging large language models to generate synthetic data, thus increasing the representation of the infrequent classes. We experiment with GPT3.5 to generate synthetic examples and we examine the impact of prompt settings for this. Moreover, we explore zero-shot and few-shot scenarios to evaluate the effectiveness of using the generated examples for training smaller models within a unified fallacy recognition framework. Furthermore, we analyze the overlap between the synthetic data and existing fallacy datasets. Finally, we investigate the usefulness of providing supplementary context for detecting fallacy types that need such context, e.g., diversion fallacies. Our evaluation results demonstrate consistent improvements across fallacy types, datasets, and generators.
    摘要 识别谬误对于保证各领域论证的质量与有效性至关重要。然而,由于数据集中谬误的体裁、领域和类型多种多样,计算式谬误识别面临诸多挑战:这导致了一个高度多类别、甚至多标签的设定,并伴随严重的类别不平衡。在本研究中,我们通过引入额外上下文,并利用大语言模型生成合成数据以增强低频类别的表示,来改进现有的谬误识别模型。我们使用 GPT-3.5 生成合成样例,并考察提示设置的影响;同时在统一的谬误识别框架下,探索零样本和少样本场景,评估用生成样例训练较小模型的效果。我们还分析了合成数据与现有谬误数据集之间的重叠,并研究为需要上下文的谬误类型(如转移话题类谬误)提供补充上下文的作用。评估结果显示,我们的方法在不同谬误类型、数据集和生成器上均带来一致的提升。
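
Generating synthetic examples for rare fallacy classes boils down to prompting a large model with a few real instances of that class. The sketch below only builds the prompt and parses a response; `generate` is a stub standing in for whichever LLM API is used, and the class name and template wording are illustrative assumptions.

```python
def generate(prompt: str) -> str:
    """Stub for an LLM call; returns a canned response so the sketch runs end to end."""
    return ("1. Everyone is buying this phone, so it must be the best one.\n"
            "2. All my friends skip breakfast, so skipping breakfast is healthy.")

def build_generation_prompt(fallacy_type: str, seed_examples, n_new: int = 2) -> str:
    shots = "\n".join(f"- {ex}" for ex in seed_examples)
    return (f"The following are examples of the '{fallacy_type}' fallacy:\n{shots}\n\n"
            f"Write {n_new} new, diverse examples of the same fallacy, "
            f"as a numbered list, one sentence each.")

def synthesize_examples(fallacy_type: str, seed_examples, n_new: int = 2):
    raw = generate(build_generation_prompt(fallacy_type, seed_examples, n_new))
    texts = [line.split(".", 1)[1].strip() for line in raw.splitlines() if "." in line]
    return [(text, fallacy_type) for text in texts]

if __name__ == "__main__":
    seeds = ["Millions of people believe it, so it cannot be false."]
    for text, label in synthesize_examples("appeal to popularity", seeds):
        print(label, "->", text)
```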

A Speed Odyssey for Deployable Quantization of LLMs

  • paper_url: http://arxiv.org/abs/2311.09550
  • repo_url: None
  • paper_authors: Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie
  • for: 这项研究旨在提升大语言模型推理的速度与成本效益。
  • methods: 本研究采用以硬件为中心的量化方法,在排除不切实际的算法选择的同时,最大化硬件加速带来的收益。
  • results: 实验结果显示,我们的 W4A8 方法将实际推理速度提升至 Hugging Face FP16 推理的最多 4 倍、TensorRT-LLM FP16 推理的 2.23 倍、TensorRT-LLM INT8 推理的 1.45 倍,且不会明显损害性能。
    Abstract The large language model era urges faster and less costly inference. Prior model compression works on LLMs tend to undertake a software-centric approach primarily focused on the simulated quantization performance. By neglecting the feasibility of deployment, these approaches are typically disabled in real practice. They used to drastically push down the quantization bit range for a reduced computation which might not be supported by the mainstream hardware, or involve sophisticated algorithms that introduce extra computation or memory access overhead. We argue that pursuing a hardware-centric approach in the construction of quantization algorithms is crucial. In this regard, we are driven to build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments manifest the superiority of our W4A8 method which brings the actual speed boosting up to \textbf{4$\times$} compared to Hugging Face FP16 inference and \textbf{2.23$\times$} vs. the state-of-the-art inference engine TensorRT-LLM in FP16, and \textbf{1.45$\times$} vs. TensorRT-LLM in INT8, yet without substantially harming the performance.
    摘要 大语言模型时代迫切需要更快、更低成本的推理。以往针对 LLM 的模型压缩工作大多采取以软件为中心的思路,主要关注模拟量化的表现;由于忽视了部署可行性,这些方法在实际中往往难以落地:它们或将量化位宽压得过低以减少计算量,而主流硬件并不支持,或引入带来额外计算与访存开销的复杂算法。我们认为,在构建量化算法时采取以硬件为中心的思路至关重要。为此,我们在硬件感知的基础上构建压缩方法,剔除不切实际的算法选择,同时最大化硬件加速的收益。我们的方法 OdysseyLLM 包含一种新的 W4A8 内核实现 FastGEMM,以及一套组合式量化策略。大量实验表明,我们的 W4A8 方法将实际速度提升至 Hugging Face FP16 推理的 \textbf{4$\times$}、当前最佳推理引擎 TensorRT-LLM FP16 的 \textbf{2.23$\times$}、TensorRT-LLM INT8 的 \textbf{1.45$\times$},且不会明显损害性能。
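
As background for the W4A8 setting, here is a tiny simulated-quantization sketch: weights go to 4-bit integers with a per-channel scale and activations to 8-bit with a per-tensor scale, then both are dequantized to measure the error. This is plain numpy "fake quantization" for intuition only; it says nothing about the FastGEMM kernel or the paper's actual recipe.

```python
import numpy as np

def quantize_symmetric(x, n_bits, axis=None):
    """Symmetric uniform quantization: x ~= scale * q, with q in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int32), scale

def fake_quant_matmul(weights, activations):
    """Simulated W4A8 matmul: quantize, dequantize, then multiply in float."""
    w_q, w_scale = quantize_symmetric(weights, n_bits=4, axis=1)  # per output channel
    a_q, a_scale = quantize_symmetric(activations, n_bits=8)      # per tensor
    return (a_q * a_scale) @ (w_q * w_scale).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 32))   # [out_features, in_features]
    X = rng.normal(size=(4, 32))    # [batch, in_features]
    err = np.abs(fake_quant_matmul(W, X) - X @ W.T).mean()
    print(f"mean absolute error of the W4A8 simulation: {err:.4f}")
```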

Towards Pragmatic Awareness in Question Answering: A Case Study in Maternal and Infant Health

  • paper_url: http://arxiv.org/abs/2311.09542
  • repo_url: None
  • paper_authors: Neha Srikanth, Rupak Sarkar, Rachel Rudinger, Jordan Boyd-Graber
  • for: 这篇论文旨在让问答系统在母婴健康等高风险领域更有帮助地回答用户问题。
  • methods: 该论文使用大量语言模型来检测问题中含义的推理,以便在回答用户问题时能够更加准确地理解用户的需求。
  • results: 研究发现,通过检测问题中含义的推理,可以生成更加准确和有用的回答,从而避免了在回答用户问题时可能产生的危害。
    Abstract Questions posed by information-seeking users often contain implicit false or potentially harmful assumptions. In a high-risk domain such as maternal and infant health, a question-answering system must recognize these pragmatic constraints and go beyond simply answering user questions, examining them in context to respond helpfully. To achieve this, we study pragmatic inferences made when mothers ask questions about pregnancy and infant care. Some of the inferences in these questions evade detection by existing methods, risking the possibility of QA systems failing to address them which can have dangerous health and policy implications. We explore the viability of detecting inferences from questions using large language models and illustrate that informing existing QA pipelines with pragmatic inferences produces responses that can mitigate the propagation of harmful beliefs.
    摘要 信息寻求用户提出的问题往往隐含着错误或潜在有害的假设。在母婴健康这样的高风险领域,问答系统必须识别这些语用层面的约束,而不能仅仅照字面回答问题,而应结合语境给出有帮助的回应。为此,我们研究了母亲在询问怀孕和婴儿护理问题时所蕴含的语用推理。其中一些推理会逃过现有方法的检测,使问答系统可能无法回应它们,从而带来危险的健康与政策影响。我们探索了利用大型语言模型从问题中检测这类推理的可行性,并表明:将语用推理信息提供给现有问答流程,能够生成有助于遏制有害观念传播的回答。

Reducing Privacy Risks in Online Self-Disclosures with Language Models

  • paper_url: http://arxiv.org/abs/2311.09538
  • repo_url: None
  • paper_authors: Yao Dou, Isadora Krsek, Tarek Naous, Anubha Kabra, Sauvik Das, Alan Ritter, Wei Xu
  • For: 保护用户在网上自我披露时的隐私。
  • Methods: 构建包含 19 种自我披露类别的分类体系,微调语言模型进行披露识别,并开展人机交互用户研究;在用户反馈的启发下进一步提出自我披露抽象化任务,尝试多种微调策略。
  • Results: 识别模型的 Token F$_1$ 超过 75%;82% 的参与者对模型持正面评价;最佳抽象模型能生成多样化的改写,在保持较高实用性的同时适度降低隐私风险。
    Abstract Self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. In this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through identification and abstraction. We develop a taxonomy of 19 self-disclosure categories, and curate a large corpus consisting of 4.8K annotated disclosure spans. We then fine-tune a language model for identification, achieving over 75% in Token F$_1$. We further conduct a HCI user study, with 82\% of participants viewing the model positively, highlighting its real world applicability. Motivated by the user feedback, we introduce the task of self-disclosure abstraction. We experiment with both one-span abstraction and three-span abstraction settings, and explore multiple fine-tuning strategies. Our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation.
    摘要 自我披露在社交媒体互动中既常见又有益,但也带来隐私风险。在本文中,我们主动通过识别与抽象化来保护与在线自我披露相关的用户端隐私。我们构建了包含 19 种自我披露类别的分类体系,并整理出一个含 4.8K 条标注披露片段的大型语料库。随后我们微调语言模型进行披露识别,Token F$_1$ 超过 75%。我们进一步开展了人机交互用户研究,82% 的参与者对模型给予正面评价,凸显其在真实场景中的适用性。受用户反馈启发,我们提出了自我披露抽象化任务,在单片段抽象与三片段抽象两种设定下尝试了多种微调策略。人工评估表明,我们的最佳模型能生成多样化的抽象改写,在保持较高实用性的同时适度降低隐私风险。
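
Operationally, the detection model is a span/token classifier, so serving it looks like any token-classification pipeline. The snippet below uses the Hugging Face `transformers` token-classification pipeline as one plausible way to run such a model; the checkpoint name is a hypothetical placeholder (the paper's fine-tuned model is not assumed to be published under it), and the label names in the output depend entirely on whichever checkpoint is substituted.

```python
# Sketch of running a fine-tuned self-disclosure span detector.
# Assumes `pip install transformers torch`; the checkpoint name below is hypothetical,
# and any public token-classification checkpoint can be substituted to exercise the code.
from transformers import pipeline

def detect_disclosures(posts, checkpoint="your-org/self-disclosure-detector"):
    tagger = pipeline("token-classification", model=checkpoint,
                      aggregation_strategy="simple")  # merge word pieces into spans
    results = []
    for post in posts:
        spans = tagger(post)
        results.append([(s["word"], s["entity_group"], round(float(s["score"]), 3))
                        for s in spans])
    return results

if __name__ == "__main__":
    posts = ["I'm a 24F nurse in Boston and I was just diagnosed with anxiety."]
    for post, spans in zip(posts, detect_disclosures(posts)):
        print(post)
        for text, label, score in spans:
            print(f"  [{label}] {text} ({score})")
```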

Effective Large Language Model Adaptation for Improved Grounding

  • paper_url: http://arxiv.org/abs/2311.09533
  • repo_url: None
  • paper_authors: Xi Ye, Ruoxi Sun, Sercan Ö. Arik, Tomas Pfister
  • for: 解决大型语言模型(LLM)可能生成“幻觉”答案的问题,扫清其在真实应用中广泛部署的障碍。
  • methods: 提出新框架 AGREE(Adaptation of LLMs for GRounding EnhancEment),从整体视角改进答案的事实依据(grounding)问题。
  • results: 与基于提示的方法相比,通过微调让 LLM 将其论断与检索到的文档对应并给出引用,能够生成依据更充分、引用更准确的回答,且只需少量数据。
    Abstract Large language models (LLMs) have achieved remarkable advancements in natural language understanding, generation, and manipulation of text-based data. However, one major issue towards their widespread deployment in the real world is that they can generate "hallucinated" answers that are not factual. Towards this end, this paper focuses on improving grounding from a holistic perspective with a novel framework, AGREE, Adaptation of LLMs for GRounding EnhancEment. We start with the design of an iterative test-time adaptation (TTA) capability that takes into account the support information generated in self-grounded responses. To effectively enable this capability, we tune LLMs to ground the claims in their responses to retrieved documents by providing citations. This tuning on top of the pre-trained LLMs requires a small amount of data that needs to be constructed in a particular way to learn the grounding information, for which we introduce a data construction method. Our results show that the tuning-based AGREE framework generates better grounded responses with more accurate citations compared to prompting-based approaches.
    摘要 The AGREE framework focuses on improving grounding from a holistic perspective by incorporating an iterative test-time adaptation (TTA) capability that considers the support information generated in self-grounded responses. To enable this capability, we fine-tune LLMs to ground their claims in their responses to retrieved documents by providing citations. This fine-tuning process requires a small amount of specially constructed data to learn the grounding information.Our results show that the tuning-based AGREE framework generates more accurate and better grounded responses compared to prompting-based approaches. This demonstrates the effectiveness of the AGREE framework in improving the factual accuracy of LLMs' responses.

AMRFact: Enhancing Summarization Factuality Evaluation with AMR-driven Training Data Generation

  • paper_url: http://arxiv.org/abs/2311.09521
  • repo_url: None
  • paper_authors: Haoyi Qiu, Kung-Hsiang Huang, Jingnong Qu, Nanyun Peng
  • for: Improving factual consistency evaluation for abstractive summarization, a task in which preserving the integrity of information is paramount.
  • methods: Uses Abstract Meaning Representation (AMR) to generate factually inconsistent summaries as negative examples, and selects high-quality negatives with a filter based on natural language inference and BARTScore.
  • results: Experiments show the approach significantly outperforms previous systems on the AggreFact-SOTA dataset, demonstrating its ability to detect factual errors in abstractive summaries.
    Abstract Ensuring factual consistency is crucial in various natural language processing tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior entailment-based approaches often generate factually inconsistent summaries and then train a classifier on the generated data. However, summaries produced by these approaches are either of low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a novel framework that generates factually inconsistent summaries using Abstract Meaning Representation (AMR). Our approach parses factually correct summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present a data selection module NegFilter based on natural language inference and BARTScore to ensure the quality of the generated negative samples. Experimental results demonstrate that our approach significantly outperforms previous systems on the AggreFact-SOTA dataset, showcasing its efficacy in assessing factuality in abstractive summarization.
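The central move is perturbing the AMR graph of a correct summary so that it verbalises into a coherent but factually wrong negative. Below is a minimal sketch with the penman library: it swaps the ARG0/ARG1 roles of the top predicate, one simple kind of controlled inconsistency. The example graph is made up, and the AMR-to-text verbalisation and the NegFilter step (NLI plus BARTScore) are omitted.

```python
# Simplified illustration (not the AMRFact code): inject an agent-patient swap
# into an AMR graph; an AMR-to-text model would then verbalise the perturbed graph
# into a factually inconsistent summary.
import penman

amr = """
(w / win-01
   :ARG0 (t / team :name (n / name :op1 "Barcelona"))
   :ARG1 (m / match))
"""

graph = penman.decode(amr)
swapped = []
for source, role, target in graph.triples:
    if role == ":ARG0":
        role = ":ARG1"          # who did the winning...
    elif role == ":ARG1":
        role = ":ARG0"          # ...and what was won are exchanged
    swapped.append((source, role, target))

negative = penman.Graph(swapped, top=graph.top)
print(penman.encode(negative))  # graph for the inconsistent negative example
```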

Leveraging Code to Improve In-context Learning for Semantic Parsing

  • paper_url: http://arxiv.org/abs/2311.09519
  • repo_url: None
  • paper_authors: Ben Bogin, Shivanshu Gupta, Peter Clark, Ashish Sabharwal
  • for: Improving in-context learning for semantic parsing, especially when only a handful of demonstrations are available.
  • methods: Uses a general-purpose programming language such as Python instead of a domain-specific language, and augments prompts with a structured description of the domain.
  • results: Significantly improves accuracy on three popular datasets (e.g., from 7.9% to 66.5% on the SMCalFlow compositional split), reduces the number of demonstrations needed, and shows that resemblance of the target parse language to general-purpose code matters more than its popularity in pre-training corpora.
    Abstract In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we improve the effectiveness of ICL for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs, and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets. Combined, they lead to dramatic improvements (e.g., 7.9% to 66.5% on SMCalFlow compositional split), nearly closing the performance gap between easier i.i.d. and harder compositional splits when used with a strong model, and reducing the need for a large number of demonstrations. We find that the resemblance of the target parse language to general-purpose code is a more important factor than the language's popularity in pre-training corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.
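A sketch of what such a prompt could look like is shown below: a structured domain description (the available classes and functions) followed by utterance-to-Python demonstrations. The API names and demonstrations are invented for illustration and the LLM call itself is left out; the authors' actual prompts and datasets differ.

```python
# Illustrative prompt construction: semantic parsing into Python with a structured
# domain description prepended, in the spirit of the approach described above.
DOMAIN_DESCRIPTION = '''\
# Available API (domain description):
class Event:
    def __init__(self, subject: str, start: str, attendees: list[str]): ...
def find_person(name: str) -> str: ...
def create_event(event: Event) -> None: ...
'''

DEMONSTRATIONS = [
    ("Set up a meeting with Alice tomorrow at 3pm",
     'create_event(Event(subject="Meeting", start="tomorrow 3pm", '
     'attendees=[find_person("Alice")]))'),
]

def build_prompt(utterance: str) -> str:
    parts = [DOMAIN_DESCRIPTION]
    for nl, program in DEMONSTRATIONS:
        parts.append(f"# Utterance: {nl}\n{program}\n")
    parts.append(f"# Utterance: {utterance}\n")
    return "\n".join(parts)

print(build_prompt("Schedule lunch with Bob and Carol on Friday"))
# The Python program completed by the LLM is then compared against the gold parse;
# the LLM call itself is omitted here.
```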

GEE! Grammar Error Explanation with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09517
  • repo_url: None
  • paper_authors: Yixiao Song, Kalpesh Krishna, Rajesh Bhatt, Kevin Gimpel, Mohit Iyyer
  • for: Giving language learners natural language explanations of their grammatical errors, which existing grammatical error correction tools do not provide.
  • methods: A two-step pipeline that uses fine-tuned and prompted large language models to extract structured atomic token edits, and then prompts GPT-4 to generate a one-sentence explanation for each edit.
  • results: Human evaluation shows the pipeline produces 93.9% and 98.0% correct explanations on German and Chinese grammar error correction data, respectively.
    Abstract Grammatical error correction tools are effective at correcting grammatical errors in users' input sentences but do not provide users with natural language explanations about their errors. Such explanations are essential for helping users learn the language by gaining a deeper understanding of its grammatical rules (DeKeyser, 2003; Ellis et al., 2006). To address this gap, we propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences. We analyze the capability of GPT-4 in grammar error explanation, and find that it only produces explanations for 60.2% of the errors using one-shot prompting. To improve upon this performance, we develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction, followed by prompting GPT-4 to generate explanations. We evaluate our pipeline on German and Chinese grammar error correction data sampled from language learners with a wide range of proficiency levels. Human evaluation reveals that our pipeline produces 93.9% and 98.0% correct explanations for German and Chinese data, respectively. To encourage further research in this area, we will open-source our data and code.
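For intuition about the first pipeline step, the snippet below extracts atomic token edits between an erroneous and a corrected sentence using Python's standard difflib. The paper instead uses fine-tuned and prompted LLMs for this extraction, so this is only a simplified stand-in showing the kind of structured edits that would be handed to GPT-4 for explanation.

```python
# Simplified stand-in for atomic token edit extraction (the paper uses LLMs for this):
# align the erroneous and corrected sentences and list the token-level edits.
import difflib

erroneous = "She go to school yesterday".split()
corrected = "She went to school yesterday".split()

matcher = difflib.SequenceMatcher(a=erroneous, b=corrected)
edits = []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        edits.append((op, " ".join(erroneous[i1:i2]), " ".join(corrected[j1:j2])))

for op, before, after in edits:
    print(f"{op}: '{before}' -> '{after}'")
# e.g. replace: 'go' -> 'went'; each edit becomes one explanation request to GPT-4.
```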

Sequencing Matters: A Generate-Retrieve-Generate Model for Building Conversational Agents

  • paper_url: http://arxiv.org/abs/2311.09513
  • repo_url: None
  • paper_authors: Quinn Patwardhan, Grace Hui Yang
  • for: This paper describes the Georgetown InfoSense group’s approach to solving the challenges of TREC iKAT 2023.
  • methods: The approach uses a Generate-Retrieve-Generate method, which is found to outperform Retrieve-Then-Generate approaches. The solution involves using Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again.
  • results: The submitted runs outperform the median runs by a significant margin, with superior performance in nDCG across various cut numbers and overall success rate. The official results of the TREC evaluation contradict the initial self-evaluation, but the findings suggest that the sequence of involving different components matters, with LLMs being essential before using search engines.
    Abstract This paper contains what the Georgetown InfoSense group has done in regard to solving the challenges presented by TREC iKAT 2023. Our submitted runs outperform the median runs by a significant margin, exhibiting superior performance in nDCG across various cut numbers and in overall success rate. Our approach uses a Generate-Retrieve-Generate method, which we've found to greatly outpace Retrieve-Then-Generate approaches for the purposes of iKAT. Our solution involves the use of Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again. We leverage several purpose-built Language Models, including BERT, Chat-based, and text-to-transfer-based models, for text understanding, classification, generation, and summarization. The official results of the TREC evaluation contradict our initial self-evaluation, which may suggest that a decrease in the reliance on our retrieval and classification methods is better. Nonetheless, our findings suggest that the sequence of involving these different components matters, where we see an essentiality of using LLMs before using search engines.
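The sketch below strings the described stages together in a minimal Generate-Retrieve-Generate loop: a placeholder generate() function stands in for the LLM calls, retrieval uses the rank_bm25 package, and the logistic-regression passage filter is reduced to a length heuristic, so this illustrates the sequencing rather than the submitted system.

```python
# Skeleton of a Generate-Retrieve-Generate loop in the spirit described above.
from rank_bm25 import BM25Okapi

corpus = [
    "The Hubble Space Telescope was launched in 1990 aboard Space Shuttle Discovery.",
    "Hubble orbits Earth at an altitude of about 540 kilometres.",
    "The James Webb Space Telescope launched in December 2021.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def generate(prompt: str) -> str:
    # Placeholder for an LLM call; returns a canned draft here.
    return "Hubble was launched in 1990."

def answer(question: str, k: int = 2) -> str:
    draft = generate(question)                               # 1) generate a first answer
    scores = bm25.get_scores(draft.lower().split())          # 2) retrieve, using the draft as query
    ranked = sorted(zip(scores, corpus), reverse=True)[:k]
    passages = [p for s, p in ranked if len(p.split()) > 5]  # 3) crude quality filter
    context = "\n".join(passages)
    return generate(                                          # 4) generate again, grounded
        f"Question: {question}\nEvidence:\n{context}\nAnswer:"
    )

print(answer("When was the Hubble Space Telescope launched?"))
```

The point of the design, per the abstract, is the ordering: generating before retrieving gives the retriever a richer query than the raw question alone.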

One Size Does Not Fit All: Customizing Open-Domain Procedures

  • paper_url: http://arxiv.org/abs/2311.09510
  • repo_url: None
  • paper_authors: Yash Kumar Lal, Li Zhang, Faeze Brahman, Bodhisattwa Prasad Majumder, Peter Clark, Niket Tandon
  • for: Studying how well large language models (LLMs) can customize open-domain how-to procedures to users' specific needs.
  • methods: Introduces CustomPlans, a probe dataset of customization hints that encodes diverse user needs, and uses LLMs as a CustomizationAgent and an ExecutionAgent in different settings to test their customization ability.
  • results: Using the two agents in a Sequential setting works best, but the customized procedures are good enough only ~51% of the time; error analysis shows that LLMs do not sufficiently address users' customization needs.
    Abstract How-to procedures, such as how to plant a garden, are ubiquitous. But one size does not fit all - humans often need to customize these procedural plans according to their specific needs, e.g., planting a garden without pesticides. While LLMs can fluently generate generic procedures, we present the first study on how well LLMs can customize open-domain procedures. We introduce CustomPlans, a probe dataset of customization hints that encodes diverse user needs for open-domain How-to procedures. Using LLMs as CustomizationAgent and ExecutionAgent in different settings, we establish their abilities to perform open-domain procedure customization. Human evaluation shows that using these agents in a Sequential setting is the best, but they are good enough only ~51% of the time. Error analysis shows that LLMs do not sufficiently address user customization needs in their generated procedures.
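A minimal sketch of the Sequential setting is shown below, with a placeholder llm() function standing in for the model calls; the prompts are invented for illustration and are not the CustomPlans prompts.

```python
# Hedged sketch of a Sequential CustomizationAgent -> ExecutionAgent pipeline;
# `llm` is a placeholder for any chat model.
def llm(prompt: str) -> str:
    # Stand-in for an API call such as a chat completion; echoes the prompt here.
    return f"<model output for: {prompt[:60]}...>"

def customize_procedure(generic_steps: list[str], customization_hint: str) -> str:
    # CustomizationAgent: decide how each step must change to satisfy the hint.
    plan = llm(
        "Rewrite the customization hint as concrete edits to the procedure.\n"
        f"Hint: {customization_hint}\nSteps:\n" + "\n".join(generic_steps)
    )
    # ExecutionAgent: apply the edits and produce the final customized procedure.
    return llm(
        "Apply these edits to the procedure and output the revised steps.\n"
        f"Edits: {plan}\nSteps:\n" + "\n".join(generic_steps)
    )

steps = ["Choose a sunny spot.", "Prepare the soil.",
         "Apply pesticide.", "Plant the seedlings."]
print(customize_procedure(steps, "I want to plant a garden without pesticides."))
```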

SQATIN: Supervised Instruction Tuning Meets Question Answering for Improved Dialogue NLU

  • paper_url: http://arxiv.org/abs/2311.09502
  • repo_url: None
  • paper_authors: Evgeniia Razumovskaia, Goran Glavaš, Anna Korhonen, Ivan Vulić
  • for: Improving dialogue natural language understanding (NLU), especially when labelled NLU data is scarce.
  • methods: Proposes SQATIN, a new dialogue NLU framework based on instruction tuning and a question-answering formulation of the Intent Detection and Value Extraction tasks.
  • results: On established NLU benchmarks, SQATIN sets a new state of the art, substantially surpassing models trained with standard fine-tuning objectives, with particularly large gains in cross-domain transfer.
    Abstract Task-oriented dialogue (ToD) systems help users execute well-defined tasks across a variety of domains (e.g., flight booking or food ordering), with their Natural Language Understanding (NLU) components being dedicated to the analysis of user utterances, predicting users' intents (Intent Detection, ID) and extracting values for informational slots (Value Extraction, VE). In most domains, labelled NLU data is scarce, making sample-efficient learning, enabled with effective transfer paradigms, paramount. In this work, we introduce SQATIN, a new framework for dialog NLU based on (i) instruction tuning and (ii) question-answering-based formulation of ID and VE tasks. According to the evaluation on established NLU benchmarks, SQATIN sets the new state of the art in dialogue NLU, substantially surpassing the performance of current models based on standard fine-tuning objectives in both in-domain training and cross-domain transfer. SQATIN yields particularly large performance gains in cross-domain transfer, owing to the fact that our QA-based instruction tuning leverages similarities between natural language descriptions of classes (i.e., slots and intents) across domains.
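To make the QA-based formulation concrete, the snippet below turns intent detection and value extraction for a single utterance into question-answering prompts; the intent and slot inventories and the question templates are assumptions, not SQATIN's actual instructions.

```python
# Illustrative reformulation of dialogue NLU as question answering.
utterance = "Book me a table for four at an Italian place tomorrow evening."

intent_questions = {
    "book_restaurant": "Is the user asking to book a restaurant?",
    "book_flight": "Is the user asking to book a flight?",
}
slot_questions = {
    "party_size": "For how many people is the booking?",
    "cuisine": "What type of cuisine does the user want?",
    "time": "When should the booking be?",
}

def to_qa_prompts(utterance: str) -> list[tuple[str, str]]:
    prompts = []
    for intent, q in intent_questions.items():               # intent detection as yes/no QA
        prompts.append((f"intent:{intent}",
                        f"Context: {utterance}\nQuestion: {q}\nAnswer (yes/no):"))
    for slot, q in slot_questions.items():                    # value extraction as extractive QA
        prompts.append((f"slot:{slot}",
                        f"Context: {utterance}\nQuestion: {q}\nAnswer:"))
    return prompts

for name, prompt in to_qa_prompts(utterance):
    print(f"--- {name} ---\n{prompt}\n")
# Each prompt is answered by an instruction-tuned QA model; phrasing classes as
# natural language questions is what lets knowledge transfer across domains.
```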

Personalized Jargon Identification for Enhanced Interdisciplinary Communication

  • paper_url: http://arxiv.org/abs/2311.09481
  • repo_url: None
  • paper_authors: Yue Guo, Joseph Chee Chang, Maria Antoniak, Erin Bransom, Trevor Cohen, Lucy Lu Wang, Tal August
  • for: Understanding how familiar individual researchers are with technical jargon, so as to support reading and collaboration across fields.
  • methods: Collects a dataset of over 10K term familiarity annotations from 11 computer science researchers for terms drawn from 100 paper abstracts, and studies features of individual, sub-domain, and domain knowledge for predicting familiarity.
  • results: Jargon familiarity and information needs vary widely across researchers, even within the same sub-domain; prompt-based methods that include an annotator's personal publications predict individual familiarity most accurately.
    Abstract Scientific jargon can impede researchers when they read materials from other domains. Current methods of jargon identification mainly use corpus-level familiarity indicators (e.g., Simple Wikipedia represents plain language). However, researchers' familiarity of a term can vary greatly based on their own background. We collect a dataset of over 10K term familiarity annotations from 11 computer science researchers for terms drawn from 100 paper abstracts. Analysis of this data reveals that jargon familiarity and information needs vary widely across annotators, even within the same sub-domain (e.g., NLP). We investigate features representing individual, sub-domain, and domain knowledge to predict individual jargon familiarity. We compare supervised and prompt-based approaches, finding that prompt-based methods including personal publications yields the highest accuracy, though zero-shot prompting provides a strong baseline. This research offers insight into features and methods to integrate personal data into scientific jargon identification.
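The sketch below illustrates the prompt-based, personalized setup: the familiarity question is conditioned on the reader's own publication titles. The prompt wording, the 1-5 familiarity scale, and the example publications are assumptions for illustration; a zero-shot baseline would simply drop the publications block.

```python
# Illustrative personalized familiarity prompt: condition the prediction on the
# reader's own publications so familiarity is estimated per researcher.
def familiarity_prompt(term: str, abstract: str, publications: list[str]) -> str:
    pubs = "\n".join(f"- {t}" for t in publications)
    return (
        "The reader has authored the following papers:\n"
        f"{pubs}\n\n"
        f"Abstract excerpt: {abstract}\n\n"
        f"On a scale from 1 (unfamiliar) to 5 (very familiar), how familiar is this "
        f"reader likely to be with the term '{term}'? Answer with a single number."
    )

prompt = familiarity_prompt(
    term="variational inference",
    abstract="We train the parser with amortized variational inference over latent trees.",
    publications=["Dependency parsing with transformers", "Low-resource POS tagging"],
)
print(prompt)  # sent to an LLM; the numeric answer is the predicted familiarity
```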

Show Your Work with Confidence: Confidence Bands for Tuning Curves

  • paper_url: http://arxiv.org/abs/2311.09480
  • repo_url: https://github.com/nalourie/opda
  • paper_authors: Nicholas Lourie, Kyunghyun Cho, He He
  • for: Rigorously comparing natural language processing methods while accounting for how much hyperparameter tuning effort each method received.
  • methods: Presents the first method for constructing valid confidence bands for tuning curves; the bands are exact, simultaneous, and distribution-free.
  • results: Empirical analysis shows that bootstrap confidence bands fail to reach their target confidence, whereas the proposed bands achieve it exactly; a library implementing the method is released.
    Abstract The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release a library implementing the method at https://github.com/nalourie/opda .
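For readers unfamiliar with tuning curves, the snippet below computes the standard plug-in point estimate of one: the expected best validation score after k hyperparameter configurations, estimated from n observed scores. This is exactly the kind of point estimate the paper argues is insufficient on its own; the confidence bands themselves are provided by the released opda library, whose API is not reproduced here.

```python
# Plug-in point estimate of a tuning curve: E[max of k i.i.d. draws] from n
# observed validation scores. The paper's contribution is wrapping such curves
# in exact, simultaneous, distribution-free confidence bands.
import numpy as np

def tuning_curve_point_estimate(scores: np.ndarray, max_k: int) -> np.ndarray:
    v = np.sort(np.asarray(scores, dtype=float))       # v_(1) <= ... <= v_(n)
    n = len(v)
    i = np.arange(1, n + 1)
    curve = []
    for k in range(1, max_k + 1):
        # P(max of k draws equals the i-th order statistic) under the empirical CDF
        weights = (i / n) ** k - ((i - 1) / n) ** k
        curve.append(float(np.sum(weights * v)))
    return np.array(curve)

rng = np.random.default_rng(0)
scores = rng.uniform(0.60, 0.80, size=50)               # validation scores of 50 random configs
print(tuning_curve_point_estimate(scores, max_k=10).round(3))
```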

Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs

  • paper_url: http://arxiv.org/abs/2311.09469
  • repo_url: None
  • paper_authors: Michael J. Q. Zhang, Eunsol Choi
  • for: Studying how language models can resolve ambiguity by asking users clarifying questions, so as to improve AI assistants.
  • methods: Proposes a task-agnostic framework with three subtasks: determining when clarification is needed, determining what clarifying question to ask, and responding accurately with the newly gathered information; introduces intent-sim, which estimates the value of clarifying from the entropy over user intents.
  • results: intent-sim consistently outperforms existing uncertainty estimates at identifying predictions that benefit from clarification, doubling the gains of randomly selecting examples when clarification is allowed on only 10% of examples, and it is robust across a wide range of NLP tasks and LMs.
    Abstract Resolving ambiguities through interaction is a hallmark of natural language, and modeling this behavior is a core challenge in crafting AI assistants. In this work, we study such behavior in LMs by proposing a task-agnostic framework for resolving ambiguity by asking users clarifying questions. Our framework breaks down this objective into three subtasks: (1) determining when clarification is needed, (2) determining what clarifying question to ask, and (3) responding accurately with the new information gathered through clarification. We evaluate systems across three NLP applications: question answering, machine translation and natural language inference. For the first subtask, we present a novel uncertainty estimation approach, intent-sim, that determines the utility of querying for clarification by estimating the entropy over user intents. Our method consistently outperforms existing uncertainty estimation approaches at identifying predictions that will benefit from clarification. When only allowed to ask for clarification on 10% of examples, our system is able to double the performance gains over randomly selecting examples to clarify. Furthermore, we find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs. Together, our work lays foundation for studying clarifying interactions with LMs.
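A hedged sketch of an entropy-over-intents trigger in the spirit of intent-sim appears below: sample several candidate readings of the user's request, estimate the entropy of their distribution, and ask for clarification when it is high. The sampling function is a hard-coded placeholder and exact string matching stands in for clustering sampled intents, so this is not the paper's implementation.

```python
# Sketch of an entropy-over-intents clarification trigger (not the paper's code).
import math
from collections import Counter

def sample_intents(query: str, n: int = 10) -> list[str]:
    # Placeholder: in practice, sample n paraphrases of the user's intent from an LM.
    return ["nearest bank branch", "nearest bank branch", "river bank location",
            "nearest bank branch", "river bank location", "nearest bank branch",
            "nearest bank branch", "river bank location", "nearest bank branch",
            "nearest bank branch"]

def intent_entropy(query: str) -> float:
    counts = Counter(sample_intents(query))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

H = intent_entropy("Where is the closest bank?")
print(f"entropy = {H:.2f}")
if H > 0.5:                      # threshold chosen purely for illustration
    print("Ask a clarifying question before answering.")
```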