2023-10-16

cs.CL

IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models

paper_url: http://arxiv.org/abs/2310.10873
repo_url: https://github.com/skzhang1/IDEAL
paper_authors: Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, Tongliang Liu
for: This paper addresses the high annotation cost of in-context learning by introducing an influence-driven selective annotation method.
methods: The method constructs a directed graph to represent the unlabeled data, quantifies the influence of candidate unlabeled subsets with a diffusion process, and selects the most influential subset with a greedy algorithm.
results: The method achieves better performance with lower time consumption during subset selection than previous selective annotation approaches, and experiments confirm its superiority on various benchmarks.

Abstract
In-context learning is a promising paradigm that utilizes in-context examples as prompts for the predictions of large language models. These prompts are crucial for achieving strong performance. However, since the prompts need to be sampled from a large volume of annotated examples, finding the right prompt may result in high annotation costs. To address this challenge, this paper introduces an influence-driven selective annotation method that aims to minimize annotation costs while improving the quality of in-context examples. The essence of our method is to select a pivotal subset from a large-scale unlabeled data pool to annotate for the subsequent sampling of prompts. Specifically, a directed graph is first constructed to represent unlabeled data. Afterward, the influence of candidate unlabeled subsets is quantified with a diffusion process. A simple yet effective greedy algorithm for unlabeled data selection is lastly introduced. It iteratively selects the data if it provides a maximum marginal gain with respect to quantified influence. Compared with previous efforts on selective annotations, our influence-driven method works in an end-to-end manner, avoids an intractable explicit balance between data diversity and representativeness, and enjoys theoretical support. Experiments confirm the superiority of the proposed method on various benchmarks, achieving better performance under lower time consumption during subset selection. The project page is available at https://skzhang1.github.io/IDEAL/.
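
The selection step described in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the independent-cascade style diffusion, the fixed edge activation probability `p`, the Monte Carlo estimate, and the names `simulate_diffusion`, `estimate_influence`, and `greedy_select` are all hypothetical choices made for this example.

```python
import random


def simulate_diffusion(graph, seeds, p, rng):
    """Run one diffusion pass and return the set of activated nodes.

    graph: dict mapping every node to a list of its successors (directed edges).
    p: assumed per-edge activation probability (illustrative, not from the paper).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        for nxt in graph.get(node, []):
            if nxt not in active and rng.random() < p:
                active.add(nxt)
                frontier.append(nxt)
    return active


def estimate_influence(graph, seeds, p=0.5, trials=200, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)  # fixed seed so every candidate sees the same draws
    total = 0
    for _ in range(trials):
        total += len(simulate_diffusion(graph, seeds, p, rng))
    return total / trials


def greedy_select(graph, budget, p=0.5, trials=200):
    """Pick up to `budget` nodes, each time adding the one with the largest marginal gain."""
    selected = []
    for _ in range(min(budget, len(graph))):
        base = estimate_influence(graph, selected, p, trials)
        remaining = sorted(set(graph) - set(selected))  # nodes assumed sortable
        best = max(
            remaining,
            key=lambda v: estimate_influence(graph, selected + [v], p, trials) - base,
        )
        selected.append(best)
    return selected
```

In this sketch the nodes returned by `greedy_select` would be the examples sent for annotation; how the graph is built from the unlabeled pool is left out because the abstract does not describe it.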

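Summary
In-context learning is a promising paradigm that uses in-context examples as prompts for the predictions of large language models. These prompts are key to strong performance, but because they must be sampled from a large pool of annotated examples, finding the right prompt can incur high annotation costs. To address this, the paper introduces an influence-driven selective annotation method that reduces annotation cost while improving the quality of in-context examples. The core idea is to select a pivotal subset from a large unlabeled data pool and annotate it for later prompt sampling. First, a directed graph is built to represent the unlabeled data. Next, the influence of candidate unlabeled subsets is quantified with a diffusion process. Finally, a simple yet effective greedy algorithm selects, at each step, the data point with the largest marginal gain in influence. Unlike earlier selective annotation methods, the approach works end to end, avoids an intractable explicit trade-off between data diversity and representativeness, and has theoretical support. Experiments confirm its superiority on various benchmarks, with better performance and lower time consumption during subset selection. More information is available on the project page (https://skzhang1.github.io/IDEAL/).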

Will the Prince Get True Love’s Kiss? On the Model Sensitivity to Gender Perturbation over Fairytale Texts

paper_url: http://arxiv.org/abs/2310.10865
repo_url: None
paper_authors: Christina Chance, Da Yin, Dakuo Wang, Kai-Wei Chang
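for: This study examines the gender biases present in traditional fairytales and assesses the biases learned by language models by evaluating their robustness to gender perturbations.
methods: The study uses counterfactual data augmentation to evaluate the robustness of language models to gender perturbations. Specifically, it focuses on question answering over the FairytaleQA dataset and introduces counterfactual gender stereotypes during training to mitigate learned biases.
results: Models are sensitive to gender perturbations, with significant performance drops relative to the original test set. However, after first being fine-tuned on a counterfactual training dataset, models become less sensitive to subsequently introduced anti-gender-stereotyped text.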

Abstract
Recent studies show that traditional fairytales are rife with harmful gender biases. To help mitigate these gender biases in fairytales, this work aims to assess learned biases of language models by evaluating their robustness against gender perturbations. Specifically, we focus on Question Answering (QA) tasks in fairytales. Using counterfactual data augmentation to the FairytaleQA dataset, we evaluate model robustness against swapped gender character information, and then mitigate learned biases by introducing counterfactual gender stereotypes during training time. We additionally introduce a novel approach that utilizes the massive vocabulary of language models to support text genres beyond fairytales. Our experimental results suggest that models are sensitive to gender perturbations, with significant performance drops compared to the original testing set. However, when first fine-tuned on a counterfactual training dataset, models are less sensitive to the later introduced anti-gender stereotyped text.
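
Counterfactual gender perturbation of the kind the abstract describes can be illustrated with a simple word swap. This is a toy sketch, not the paper's augmentation pipeline: the `SWAPS` lexicon and the `swap_gender` function are assumptions, and ambiguous words such as "her" are deliberately left out.

```python
import re

# Assumed, deliberately tiny swap lexicon; a real study would need a curated list.
SWAPS = {
    "he": "she", "she": "he",
    "prince": "princess", "princess": "prince",
    "king": "queen", "queen": "king",
    "boy": "girl", "girl": "boy",
}

# Match any lexicon entry as a whole word, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b", flags=re.IGNORECASE
)


def swap_gender(text):
    """Replace each gendered word with its counterpart, keeping initial capitalization."""
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return PATTERN.sub(repl, text)


print(swap_gender("The prince said she was a queen."))
```

Applying the same function to both the story and its questions would keep a question-answer pair consistent, which is the property a perturbed test set needs.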

CoTFormer: More Tokens With Attention Make Up For Less Depth

paper_url: http://arxiv.org/abs/2310.10845
repo_url: None
paper_authors: Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi
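for: This paper proposes a transformer variant based on a Chain-of-Thought (CoT)-like mechanism that achieves capacity comparable to a deeper model.
methods: The paper establishes an approximate parallel between using chain-of-thought and employing a deeper transformer, and builds on it to introduce CoTFormer, a transformer variant with an implicit CoT-like mechanism.
results: Experiments show that CoTFormers significantly outperform larger standard transformers.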

Abstract
The race to continually develop ever larger and deeper foundational models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this work, we establish an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve capacity comparable to a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers, as they significantly outperform larger standard transformers.

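Summary
The race to develop ever larger and deeper foundation models continues, yet techniques such as Chain-of-Thought (CoT) still play a pivotal role in achieving optimal downstream performance. This work establishes an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, the authors introduce CoTFormer, a transformer variant that uses an implicit CoT-like mechanism to achieve capacity comparable to a deeper model. Empirical results show that CoTFormers significantly outperform larger standard transformers.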

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

paper_url: http://arxiv.org/abs/2310.10844
repo_url: None
paper_authors: Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh
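for: This paper surveys research on adversarial attacks against large language models (LLMs) and how to make AI systems more trustworthy.
methods: The survey categorizes existing work by learning structure, covering textual-only attacks, multi-modal attacks, and attacks specific to complex systems, in order to examine the security of LLMs.
results: The paper provides an overview of LLM security issues, a summary of existing research, and a discussion of potential defenses.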

Abstract
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of "jailbreak" attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

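Summary
Large language models (LLMs) are advancing rapidly in architecture and capability, and as they integrate more deeply into complex systems, the need to scrutinize their security properties grows. This paper surveys research on adversarial attacks on LLMs, an emerging field that combines natural language processing and security. Prior work has shown that even LLMs that are safety-aligned through instruction tuning and reinforcement learning from human feedback can be susceptible to attacks that exploit weaknesses and mislead AI systems, such as the "jailbreak" attacks on ChatGPT and Bard. The survey first gives an overview of large language models and their safety alignment, then categorizes existing research by learning structure: textual-only attacks, multi-modal attacks, and attacks targeting complex systems such as federated learning or multi-agent systems. It also comments on work addressing the fundamental sources of vulnerabilities and potential defenses. To make the field more accessible, it offers a systematic review of existing work, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at ACL'24.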

Fake News in Sheep’s Clothing: Robust Fake News Detection Against LLM-Empowered Style Attacks

paper_url: http://arxiv.org/abs/2310.10830
repo_url: None
paper_authors: Jiaying Wu, Bryan Hooi
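for: This paper addresses the detection of fake news disguised by large language models (LLMs), aiming to make automated detection in online news ecosystems more robust.
methods: The paper proposes SheepDog, a style-agnostic fake news detector that uses LLM-empowered, style-oriented news reframing to adapt to different news writing styles.
results: On three benchmark datasets, the method yields significant improvements over competitive baselines and enhances robustness against LLM-empowered style attacks.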

Abstract
It is commonly perceived that online fake news and reliable news exhibit stark differences in writing styles, such as the use of sensationalist versus objective language. However, we emphasize that style-related features can also be exploited for style-based attacks. Notably, the rise of powerful Large Language Models (LLMs) has enabled malicious users to mimic the style of trustworthy news outlets at minimal cost. Our analysis reveals that LLM-camouflaged fake news content leads to substantial performance degradation of state-of-the-art text-based detectors (up to 38% decrease in F1 Score), posing a significant challenge for automated detection in online ecosystems. To address this, we introduce SheepDog, a style-agnostic fake news detector robust to news writing styles. SheepDog achieves this adaptability through LLM-empowered news reframing, which customizes each article to match different writing styles using style-oriented reframing prompts. By employing style-agnostic training, SheepDog enhances its resilience to stylistic variations by maximizing prediction consistency across these diverse reframings. Furthermore, SheepDog extracts content-focused veracity attributions from LLMs, where the news content is evaluated against a set of fact-checking rationales. These attributions provide supplementary information and potential interpretability that assist veracity prediction. On three benchmark datasets, empirical results show that SheepDog consistently yields significant improvements over competitive baselines and enhances robustness against LLM-empowered style attacks.
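
One way to express "maximizing prediction consistency" across reframings is to penalize divergence between the detector's prediction on the original article and its predictions on each reframed version. The abstract does not specify the loss, so the KL divergence and the function names below are assumptions for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(original_probs, reframed_probs_list):
    """Average divergence between the original prediction and each reframing's prediction.

    original_probs: class probabilities predicted for the original article.
    reframed_probs_list: one probability list per style-reframed version.
    """
    if not reframed_probs_list:
        return 0.0
    total = sum(kl_divergence(original_probs, q) for q in reframed_probs_list)
    return total / len(reframed_probs_list)


# Hypothetical predictions over (real, fake) for one article and two reframings.
loss = consistency_loss([0.9, 0.1], [[0.85, 0.15], [0.4, 0.6]])
```

A loss of this shape is zero when every reframing receives the same prediction as the original and grows as a change of style moves the prediction.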

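Summary
Online fake news and reliable news are commonly thought to differ sharply in writing style, for example in the use of sensationalist versus objective language. However, style-related features can also be exploited for style-based attacks. In particular, the rise of powerful large language models (LLMs) lets malicious users mimic the style of trustworthy news outlets with ease, which poses a significant challenge for automated detection in online ecosystems. To address this, the authors introduce SheepDog, a style-agnostic fake news detector that remains robust across news writing styles. SheepDog uses LLMs to reframe news articles to match different writing styles and applies style-agnostic training to strengthen its resilience to stylistic variation. It also draws on content-focused veracity attributions from LLMs, which evaluate the news content, to support veracity prediction. On three benchmark datasets, SheepDog yields significant improvements over competitive baselines and enhances robustness against LLM-empowered style attacks.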

SD-HuBERT: Self-Distillation Induces Syllabic Organization in HuBERT

paper_url: http://arxiv.org/abs/2310.10803
repo_url: None
paper_authors: Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black, Gopala K. Anumanchipalli
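for: This paper studies self-supervised learning (SSL) of speech, specifically the syllabic organization that emerges when learning sentence-level speech representations.
methods: The authors fine-tune a pretrained HuBERT model with a self-distillation objective, using an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames show salient syllabic structures.
results: The model discovers syllabic structure in speech without supervision, and this structure largely corresponds to ground truth syllables. The authors also propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level speech representations. Compared with previous models, their model performs better on both unsupervised syllable discovery and sentence-level representation learning.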

Abstract
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space, limiting the utility of SSL representations. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames show salient syllabic structures. We demonstrate that this emergent structure largely corresponds to the ground truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech. When compared to previous models, our model outperforms in both unsupervised syllable discovery and learning sentence-level representation. Together, we demonstrate that the self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.

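Summary
Data-driven unit discovery in self-supervised learning (SSL) of speech has opened a new era of spoken language processing. However, the discovered units often remain in phonetic space, which limits the utility of SSL representations. The authors show that a syllabic organization emerges when learning sentence-level representations of speech. Specifically, they adopt a "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames show salient syllabic structures. This emergent structure largely corresponds to ground truth syllables. The authors also propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representations of speech. Compared with previous models, their model performs better on both unsupervised syllable discovery and sentence-level representation learning. Overall, self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and may provide novel data-driven units for spoken language modeling.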

Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

paper_url: http://arxiv.org/abs/2310.10788
repo_url: https://github.com/Hermannovski/React
paper_authors: Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, Gopala K. Anumanchipalli
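for: This paper examines self-supervised learning (SSL) based models of speech and how their internal representations relate to different aspects of speech.
methods: The paper probes the internal representations of SSL models such as HuBERT.
results: The study finds that inference of articulatory kinematics is a fundamental property of SSL models, that is, they can transform acoustics into the causal articulatory dynamics underlying the speech signal. This property largely overlaps across the languages of the training data, and with simple affine transformations it transfers across speakers, languages, and dialects. These results offer new insight into the performance of SSL models in speech engineering and open new possibilities for speech science.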

Abstract
Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained blackboxes, but many recent studies have begun "probing" models like HuBERT, to correlate their internal representations to different aspects of speech. In this paper, we show "inference of articulatory kinematics" as fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction is largely overlapping across the language of the data used to train the model, with preference to the language with similar phonological system. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferrable across speakers, even across genders, languages, and dialects, showing the generalizability of this property. Together, these results shed new light on the internals of SSL models that are critical to their superior performance, and open up new avenues into language-agnostic universal models for speech engineering, that are interpretable and grounded in speech science.
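
The transfer across speakers via "simple affine transformations" can be pictured as fitting y ≈ a·x + b between paired measurements. The sketch below fits a single scalar channel by least squares; the one-dimensional setting, the function name `fit_affine_1d`, and the sample values are assumptions, and the paper's actual features are multi-dimensional.

```python
def fit_affine_1d(xs, ys):
    """Least-squares fit of y ≈ a * x + b for paired scalar samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("xs must contain at least two distinct values")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x          # slope: how the source channel scales
    b = mean_y - a * mean_x     # offset: how it shifts
    return a, b


# Hypothetical paired trajectories for one articulator channel from two speakers.
source = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]
a, b = fit_affine_1d(source, target)
mapped = [a * x + b for x in source]  # source speaker mapped into the target's space
```

If a fit this simple leaves little residual error, the two speakers' representations differ mostly by scale and offset, which is the sense in which such a mapping would be transferable.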

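Summary
Self-supervised learning (SSL) based models of speech perform remarkably well, but these state-of-the-art models have remained black boxes. Many recent studies have begun probing models such as HuBERT to correlate their internal representations with different aspects of speech. This paper shows that inference of articulatory kinematics is a fundamental property of SSL models, that is, the ability to transform acoustics into the causal articulatory dynamics that generate speech. The authors also find that this abstraction overlaps substantially across training-data languages, particularly those with similar phonological systems. In addition, with simple affine transformations, acoustic-to-articulatory inversion (AAI) transfers across speakers, languages, and dialects, which indicates that the property generalizes. Together, these results shed new light on the internals of SSL models and open avenues toward language-agnostic universal models that are interpretable and grounded in speech science.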

BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali

paper_url: http://arxiv.org/abs/2310.10781
repo_url: None
paper_authors: Saumajit Saha, Albert Nanda
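for: This paper presents a system for detecting violence-inciting text in Bangla.
methods: The system uses both traditional and recent approaches to train its models.
results: The proposed system classifies whether a given text contains any threat. The authors study the impact of data augmentation and evaluate several transformer-based models. They obtain a macro F1 score of 68.11% on the test set and rank 23rd in the shared task.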

Abstract
This paper presents the system that we have developed while solving this shared task on violence inciting text detection in Bangla. We explain both the traditional and the recent approaches that we have used to make our models learn. Our proposed system helps to classify if the given text contains any threat. We studied the impact of data augmentation when there is a limited dataset available. Our quantitative results show that finetuning a multilingual-e5-base model performed the best in our task compared to other transformer-based architectures. We obtained a macro F1 of 68.11% in the test set and our performance in this shared task is ranked at 23 in the leaderboard.
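
For reference, macro F1 is the unweighted mean of per-class F1 scores. The following sketch computes it from scratch; the function name and the integer labels are illustrative and unrelated to the shared task's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-class example.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a model that ignores a rare class is penalized more under macro F1 than under accuracy.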

Towards reducing hallucination in extracting information from financial reports using Large Language Models

paper_url: http://arxiv.org/abs/2310.10760
repo_url: None
paper_authors: Bhaskarjit Sarmah, Tianjie Zhu, Dhagash Mehta, Stefano Pasquali
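for: To improve the efficiency and accuracy of extracting information from the question and answer (Q&A) segment of financial reports, in support of investment decisions and analysis.
methods: Uses large language models (LLMs) to extract information from earnings report transcripts rapidly and accurately, and reduces hallucination by combining retrieval-augmented generation with metadata.
results: Compares several LLMs with and without the proposed approach, using objective metrics for evaluating Q&A systems, and demonstrates the superiority of the proposed approach.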

Abstract
For a financial analyst, the question and answer (Q&A) segment of the company financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q&A section has posed considerable challenges: conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human errors, and Optical Character Recognition (OCR) and similar techniques encounter difficulties in accurately processing unstructured transcript text, often missing subtle linguistic nuances that drive investor decisions. Here, we demonstrate the utilization of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts while ensuring high accuracy, transforming the extraction process and reducing hallucination by combining a retrieval-augmented generation technique with metadata. We evaluate the outcomes of various LLMs with and without using our proposed approach based on various objective metrics for evaluating Q&A systems, and empirically demonstrate superiority of our method.

Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning

paper_url: http://arxiv.org/abs/2310.10735
repo_url: https://github.com/ryanshea10/personachat_offline_rl
paper_authors: Ryan Shea, Zhou Yu
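for: To improve the persona consistency and dialogue quality of dialogue systems.
methods: Uses an offline reinforcement learning framework that combines the advantages of supervised learning and online reinforcement learning, and introduces an importance sampling method, VaRMI, that reduces the variance of importance weights.
results: Automatic and human evaluations on an existing social chatbot show that the method improves both persona consistency and dialogue quality.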

Abstract
Maintaining a consistent persona is a key quality for any open domain dialogue system. Current state-of-the-art systems do this by training agents with supervised learning or online reinforcement learning (RL). However, systems trained with supervised learning often lack consistency as they are never punished for uttering contradictions. Additional training with RL can alleviate some of these issues, however the training process is expensive. Instead, we propose an offline RL framework to improve the persona consistency of dialogue systems. Our framework allows us to combine the advantages of previous methods as we can inexpensively train our model on existing data as in supervised learning, while punishing and rewarding specific utterances as in RL. We also introduce a simple importance sampling method to reduce the variance of importance weights in offline RL training which we call Variance-Reducing MLE-Initialized (VaRMI) importance sampling. Our automatic and human evaluations show that our framework improves both the persona consistency and dialogue quality of a state-of-the-art social chatbot.

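Summary
Maintaining a consistent persona is a key quality for any open-domain dialogue system. Current state-of-the-art systems achieve this by training with supervised learning or online reinforcement learning (RL). However, systems trained with supervised learning often lack consistency because they are never punished for uttering contradictions. Additional RL training can alleviate some of these issues, but the training process is expensive. The authors therefore propose an offline RL framework to improve the persona consistency of dialogue systems. The framework combines the advantages of supervised learning and RL and allows inexpensive training on existing data. They also propose a simple importance sampling method that reduces the variance of importance weights in offline RL training, which they call Variance-Reducing MLE-Initialized (VaRMI) importance sampling. Automatic and human evaluations show that the framework improves both the persona consistency and the dialogue quality of an existing social chatbot.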

“Mistakes Help Us Grow”: Facilitating and Evaluating Growth Mindset Supportive Language in Classrooms

paper_url: http://arxiv.org/abs/2310.10637
repo_url: None
paper_authors: Kunal Handa, Margaret Clapper, Jessica Boyle, Rose E Wang, Diyi Yang, David S Yeager, Dorottya Demszky
for: This paper explores whether large language models (LLMs) can provide automated, personalized coaching to support teachers' use of growth mindset supportive language (GMSL).
methods: The paper (1) builds a parallel dataset of GMSL-trained teacher reframings of unsupportive statements, with an accompanying annotation guide; (2) develops a GMSL prompt framework to revise teachers' unsupportive language; and (3) adopts an evaluation framework grounded in psychological theory for evaluating GMSL.
results: Both teachers and students perceive GMSL-trained teacher and model reframings as more effective in fostering a growth mindset and promoting challenge-seeking behavior, among other benefits. In addition, model-generated reframings outperform those from the GMSL-trained teachers. These results demonstrate the promise of using LLMs to provide automated GMSL feedback for teachers and, more broadly, the potential of LLMs for supporting students' learning in the classroom.

Abstract
Teachers' growth mindset supportive language (GMSL)--rhetoric emphasizing that one's skills can be improved over time--has been shown to significantly reduce disparities in academic achievement and enhance students' learning outcomes. Although teachers espouse growth mindset principles, most find it difficult to adopt GMSL in their practice due to the lack of effective coaching in this area. We explore whether large language models (LLMs) can provide automated, personalized coaching to support teachers' use of GMSL. We establish an effective coaching tool to reframe unsupportive utterances to GMSL by developing (i) a parallel dataset containing GMSL-trained teacher reframings of unsupportive statements with an accompanying annotation guide, (ii) a GMSL prompt framework to revise teachers' unsupportive language, and (iii) an evaluation framework grounded in psychological theory for evaluating GMSL with the help of students and teachers. We conduct a large-scale evaluation involving 174 teachers and 1,006 students, finding that both teachers and students perceive GMSL-trained teacher and model reframings as more effective in fostering a growth mindset and promoting challenge-seeking behavior, among other benefits. We also find that model-generated reframings outperform those from the GMSL-trained teachers. These results show promise for harnessing LLMs to provide automated GMSL feedback for teachers and, more broadly, LLMs' potentiality for supporting students' learning in the classroom. Our findings also demonstrate the benefit of large-scale human evaluations when applying LLMs in educational domains.

Data Contamination Through the Lens of Time

paper_url: http://arxiv.org/abs/2310.10628
repo_url: https://github.com/abacusai/to-the-cutoff
paper_authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley
for: This paper aims to investigate the issue of data contamination in large language models (LLMs) by analyzing the trends in LLM pass rates and their relationship with GitHub popularity and release date.
methods: The authors use a natural experiment of training cutoffs in GPT models to examine benchmarks released over time, specifically focusing on two code/mathematical problem-solving datasets, Codeforces and Project Euler. They employ a longitudinal analysis approach to identify statistically significant trends in LLM pass rates.
results: The authors find strong evidence of data contamination in LLMs, as reflected in the statistically significant trends in LLM pass rates vs. GitHub popularity and release date. They also open-source their dataset, raw results, and evaluation framework to facilitate rigorous analyses of data contamination in modern models.

Abstract
Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
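
The core intuition, that pass rates should not depend on whether a problem was released before or after the training cutoff unless contamination is present, can be illustrated with a simple split. This is a simplified stand-in for the paper's statistical analysis; the function name, the cutoff date, and the records are made up for the example.

```python
from datetime import date


def pass_rates_by_cutoff(records, cutoff):
    """Return (pass rate before cutoff, pass rate on or after cutoff).

    records: iterable of (release_date, passed) pairs, where passed is a bool.
    """
    before = [passed for released, passed in records if released < cutoff]
    after = [passed for released, passed in records if released >= cutoff]

    def rate(outcomes):
        # NaN signals that no problems fell on this side of the cutoff.
        return sum(outcomes) / len(outcomes) if outcomes else float("nan")

    return rate(before), rate(after)


# Made-up records: (problem release date, whether the model solved it).
records = [
    (date(2021, 1, 10), True), (date(2021, 3, 5), True),
    (date(2021, 6, 1), True), (date(2021, 8, 20), False),
    (date(2021, 10, 2), False), (date(2021, 11, 15), True),
    (date(2022, 1, 7), False), (date(2022, 2, 28), False),
]
# Hypothetical training cutoff, chosen only for illustration.
before_rate, after_rate = pass_rates_by_cutoff(records, cutoff=date(2021, 9, 1))
```

A large gap between the two rates is only suggestive on its own, since problem difficulty can also drift over time; that is one reason a fuller analysis would control for additional factors.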

ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a protein language diffusion model

paper_url: http://arxiv.org/abs/2310.10605
repo_url: None
paper_authors: Bo Ni, David L. Kaplan, Markus J. Buehler
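for: This study develops a generative model that predicts protein designs meeting complex nonlinear mechanical property design objectives.
methods: The model builds on a pre-trained protein language model, leveraging its deep knowledge of protein sequences, and maps mechanical unfolding responses to novel proteins.
results: Full-atom molecular simulations show that the designed proteins are novel and fulfill the targeted mechanical properties, including unfolding energy and mechanical strength, as well as the detailed unfolding force-separation curves.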

Abstract
Through evolution, nature has presented a set of remarkable protein materials, including elastins, silks, keratins and collagens with superior mechanical performances that play crucial roles in mechanobiology. However, going beyond natural designs to discover proteins that meet specified mechanical properties remains challenging. Here we report a generative model that predicts protein designs to meet complex nonlinear mechanical property-design objectives. Our model leverages deep knowledge on protein sequences from a pre-trained protein language model and maps mechanical unfolding responses to create novel proteins. Via full-atom molecular simulations for direct validation, we demonstrate that the designed proteins are novel, and fulfill the targeted mechanical properties, including unfolding energy and mechanical strength, as well as the detailed unfolding force-separation curves. Our model offers rapid pathways to explore the enormous mechanobiological protein sequence space unconstrained by biological synthesis, using mechanical features as target to enable the discovery of protein materials with superior mechanical properties.

Motion2Language, Unsupervised learning of synchronized semantic motion segmentation

paper_url: http://arxiv.org/abs/2310.10594
repo_url: https://github.com/rd20karim/M2T-Segmentation
paper_authors: Karim Radouane, Andon Tchechmedjiev, Sylvie Ranwez, Julien Lagarde
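for: This paper aims to build a sequence-to-sequence architecture that translates motion capture inputs into English natural-language descriptions, generated synchronously with the actions performed.
methods: The paper proposes a new recurrent formulation of local attention suited to synchronous text generation, and an improved motion encoder architecture better suited to smaller datasets and synchronous generation.
results: Experiments show that the attention mechanism and the encoder architecture additively improve the quality of the generated text (BLEU and semantic equivalence) as well as its synchronization.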

Abstract
In this paper, we investigate building a sequence to sequence architecture for motion to language translation and synchronization. The aim is to translate motion capture inputs into English natural-language descriptions, such that the descriptions are generated synchronously with the actions performed, enabling semantic segmentation as a byproduct, but without requiring synchronized training data. We propose a new recurrent formulation of local attention that is suited for synchronous/live text generation, as well as an improved motion encoder architecture better suited to smaller data and for synchronous generation. We evaluate both contributions in individual experiments, using the standard BLEU4 metric, as well as a simple semantic equivalence measure, on the KIT motion language dataset. In a follow-up experiment, we assess the quality of the synchronization of generated text in our proposed approaches through multiple evaluation metrics. We find that both contributions to the attention mechanism and the encoder architecture additively improve the quality of generated text (BLEU and semantic equivalence), but also of synchronization. Our code will be made available at https://github.com/rd20karim/M2T-Segmentation/tree/main

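Summary
This paper investigates building a sequence-to-sequence architecture for motion-to-language translation and synchronization. The goal is to translate motion capture inputs into English natural-language descriptions that are generated synchronously with the actions performed, without requiring synchronized training data. The authors propose a new recurrent formulation of local attention suited to synchronous text generation, and an improved motion encoder architecture better suited to smaller data and synchronous generation. They evaluate both contributions in individual experiments on the KIT motion language dataset, using the standard BLEU4 metric and a simple semantic equivalence measure. In a follow-up experiment, they assess the synchronization quality of the generated text with multiple evaluation metrics. Both the attention mechanism and the encoder architecture improve the quality of the generated text (BLEU and semantic equivalence) as well as its synchronization. The code will be made available at https://github.com/rd20karim/M2T-Segmentation/tree/main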

Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment

paper_url: http://arxiv.org/abs/2310.10590
repo_url: None
paper_authors: Ji Qi, Kaixuan Ji, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Lei Hou, Juanzi Li, Bin Xu
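for: To tackle open information extraction (OIE), the task of extracting objective structured knowledge from natural text, with large language models rather than dedicated models.
methods: Uses in-context learning with LLMs, and proposes a method to estimate the discrepancy in syntactic distribution between an LLM and test samples, which serves as correlation evidence for preparing positive demonstrations.
results: On the standard CaRB benchmark, the 6-shot approach outperforms the state-of-the-art supervised method, achieving a 55.3 F1 score. The method also generalizes naturally to TACRED and ACE05, with improvements of 5.7 and 6.8 F1 points, respectively.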

Abstract
Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts, which has attracted growing attention to build dedicated models with human experience. As the large language models (LLMs) have exhibited remarkable in-context learning capabilities, a question arises as to whether the task of OIE can be effectively tackled with this paradigm. In this paper, we explore solving the OIE problem by constructing an appropriate reasoning environment for LLMs. Specifically, we first propose a method to effectively estimate the discrepancy of syntactic distribution between an LLM and test samples, which can serve as correlation evidence for preparing positive demonstrations. Upon the evidence, we introduce a simple yet effective mechanism to establish the reasoning environment for LLMs on specific tasks. Without bells and whistles, experimental results on the standard CaRB benchmark demonstrate that our 6-shot approach outperforms the state-of-the-art supervised method, achieving a 55.3 F1 score. Further experiments on TACRED and ACE05 show that our method can naturally generalize to other information extraction tasks, resulting in improvements of 5.7 and 6.8 F1 scores, respectively.


BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

paper_url: http://arxiv.org/abs/2310.10586
repo_url: None
paper_authors: Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li
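for: This paper proposes a fast adaptive framework for generating textual responses to user instructions about videos.
methods: The framework uses large language models (LLMs) for video understanding and knowledge reasoning. The authors find that the key to responding to specific instructions is to concentrate on the relevant video events, and they use two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent event information. An LLM equipped with world knowledge then acts as the reasoning agent and produces the response through multiple reasoning steps.
results: The framework achieves state-of-the-art performance on two typical video-based text generation tasks without any training.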

Abstract
Building models that generate textual responses to user instructions for videos is a practical and challenging topic, as it requires both vision understanding and knowledge reasoning. Compared to language and image modalities, training efficiency remains a serious problem as existing studies train models on massive sparse videos aligned with brief descriptions. In this paper, we introduce BiLL-VTG, a fast adaptive framework that leverages large language models (LLMs) to reason about videos based on essential lightweight visual tools. Specifically, we reveal that the key to responding to specific instructions is to concentrate on relevant video events, and utilize two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent the event information. Thus, an LLM equipped with world knowledge is adopted as the reasoning agent to achieve the response by performing multiple reasoning steps on specified video events. To address the difficulty of specifying events for the agent, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm based on efficient Hungarian matching to localize corresponding video events using linguistic instructions, enabling LLMs to interact with long videos. Extensive experiments on two typical video-based text generation tasks show that our tuning-free framework outperforms pre-trained models including Flamingo-80B, achieving state-of-the-art performance.

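Summary
Building models that generate textual responses to user instructions about videos is a practical and challenging topic, because it requires both visual understanding and knowledge reasoning. Compared with the language and image modalities, training efficiency remains a serious problem, since existing studies typically train on massive amounts of sparse video paired with brief descriptions. This paper introduces the BiLL-VTG framework, which uses large language models (LLMs) to reason over the key events in a video. The authors find that the key to responding to a specific instruction is to focus on the relevant video events, and they use two visual tools, structured scene graph generation and descriptive image caption generation, to collect and represent event information. An LLM equipped with world knowledge then serves as the reasoning agent and produces the response through multiple reasoning steps over the specified video events. To address the difficulty of specifying events, they further propose the Instruction-oriented Video Events Recognition (InsOVER) algorithm, based on efficient Hungarian matching, which lets LLMs interact with long videos. Extensive experiments on two typical video-based text generation tasks show that the tuning-free framework outperforms pre-trained models such as Flamingo-80B.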

Who Are All The Stochastic Parrots Imitating? They Should Tell Us!

paper_url: http://arxiv.org/abs/2310.10583
repo_url: None
paper_authors: Sagi Shaier, Lawrence E. Hunter, Katharina von der Wense
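for: This paper is mainly concerned with the trustworthiness of language models (LMs).
methods: The authors propose building LMs that can cite their training data, so that the truthfulness of LM-generated statements can be verified quickly.
results: The authors argue that LMs in their current state will never be fully trustworthy in critical settings, and suggest a new strategy for addressing this: giving LMs the ability to cite their training data.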

Abstract
Both standalone language models (LMs) and LMs within downstream-task systems have been shown to generate statements which are factually untrue. This problem is especially severe for low-resource languages, where training data is scarce and of worse quality than for high-resource languages. In this opinion piece, we argue that LMs in their current state will never be fully trustworthy in critical settings and suggest a possible novel strategy to handle this issue: building LMs that can cite their sources, i.e., point a user to the parts of their training data that back up their outputs. We first discuss which current NLP tasks would or would not benefit from such models. We then highlight the expected benefits such models would bring, e.g., quick verifiability of statements. We end by outlining the individual tasks that would need to be solved on the way to developing LMs with the ability to cite. We hope to start a discussion about the field's current approach to building LMs, especially for low-resource languages, and the role of the training data in explaining model generations.

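Summary
Language models (LMs), whether standalone or embedded in downstream NLP systems, can generate statements that are factually untrue. The problem is especially severe for low-resource languages, where training data is scarce and of lower quality. In this opinion piece, we argue that LMs in their current state will never be fully trustworthy in critical settings, and we propose a possible new strategy for handling this: building LMs that can point to the parts of their training data that support their outputs. We first discuss which current NLP tasks would or would not benefit from such models, then describe the benefits they are expected to bring, such as quick verification of statements. Finally, we list the tasks that need to be solved in order to develop LMs that can cite their training data. We hope to start a discussion about the current approach to building LMs, especially for low-resource languages, and about the role of training data in explaining model generations.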

Emerging Challenges in Personalized Medicine: Assessing Demographic Effects on Biomedical Question Answering Systems

paper_url: http://arxiv.org/abs/2310.10571
repo_url: None
paper_authors: Sagi Shaier, Kevin Bennett, Lawrence Hunter, Katharina von der Wense
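for: This study examines whether biomedical question answering models are affected by patient demographics, with the aim of ensuring fairness in medical care.
methods: The study tests different types of QA models, including knowledge graph (KG)-grounded and text-based systems.
results: Irrelevant demographic information changes the answers of QA models: up to 15% of answers for the KG-grounded system and up to 23% for the text-based system. Some of these changes affect accuracy.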

Abstract
State-of-the-art question answering (QA) models exhibit a variety of social biases (e.g., with respect to sex or race), generally explained by similar issues in their training data. However, what has been overlooked so far is that in the critical domain of biomedicine, any unjustified change in model output due to patient demographics is problematic: it results in the unfair treatment of patients. Selecting only questions on biomedical topics whose answers do not depend on ethnicity, sex, or sexual orientation, we ask the following research questions: (RQ1) Do the answers of QA models change when being provided with irrelevant demographic information? (RQ2) Does the answer of RQ1 differ between knowledge graph (KG)-grounded and text-based QA systems? We find that irrelevant demographic information changes up to 15% of the answers of a KG-grounded system and up to 23% of the answers of a text-based system, including changes that affect accuracy. We conclude that unjustified answer changes caused by patient demographics are a frequent phenomenon, which raises fairness concerns and should be paid more attention to.

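Summary
State-of-the-art question answering (QA) models exhibit a variety of social biases (for example, with respect to sex or race), usually attributed to similar problems in their training data. What has been overlooked so far is that in the critical domain of biomedicine, any unjustified change in model output caused by patient demographics is a problem, because it leads to unfair treatment of patients. Restricting ourselves to biomedical questions whose answers do not depend on ethnicity, sex, or sexual orientation, we pose two research questions: (RQ1) Do the answers of QA models change when they are given irrelevant demographic information? (RQ2) Does the answer to RQ1 differ between knowledge graph (KG)-grounded and text-based QA systems? We find that irrelevant demographic information changes up to 15% of the answers of a KG-grounded system and up to 23% of the answers of a text-based system, including changes that affect accuracy. We conclude that such unjustified answer changes are frequent, which raises fairness concerns and deserves more attention.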
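
The answer-change measurement behind RQ1 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `answer_change_rate`, the prefix-based way of injecting demographic information, and the toy model are all assumptions made for the example. It counts a question as "changed" if any demographic prefix alters the model's answer relative to the unmodified question.

```python
from typing import Callable, Sequence


def answer_change_rate(
    qa_model: Callable[[str], str],
    questions: Sequence[str],
    demographic_prefixes: Sequence[str],
) -> float:
    """Fraction of questions whose answer changes for at least one prefix.

    Assumption: demographic information is injected by prepending a
    sentence to the question; the paper may inject it differently.
    """
    if not questions:
        raise ValueError("questions must be non-empty")
    changed = 0
    for question in questions:
        baseline = qa_model(question)
        if any(
            qa_model(f"{prefix} {question}") != baseline
            for prefix in demographic_prefixes
        ):
            changed += 1
    return changed / len(questions)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a QA system, deliberately biased so that
    # one demographic phrase flips the answer to one question.
    if "aspirin" in prompt and "woman" in prompt:
        return "no"
    return "yes"


questions = ["Does aspirin reduce fever?", "Is insulin a hormone?"]
prefixes = ["The patient is a woman.", "The patient is a man."]
rate = answer_change_rate(toy_model, questions, prefixes)
print(rate)  # 0.5
```

Because the questions are chosen so that the correct answer does not depend on demographics, any non-zero rate under this kind of measurement indicates unjustified sensitivity.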

On Position Bias in Summarization with Large Language Models

paper_url: http://arxiv.org/abs/2310.10570
repo_url: None
paper_authors: Mathieu Ravaut, Shafiq Joty, Aixin Sun, Nancy F. Chen
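for: This study examines how language models make use of their input context, as observed in multi-document question answering, and how these models behave on summarization tasks.
methods: The study uses 10 datasets, 4 LLMs, and 5 evaluation metrics to analyze how the models use their input in abstractive summarization.
results: The models show a pronounced bias towards the introductory content (and, to a lesser extent, the final content) of the input, which poses challenges for LLM performance across a range of diverse summarization benchmarks.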

Abstract
Large language models (LLMs) excel in zero-shot abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, surpassing token limits of 32k or more. However, in the realm of multi-document question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern concerning where the answer is located within the input. This bias raises concerns, particularly in summarization tasks where crucial content may be dispersed throughout the source document(s). This paper presents a comprehensive investigation encompassing 10 datasets, 4 LLMs, and 5 evaluation metrics to analyze how these models leverage their input for abstractive summarization. Our findings reveal a pronounced bias towards the introductory content (and to a lesser extent, the final content), posing challenges for LLM performance across a range of diverse summarization benchmarks.

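Summary
Large language models (LLMs) excel at zero-shot abstractive summarization, producing fluent and relevant summaries. Recent advances allow them to handle long input contexts, surpassing token limits of 32k or more. In multi-document question answering, however, language models use their input context unevenly: they favor the initial and final segments, so performance follows a U-shaped pattern with respect to where the answer is located in the input. This bias is a concern, particularly in summarization, where important content may be spread throughout the source document(s). This paper presents a comprehensive investigation spanning 10 datasets, 4 LLMs, and 5 evaluation metrics to analyze how these models use their input for abstractive summarization. We find that LLMs are biased towards the introductory content (and, to a lesser extent, the final content), which affects their performance across a range of diverse summarization benchmarks.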

RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling

paper_url: http://arxiv.org/abs/2310.10567
repo_url: None
paper_authors: Jingcheng Deng, Liang Pang, Huawei Shen, Xueqi Cheng
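for: Improving the text generation quality of language models (LMs) and reducing hallucinations.
methods: RegaVAE is a retrieval-augmented language model built on the variational auto-encoder (VAE). It uses a latent space during retrieval and generation to capture information from both the current and the future text.
results: Significant improvements in text generation quality and hallucination removal on multiple datasets.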

Abstract
Retrieval-augmented language models show promise in addressing issues like outdated information and hallucinations in language models (LMs). However, current research faces two main problems: 1) determining what information to retrieve, and 2) effectively combining retrieved information during generation. We argue that valuable retrieved information should not only be related to the current source text but also consider the future target text, given the nature of LMs that model future tokens. Moreover, we propose that aggregation using latent variables derived from a compact latent space is more efficient than utilizing explicit raw text, which is limited by context length and susceptible to noise. Therefore, we introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE). It encodes the text corpus into a latent space, capturing current and future information from both source and target text. Additionally, we leverage the VAE to initialize the latent space and adopt the probabilistic form of the retrieval generation paradigm by expanding the Gaussian prior distribution into a Gaussian mixture distribution. Theoretical analysis provides an optimizable upper bound for RegaVAE. Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
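
The step of "expanding the Gaussian prior distribution into a Gaussian mixture distribution" can be written schematically as below. The notation is assumed for illustration and is not taken from the paper: $K$ components, mixture weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$ are hypothetical symbols, and how RegaVAE derives the components from retrieved items is not specified in the abstract. The left-hand side uses the common standard-normal VAE prior.

```latex
% Single-Gaussian VAE prior (common choice):
p(z) = \mathcal{N}(z;\, 0,\, I)

% Schematic Gaussian-mixture prior (illustrative notation):
p(z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z;\, \mu_k,\, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

With $K = 1$, $\mu_1 = 0$, and $\Sigma_1 = I$, the mixture reduces to the single-Gaussian prior, so the mixture form strictly generalizes it.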


ViPE: Visualise Pretty-much Everything

paper_url: http://arxiv.org/abs/2310.10543
repo_url: https://github.com/Hazel1994/ViPE-Videos
paper_authors: Hassan Shahmohammadi, Adhiraj Ghosh, Hendrik P. A. Lensch
for: This paper aims to address the issue of text-to-image models struggling to depict non-literal expressions, by introducing a new method called ViPE.
methods: ViPE uses a series of lightweight and robust language models trained on a large-scale set of lyrics with noisy visual descriptions generated by GPT3.5.
results: ViPE effectively expresses any arbitrary piece of text into a visualisable description, and exhibits an understanding of figurative expressions comparable to human experts. It also provides a powerful and open-source backbone for downstream applications such as music video and caption generation.

Abstract
Figurative and non-literal expressions are profoundly integrated in human communication. Visualising such expressions allows us to convey our creative thoughts, and evoke nuanced emotions. Recent text-to-image models like Stable Diffusion, on the other hand, struggle to depict non-literal expressions. Recent works primarily deal with this issue by compiling human-annotated datasets on a small scale, which not only demands specialised expertise but also proves highly inefficient. To address this issue, we introduce ViPE: Visualise Pretty-much Everything. ViPE offers a series of lightweight and robust language models that have been trained on a large-scale set of lyrics with noisy visual descriptions that represent their implicit meaning. The synthetic visual descriptions are generated by GPT3.5 relying on neither human annotations nor images. ViPE effectively expresses any arbitrary piece of text into a visualisable description, enabling meaningful and high-quality image generation. We provide compelling evidence that ViPE is more robust than GPT3.5 in synthesising visual elaborations. ViPE also exhibits an understanding of figurative expressions comparable to human experts, providing a powerful and open-source backbone to many downstream applications such as music video and caption generation.

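Summary
Figurative and non-literal expressions are deeply integrated into human communication. Visualising them helps us convey creative thoughts and evoke nuanced emotions. Existing text-to-image models such as Stable Diffusion, however, struggle to depict non-literal expressions. Existing work mainly addresses this by compiling human-annotated datasets on a small scale, which not only requires specialised expertise but also proves highly inefficient. To address this, we introduce ViPE: Visualise Pretty-much Everything. ViPE offers a series of lightweight and robust language models trained on a large-scale set of lyrics paired with noisy visual descriptions. These synthetic visual descriptions are generated by GPT3.5 and require neither human annotations nor images. ViPE can turn any piece of text into a visualisable description, enabling high-quality image generation. We provide compelling evidence that ViPE is more robust than GPT3.5 at synthesising visual elaborations. ViPE also shows an understanding of figurative expressions comparable to that of human experts, providing a powerful, open-source backbone for many downstream applications such as music video and caption generation.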

One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer

paper_url: http://arxiv.org/abs/2310.10532
repo_url: https://github.com/fdschmidt93/ofa-xlt
paper_authors: Fabian David Schmidt, Ivan Vulić, Goran Glavaš
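for: This paper examines the effectiveness of zero-shot cross-lingual transfer (ZS-XLT) and how to choose models and hyperparameters for it.
methods: The paper fine-tunes multilingual language models on source-language data and evaluates them on target languages. Specifically, the authors train with different hyperparameters, take model snapshots from the different runs, and accumulatively average them run by run to improve ZS-XLT performance.
results: Conventional model selection based on source-language validation quickly plateaus at suboptimal ZS-XLT performance. Accumulative run-by-run averaging, by contrast, improves ZS-XLT performance and correlates closely with “oracle” ZS-XLT.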

Abstract
Multilingual language models enable zero-shot cross-lingual transfer (ZS-XLT): fine-tuned on sizable source-language task data, they perform the task in target languages without labeled instances. The effectiveness of ZS-XLT hinges on the linguistic proximity between languages and the amount of pretraining data for a language. Because of this, model selection based on source-language validation is unreliable: it picks model snapshots with suboptimal target-language performance. As a remedy, some work optimizes ZS-XLT by extensively tuning hyperparameters: the follow-up work then routinely struggles to replicate the original results. Other work searches over narrower hyperparameter grids, reporting substantially lower performance. In this work, we therefore propose an unsupervised evaluation protocol for ZS-XLT that decouples performance maximization from hyperparameter tuning. As a robust and more transparent alternative to extensive hyperparameter tuning, we propose to accumulatively average snapshots from different runs into a single model. We run broad ZS-XLT experiments on both higher-level semantic tasks (NLI, extractive QA) and a lower-level token classification task (NER) and find that conventional model selection based on source-language validation quickly plateaus to suboptimal ZS-XLT performance. On the other hand, our accumulative run-by-run averaging of models trained with different hyperparameters boosts ZS-XLT performance and closely correlates with "oracle" ZS-XLT, i.e., model selection based on target-language validation performance.

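Summary
Multilingual language models enable zero-shot cross-lingual transfer (ZS-XLT): after fine-tuning on sizable source-language task data, they perform the task in target languages without labeled instances. The effectiveness of ZS-XLT depends on the linguistic proximity between languages and on the amount of pretraining data for a language. As a result, model selection based on source-language validation is unreliable: it picks model snapshots with suboptimal target-language performance. As a remedy, some work optimizes ZS-XLT through extensive hyperparameter tuning, but follow-up work often struggles to replicate the original results. Other work searches over narrower hyperparameter grids and reports substantially lower performance. We therefore propose an unsupervised evaluation protocol for ZS-XLT that decouples performance maximization from hyperparameter tuning. As a more robust and transparent alternative to extensive tuning, we propose accumulatively averaging snapshots from runs trained with different hyperparameters into a single model. We run broad ZS-XLT experiments on higher-level semantic tasks (NLI, extractive QA) and a lower-level token classification task (NER), and find that model selection based on source-language validation quickly plateaus at suboptimal ZS-XLT performance, whereas our accumulative run-by-run averaging boosts ZS-XLT performance and correlates closely with “oracle” ZS-XLT (model selection based on target-language validation performance).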
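
Accumulative run-by-run averaging can be sketched as a running mean over parameter snapshots. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are plain Python lists rather than tensors, each run contributes exactly one snapshot, and all snapshots share the same parameter names and shapes. The names `accumulate_average` and `average_runs` are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def accumulate_average(running: Params, new: Params, n_seen: int) -> Params:
    """Fold one more snapshot into a running mean over `n_seen` snapshots."""
    if running.keys() != new.keys():
        raise ValueError("snapshots must share the same parameter names")
    return {
        name: [
            (r * n_seen + x) / (n_seen + 1)
            for r, x in zip(running[name], new[name])
        ]
        for name in running
    }


def average_runs(snapshots: List[Params]) -> List[Params]:
    """Return the averaged model after each successive run is added."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    averaged = snapshots[0]
    history = [averaged]
    for n_seen, snapshot in enumerate(snapshots[1:], start=1):
        averaged = accumulate_average(averaged, snapshot, n_seen)
        history.append(averaged)
    return history


# Three toy "runs" trained with different (hypothetical) hyperparameters.
runs = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
history = average_runs(runs)
final = history[-1]
print(final)  # {'w': [3.0, 4.0], 'b': [1.0]}
```

Keeping the intermediate averages in `history` mirrors the run-by-run framing: each entry is the single model obtained after one more run has been folded in, so performance can be tracked as runs accumulate.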

Metric Ensembles For Hallucination Detection

paper_url: http://arxiv.org/abs/2310.10495
repo_url: https://github.com/parthk279/Hallucination-Research
paper_authors: Grant C. Forbes, Parth Katlana, Zeydy Ortiz
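for: This paper studies the problem of reducing “hallucinations” (information not present in the source document) in automatically generated summaries, and how to evaluate it.
methods: The paper examines a suite of unsupervised metrics for summary consistency and analyzes their correlations with each other and with human evaluation scores.
results: LLM-based methods detect hallucinations in summaries better than other unsupervised metrics, and ensemble methods can improve these scores further, provided that the metrics in the ensemble have sufficiently similar and uncorrelated error rates.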

Abstract
Abstractive text summarization has garnered increased interest as of late, in part due to the proliferation of large language models (LLMs). One of the most pressing problems related to generation of abstractive summaries is the need to reduce "hallucinations," information that was not included in the document being summarized, and which may be wholly incorrect. Due to this need, a wide array of metrics estimating consistency with the text being summarized have been proposed. We examine in particular a suite of unsupervised metrics for summary consistency, and measure their correlations with each other and with human evaluation scores in the wiki_bio_gpt3_hallucination dataset. We then compare these evaluations to models made from a simple linear ensemble of these metrics. We find that LLM-based methods outperform other unsupervised metrics for hallucination detection. We also find that ensemble methods can improve these scores even further, provided that the metrics in the ensemble have sufficiently similar and uncorrelated error rates. Finally, we present an ensemble method for LLM-based evaluations that we show improves over this previous SOTA.

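Summary
Abstractive summarization has attracted growing attention recently, in part because of the spread of large language models (LLMs). One of the most pressing problems in generating abstractive summaries is reducing “hallucinations”: information that was not included in the document being summarized and that may be wholly incorrect. Because of this need, a wide range of metrics estimating a summary's consistency with the source text have been proposed. We examine a suite of unsupervised metrics for summary consistency and measure their correlations with each other and with human evaluation scores on the wiki_bio_gpt3_hallucination dataset. We then compare these evaluations with models built from a simple linear ensemble of the metrics. We find that LLM-based methods outperform other unsupervised metrics for hallucination detection, and that ensemble methods can improve these scores further, provided the metrics in the ensemble have sufficiently similar and uncorrelated error rates. Finally, we present an ensemble method for LLM-based evaluations that improves over the previous SOTA.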
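
A simple linear ensemble of metrics, and the reason uncorrelated errors help, can be illustrated with a toy example. This is a sketch under stated assumptions: the weights are fixed by hand rather than fitted, the scores are invented for illustration, and `linear_ensemble` and `mean_squared_error` are hypothetical helper names. The two toy metrics have equal error magnitudes but err on different summaries, so averaging them halves each individual error.

```python
from typing import List, Sequence


def linear_ensemble(
    metric_scores: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> List[float]:
    """Combine metrics linearly; metric_scores[i][j] is metric i on summary j."""
    if len(metric_scores) != len(weights):
        raise ValueError("need exactly one weight per metric")
    n_items = len(metric_scores[0])
    if any(len(scores) != n_items for scores in metric_scores):
        raise ValueError("all metrics must score the same summaries")
    return [
        bias + sum(w * scores[j] for w, scores in zip(weights, metric_scores))
        for j in range(n_items)
    ]


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Invented toy data: human consistency labels and two imperfect metrics
# whose errors fall on different summaries (uncorrelated errors).
human = [0.0, 1.0, 0.0, 1.0]
metric_a = [0.2, 1.0, 0.0, 0.8]
metric_b = [0.0, 0.8, 0.2, 1.0]

ensemble = linear_ensemble([metric_a, metric_b], weights=[0.5, 0.5])
mse_a = mean_squared_error(metric_a, human)
mse_b = mean_squared_error(metric_b, human)
mse_ensemble = mean_squared_error(ensemble, human)
print(mse_a, mse_b, mse_ensemble)
```

In this toy setting the ensemble's squared error is lower than that of either metric alone. If both metrics made the same mistakes on the same summaries, averaging would leave the error unchanged, which is one way to read the abstract's condition on uncorrelated error rates.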

UNO-DST: Leveraging Unlabelled Data in Zero-Shot Dialogue State Tracking

paper_url: http://arxiv.org/abs/2310.10492
repo_url: https://github.com/lichuangnus/uno-dst
paper_authors: Chuang Li, Yan Zhang, Min-Yen Kan, Haizhou Li
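for: This paper proposes a method that turns zero-shot dialogue state tracking (DST) into few-shot DST by using unlabelled data in the target domain, which also enables automatic label creation.
methods: The method uses auxiliary tasks that generate slot types as inverse prompts for the main tasks, and exploits unlabelled data through joint and self-training to improve the training and fine-tuning of DST models.
results: On MultiWOZ, the method improves average joint goal accuracy by 8% across all domains, showing that it improves DST models in zero-shot scenarios.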

Abstract
Previous zero-shot dialogue state tracking (DST) methods only apply transfer learning, but ignore unlabelled data in the target domain. We transform zero-shot DST into few-shot DST by utilising such unlabelled data via joint and self-training methods. Our method incorporates auxiliary tasks that generate slot types as inverse prompts for main tasks, creating slot values during joint training. Cycle consistency between these two tasks enables the generation and selection of quality samples in unknown target domains for subsequent fine-tuning. This approach also facilitates automatic label creation, thereby optimizing the training and fine-tuning of DST models. We demonstrate this method's effectiveness on large language models in zero-shot scenarios, improving average joint goal accuracy by $8\%$ across all domains in MultiWOZ.
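
One plausible reading of the cycle-consistency selection step is sketched below. It is an assumption-laden illustration, not the authors' code: `predict_value` stands in for the main task (dialogue and slot type to slot value), `predict_type` for the auxiliary task (dialogue and slot value to slot type), and the slot names and toy predictors are invented. A pseudo-labelled sample is kept only when the auxiliary prediction recovers the slot type that produced the value.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[str, str, str]  # (dialogue, slot_type, slot_value)


def select_cycle_consistent(
    dialogues: Sequence[str],
    slot_types: Sequence[str],
    predict_value: Callable[[str, str], Optional[str]],
    predict_type: Callable[[str, str], Optional[str]],
) -> List[Sample]:
    """Keep pseudo-labels whose slot type is recovered by the inverse task."""
    selected: List[Sample] = []
    for dialogue in dialogues:
        for slot_type in slot_types:
            value = predict_value(dialogue, slot_type)
            if value is None:
                continue  # the main task found no value for this slot
            if predict_type(dialogue, value) == slot_type:
                selected.append((dialogue, slot_type, value))
    return selected


def toy_predict_value(dialogue: str, slot_type: str) -> Optional[str]:
    # Hypothetical main-task model that reads values off keyword cues.
    table = {
        "hotel-area": "north" if "north" in dialogue else None,
        "hotel-stars": "4" if "4-star" in dialogue else None,
    }
    return table.get(slot_type)


def toy_predict_type(dialogue: str, value: str) -> Optional[str]:
    # Hypothetical auxiliary model, deliberately wrong about "4" so that
    # the inconsistent sample is filtered out.
    return {"north": "hotel-area", "4": "hotel-price"}.get(value)


dialogues = ["I need a 4-star hotel in the north."]
slot_types = ["hotel-area", "hotel-stars"]
pseudo_labels = select_cycle_consistent(
    dialogues, slot_types, toy_predict_value, toy_predict_type
)
print(pseudo_labels)
# [('I need a 4-star hotel in the north.', 'hotel-area', 'north')]
```

The retained samples would then serve as automatically created labels for fine-tuning in the target domain.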


xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection

paper_url: http://arxiv.org/abs/2310.10482
repo_url: None
paper_authors: Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, André F. T. Martins
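for: This paper aims to bridge sentence-level evaluation and error span detection, providing a more fine-grained approach to machine translation evaluation.
methods: The paper introduces xCOMET, an open-source learned metric that performs both sentence-level evaluation and error span detection.
results: xCOMET achieves state-of-the-art performance across all types of evaluation, and it highlights and categorizes error spans, enriching the quality assessment.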

Abstract
Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what are the errors and what is their severity). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies to evaluation, attempting to detail and categorize translation errors. In this work, we introduce xCOMET, an open-source learned metric designed to bridge the gap between these approaches. xCOMET integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.

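Summary
Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, assess a translation hypothesis with a single sentence-level score and therefore offer little insight into translation errors (for example, their type and severity). Generative large language models (LLMs), meanwhile, are driving the adoption of more fine-grained evaluation strategies that try to detail and categorize translation errors. In this work we introduce xCOMET, an open-source learned metric designed to bridge the gap between these approaches. xCOMET integrates sentence-level evaluation and error span detection, and shows state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). It also highlights and categorizes error spans, enriching the quality assessment. We further provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.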

G-SPEED: General SParse Efficient Editing MoDel

paper_url: http://arxiv.org/abs/2310.10480
repo_url: https://github.com/banner-z/g-speed
paper_authors: Haoke Zhang, Yue Wang, Juntao Li, Xiabing Zhou, Min Zhang
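for: Improving working efficiency on editing-oriented tasks, where a model must understand human-issued instructions and output the expected content.
methods: The paper proposes a new unsupervised text editing data clustering algorithm to deal with data scarcity, and a sparse editing model architecture to mitigate the limited learning capabilities of small language models.
results: G-SPEED, with 508M parameters, can surpass LLMs equipped with 175B parameters while fulfilling diverse editing requirements with a single model.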

Abstract
Large Language Models (LLMs) have demonstrated incredible capabilities in understanding, generating, and manipulating languages. Through human-model interactions, LLMs can automatically understand human-issued instructions and output the expected contents, which can significantly increase working efficiency. In various types of real-world demands, editing-oriented tasks account for a considerable proportion, which involves an interactive process that entails the continuous refinement of existing texts to meet specific criteria. Due to the need for multi-round human-model interaction and the generation of complicated editing tasks, there is an emergent need for efficient general editing models. In this paper, we propose General SParse Efficient Editing MoDel (G-SPEED), which can fulfill diverse editing requirements through a single model while maintaining low computational costs. Specifically, we first propose a novel unsupervised text editing data clustering algorithm to deal with the data scarcity problem. Subsequently, we introduce a sparse editing model architecture to mitigate the inherently limited learning capabilities of small language models. The experimental outcomes indicate that G-SPEED, with its 508M parameters, can surpass LLMs equipped with 175B parameters. Our code and model checkpoints are available at https://github.com/Banner-Z/G-SPEED.

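Summary
Large Language Models (LLMs) have shown remarkable capabilities in understanding, generating, and manipulating language. Through human-model interaction, LLMs can automatically understand human-issued instructions and output the expected content, which can significantly increase working efficiency. Among real-world demands, editing-oriented tasks account for a considerable share; they involve an interactive process of continuously refining existing text to meet specific criteria. Because such tasks require multi-round human-model interaction and produce complicated editing requests, there is a pressing need for efficient general-purpose editing models. In this paper we propose the General SParse Efficient Editing MoDel (G-SPEED), which fulfills diverse editing requirements with a single model while keeping computational costs low. Specifically, we first propose a new unsupervised text editing data clustering algorithm to address data scarcity. We then introduce a sparse editing model architecture to mitigate the inherently limited learning capabilities of small language models. Experimental results show that G-SPEED, with 508M parameters, can surpass LLMs equipped with 175B parameters. Our code and model checkpoints are available at https://github.com/Banner-Z/G-SPEED.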

MechGPT, a language-based strategy for mechanics and materials modeling that connects knowledge across scales, disciplines and modalities

paper_url: http://arxiv.org/abs/2310.10445
repo_url: None
paper_authors: Markus J. Buehler
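for: This study explores using artificial intelligence to connect knowledge across different fields, applied here to multiscale materials failure.
methods: The study fine-tunes a large language model (LLM): a general-purpose LLM distills question-answer pairs from raw sources, which are then used for fine-tuning. It also uses Ontological Knowledge Graphs to extract structural insights, and runs a series of computational experiments with models of different sizes and context lengths.
results: LLMs can extract structural insights about multiscale materials failure, providing frameworks for new research questions and interpretable graph representations of knowledge. Three versions of MechGPT are discussed, with different parameter counts and context lengths, which leaves room for sophisticated retrieval-augmented strategies and multimodality.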

Abstract
For centuries, researchers have sought out ways to connect disparate areas of knowledge. While early scholars (Galileo, da Vinci, etc.) were experts across fields, specialization has taken hold later. With the advent of Artificial Intelligence, we can now explore relationships across areas (e.g., mechanics-biology) or disparate domains (e.g., failure mechanics-art). To achieve this, we use a fine-tuned Large Language Model (LLM), here for a subset of knowledge in multiscale materials failure. The approach includes the use of a general-purpose LLM to distill question-answer pairs from raw sources followed by LLM fine-tuning. The resulting MechGPT LLM foundation model is used in a series of computational experiments to explore its capacity for knowledge retrieval, various language tasks, hypothesis generation, and connecting knowledge across disparate areas. While the model has some ability to recall knowledge from training, we find that LLMs are particularly useful to extract structural insights through Ontological Knowledge Graphs. These interpretable graph structures provide explanatory insights, frameworks for new research questions, and visual representations of knowledge that also can be used in retrieval-augmented generation. Three versions of MechGPT are discussed, featuring different sizes from 13 billion to 70 billion parameters, and reaching context lengths of more than 10,000 tokens. This provides ample capacity for sophisticated retrieval augmented strategies, as well as agent-based modeling where multiple LLMs interact collaboratively and/or adversarially, the incorporation of new data from the literature or web searches, as well as multimodality.

Summary
Traditionally, researchers have sought to connect diverse areas of knowledge. While early scholars (such as Galileo and da Vinci) were experts across multiple fields, specialization has become more prevalent in recent times. With the advent of Artificial Intelligence, we can now explore relationships between different areas (such as mechanics and biology) or disparate domains (such as failure mechanics and art). To achieve this, we use a fine-tuned Large Language Model (LLM), specifically for a subset of knowledge in multiscale materials failure. The approach involves using a general-purpose LLM to distill question-answer pairs from raw sources, followed by LLM fine-tuning. The resulting MechGPT LLM foundation model is then used in a series of computational experiments to explore its capacity for knowledge retrieval, various language tasks, hypothesis generation, and connecting knowledge across disparate areas. While the model has some ability to recall knowledge from training, we find that LLMs are particularly useful for extracting structural insights through Ontological Knowledge Graphs. These interpretable graph structures provide explanatory insights, frameworks for new research questions, and visual representations of knowledge that can also be used in retrieval-augmented generation. Three versions of MechGPT are discussed, featuring different sizes ranging from 13 billion to 70 billion parameters, and reaching context lengths of more than 10,000 tokens. This provides ample capacity for sophisticated retrieval-augmented strategies, as well as agent-based modeling where multiple LLMs interact collaboratively and/or adversarially, the incorporation of new data from the literature or web searches, as well as multimodality.

Exploiting User Comments for Early Detection of Fake News Prior to Users’ Commenting

paper_url: http://arxiv.org/abs/2310.10429
repo_url: None
paper_authors: Qiong Nan, Qiang Sheng, Juan Cao, Yongchun Zhu, Danding Wang, Guang Yang, Jintao Li, Kai Shu
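for: This paper examines the trade-off between accuracy and timeliness in existing fake news detection methods, and proposes a feasible but not well-studied solution: training a model with the social context (such as comments) of historical news and applying it to newly emerging news.
methods: The paper proposes Comments Assisted Fake News Detection (CAS-FEND), which uses comments on historical news to help a content-only detection model improve its accuracy. During training, useful knowledge is transferred from a comments-aware teacher model to a content-only student model, which is then used to detect newly emerging fake news.
results: Experiments show that the CAS-FEND student model outperforms all content-only methods and even methods that take 1/4 of the comments as input, demonstrating its advantage for early detection.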

Abstract
Both accuracy and timeliness are key factors in detecting fake news on social media. However, most existing methods encounter an accuracy-timeliness dilemma: Content-only methods guarantee timeliness but perform moderately because of limited available information, while social context-based ones generally perform better but inevitably lead to latency because of social context accumulation needs. To break such a dilemma, a feasible but not well-studied solution is to leverage social contexts (e.g., comments) from historical news for training a detection model and apply it to newly emerging news without social contexts. This requires the model to (1) sufficiently learn helpful knowledge from social contexts, and (2) be well compatible with situations that social contexts are available or not. To achieve this goal, we propose to absorb and parameterize useful knowledge from comments in historical news and then inject it into a content-only detection model. Specifically, we design the Comments Assisted Fake News Detection method (CAS-FEND), which transfers useful knowledge from a comments-aware teacher model to a content-only student model during training. The student model is further used to detect newly emerging fake news. Experiments show that the CAS-FEND student model outperforms all content-only methods and even those with 1/4 comments as inputs, demonstrating its superiority for early detection.

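Summary
Accuracy and timeliness are both key factors in detecting fake news on social media. However, most existing methods face an accuracy-timeliness dilemma: content-only methods guarantee timeliness but perform only moderately, while methods based on social context usually achieve higher accuracy but need time to accumulate that context. To break this dilemma, social context (for example, comments) from historical news can be used to train a detection model that is then applied to newly emerging news. This requires the model to (1) learn enough useful knowledge from social context and (2) work well whether or not social context is available. To this end, we propose injecting knowledge from comments into a content-only model, in a method we call Comments Assisted Fake News Detection (CAS-FEND). During training, knowledge is transferred from a comments-aware teacher model to a content-only student model, and the student model is then used to detect newly emerging fake news. Experiments show that the CAS-FEND student model performs well at detecting newly emerging fake news, even outperforming methods that take 1/4 of the comments as input, which demonstrates its advantage for early detection.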

$\textit{Swap and Predict}$ – Predicting the Semantic Changes in Words across Corpora by Context Swapping

paper_url: http://arxiv.org/abs/2310.10397
repo_url: https://github.com/a1da4/svp-swap
paper_authors: Taichi Aida, Danushka Bollegala
for: Detecting semantic changes of words across different text corpora.
methods: The proposed method, Swapping-based Semantic Change Detection (SSCD), uses random context swapping to compare the meaning of a target word in two different text corpora.
results: The method accurately predicts semantic changes of words in four languages (English, German, Swedish, and Latin) and across different time spans (over 50 years and about five years), and achieves significant performance improvements over strong baselines on the English semantic change prediction task.

Abstract
Meanings of words change over time and across domains. Detecting the semantic changes of words is an important task for various NLP applications that must make time-sensitive predictions. We consider the problem of predicting whether a given target word, $w$, changes its meaning between two different text corpora, $\mathcal{C}_1$ and $\mathcal{C}_2$. For this purpose, we propose $\textit{Swapping-based Semantic Change Detection}$ (SSCD), an unsupervised method that randomly swaps contexts between $\mathcal{C}_1$ and $\mathcal{C}_2$ where $w$ occurs. We then look at the distribution of contextualised word embeddings of $w$, obtained from a pretrained masked language model (MLM), representing the meaning of $w$ in its occurrence contexts in $\mathcal{C}_1$ and $\mathcal{C}_2$. Intuitively, if the meaning of $w$ does not change between $\mathcal{C}_1$ and $\mathcal{C}_2$, we would expect the distributions of contextualised word embeddings of $w$ to remain the same before and after this random swapping process. Despite its simplicity, we demonstrate that even by using pretrained MLMs without any fine-tuning, our proposed context swapping method accurately predicts the semantic changes of words in four languages (English, German, Swedish, and Latin) and across different time spans (over 50 years and about five years). Moreover, our method achieves significant performance improvements compared to strong baselines for the English semantic change prediction task. Source code is available at https://github.com/a1da4/svp-swap .

Summary
The meanings of words change over time and across domains. Detecting the semantic changes of words is an important task for NLP applications that must make time-sensitive predictions. We consider the problem of predicting whether a given target word, $w$, changes its meaning between two different text corpora, $\mathcal{C}_1$ and $\mathcal{C}_2$. To this end, we propose $\textit{Swapping-based Semantic Change Detection}$ (SSCD), an unsupervised method that randomly swaps contexts between $\mathcal{C}_1$ and $\mathcal{C}_2$ where $w$ occurs. We then examine the distribution of contextualised word embeddings of $w$, obtained from a pretrained masked language model (MLM), which represent the meaning of $w$ in its occurrence contexts in $\mathcal{C}_1$ and $\mathcal{C}_2$. If the meaning of $w$ does not change between $\mathcal{C}_1$ and $\mathcal{C}_2$, we expect these distributions to remain the same before and after the random swapping. Despite its simplicity, the method, using pretrained MLMs without any fine-tuning, accurately predicts the semantic changes of words in English, German, Swedish, and Latin and across different time spans (over 50 years and about five years). It also achieves significant performance improvements on the English semantic change prediction task. The code is available at https://github.com/a1da4/svp-swap.
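The swapping intuition in the abstract can be sketched as follows. This is a rough illustration, not the authors' implementation: it assumes the contextualised embeddings of $w$ have already been computed for each corpus (so swapping contexts amounts to swapping rows between two embedding matrices), and the distance function, `swap_rate`, and `n_trials` are hypothetical choices.

```python
import numpy as np

def set_distance(a, b):
    """Cosine distance between the mean vectors of two embedding sets
    (an assumed, simple way to compare the two distributions)."""
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - float(ma @ mb / (np.linalg.norm(ma) * np.linalg.norm(mb)))

def swap_score(emb1, emb2, swap_rate=0.4, n_trials=20, seed=0):
    """Average change in set distance caused by randomly swapping
    a fraction of embeddings between the two corpora."""
    rng = np.random.default_rng(seed)
    k = int(swap_rate * min(len(emb1), len(emb2)))
    before = set_distance(emb1, emb2)
    diffs = []
    for _ in range(n_trials):
        i = rng.choice(len(emb1), size=k, replace=False)
        j = rng.choice(len(emb2), size=k, replace=False)
        s1, s2 = emb1.copy(), emb2.copy()
        s1[i], s2[j] = emb2[j], emb1[i]  # swap k contexts between corpora
        diffs.append(abs(before - set_distance(s1, s2)))
    return float(np.mean(diffs))

# Synthetic stand-ins for MLM embeddings of one target word.
rng = np.random.default_rng(1)
stable_a = rng.normal(loc=1.0, size=(200, 16))
stable_b = rng.normal(loc=1.0, size=(200, 16))
shifted_mean = np.array([1.0] * 8 + [-1.0] * 8)
changed_b = rng.normal(loc=shifted_mean, size=(200, 16))
print(swap_score(stable_a, stable_b), swap_score(stable_a, changed_b))
```

A word whose meaning is stable should be barely affected by swapping, so its score stays near zero, while a word whose usage has shifted yields a clearly larger score.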

Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance

paper_url: http://arxiv.org/abs/2310.10385
repo_url: https://github.com/Smu-Tan/ZS-NMT-Variations
paper_authors: Shaomu Tan, Christof Monz
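for: This paper investigates why multilingual neural machine translation (MNMT) shows high variation in zero-shot (ZS) translation quality.
methods: The paper runs systematic experiments covering 1,560 translation directions across 40 languages. The analysis identifies three key factors behind the high variation in zero-shot NMT performance: 1) target-side translation capability, 2) vocabulary overlap, and 3) linguistic properties.
results: Target-side translation quality is the most influential factor for zero-shot NMT performance, and vocabulary overlap consistently affects translation quality. Linguistic properties such as language family and writing system also play a role, particularly for smaller models. The authors further find that zero-shot translation challenges extend beyond the off-target problem.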

Abstract
Multilingual Neural Machine Translation (MNMT) facilitates knowledge sharing but often suffers from poor zero-shot (ZS) translation qualities. While prior work has explored the causes of overall low ZS performance, our work introduces a fresh perspective: the presence of high variations in ZS performance. This suggests that MNMT does not uniformly exhibit poor ZS capability; instead, certain translation directions yield reasonable results. Through systematic experimentation involving 1,560 language directions spanning 40 languages, we identify three key factors contributing to high variations in ZS NMT performance: 1) target side translation capability 2) vocabulary overlap 3) linguistic properties. Our findings highlight that the target side translation quality is the most influential factor, with vocabulary overlap consistently impacting ZS performance. Additionally, linguistic properties, such as language family and writing system, play a role, particularly with smaller models. Furthermore, we suggest that the off-target issue is a symptom of inadequate ZS performance, emphasizing that zero-shot translation challenges extend beyond addressing the off-target problem. We release the data and models serving as a benchmark to study zero-shot for future research at https://github.com/Smu-Tan/ZS-NMT-Variations

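Summary
Multilingual neural machine translation (MNMT) facilitates knowledge sharing but often suffers from poor zero-shot (ZS) translation quality. While prior work has explored the causes of overall low ZS performance, our work introduces a new perspective: the high variation in ZS performance across translation directions. This indicates that MNMT does not uniformly exhibit poor ZS capability; instead, certain translation directions yield reasonable results. Through systematic experiments on 1,560 language directions spanning 40 languages, we identify three key factors contributing to the high variation in ZS NMT performance: 1) target-side translation capability, 2) vocabulary overlap, and 3) linguistic properties. Our findings show that target-side translation quality is the most influential factor and that vocabulary overlap consistently affects ZS performance. Linguistic properties such as language family and writing system also play a role, particularly with smaller models. We further suggest that the off-target issue is a symptom of inadequate ZS performance, emphasizing that zero-shot translation challenges go beyond solving the off-target problem. We release the data and models for future research on zero-shot translation at https://github.com/Smu-Tan/ZS-NMT-Variations.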
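As a concrete picture of the "vocabulary overlap" factor, one simple way to quantify it is the Jaccard overlap between the subword vocabularies observed for two languages. This is only an illustrative assumption; the abstract does not say how the authors measure overlap, and the function name and toy token lists below are hypothetical.

```python
def vocabulary_overlap(src_tokens, tgt_tokens):
    """Jaccard overlap between the token types of two languages."""
    src_vocab, tgt_vocab = set(src_tokens), set(tgt_tokens)
    union = src_vocab | tgt_vocab
    return len(src_vocab & tgt_vocab) / len(union) if union else 0.0

# Toy subword inventories for two related languages (made-up data).
lang_a = ["▁de", "▁la", "tion", "▁en", "ment"]
lang_b = ["▁de", "▁el", "tion", "▁en", "dad"]
overlap = vocabulary_overlap(lang_a, lang_b)
print(overlap)
```

Under this measure, language pairs sharing a script and many subwords score closer to 1, and pairs with disjoint scripts score near 0.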

Privacy in Large Language Models: Attacks, Defenses and Future Directions

paper_url: http://arxiv.org/abs/2310.10383
repo_url: None
paper_authors: Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song
for: This paper aims to provide a comprehensive analysis of privacy attacks targeting large language models (LLMs) and to identify potential vulnerabilities in these models.
methods: The paper uses a categorization of privacy attacks based on the adversary’s assumed capabilities to shed light on the potential vulnerabilities present in LLMs. It also presents a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks.
results: The paper identifies upcoming privacy concerns as LLMs evolve and points out several potential avenues for future exploration.

Abstract
The advancement of large language models (LLMs) has significantly enhanced the ability to effectively tackle various downstream NLP tasks and unify these tasks into generative pipelines. On the one hand, powerful language models, trained on massive textual data, have brought unparalleled accessibility and usability for both models and users. On the other hand, unrestricted access to these models can also introduce potential malicious and unintentional privacy risks. Despite ongoing efforts to address the safety and privacy concerns associated with LLMs, the problem remains unresolved. In this paper, we provide a comprehensive analysis of the current privacy attacks targeting LLMs and categorize them according to the adversary's assumed capabilities to shed light on the potential vulnerabilities present in LLMs. Then, we present a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks. Beyond existing works, we identify upcoming privacy concerns as LLMs evolve. Lastly, we point out several potential avenues for future exploration.

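Summary
The advancement of large language models (LLMs) has significantly improved the ability to tackle various downstream NLP tasks and to unify these tasks into generative pipelines. On the one hand, powerful language models trained on massive textual data bring unprecedented accessibility and usability for both models and users. On the other hand, unrestricted access to these models can introduce malicious and unintentional privacy risks. Despite ongoing efforts to address the safety and privacy concerns of LLMs, the problem remains unresolved. This paper provides a comprehensive analysis of privacy attacks on LLMs, categorizing them by the adversary's assumed capabilities to shed light on potential vulnerabilities in LLMs. We then give a detailed overview of defense strategies developed to counter these attacks. We also identify upcoming privacy concerns as LLMs evolve. Finally, we point out several potential avenues for future exploration.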

Contextual Data Augmentation for Task-Oriented Dialog Systems

paper_url: http://arxiv.org/abs/2310.10380
repo_url: None
paper_authors: Dustin Axman, Avik Ray, Shubham Garg, Jing Huang
for: Improving the training of current task-oriented dialog systems through data augmentation.
methods: A dialog augmentation model generates user turns conditioned on the dialog context, using a new prompt design and output re-ranking to produce dialogs.
results: On several benchmark datasets, the dialog augmentation model generates high-quality dialogs and improves dialog success rate by as much as $8\%$.

Abstract
Collection of annotated dialogs for training task-oriented dialog systems have been one of the key bottlenecks in improving current models. While dialog response generation has been widely studied on the agent side, it is not evident if similar generative models can be used to generate a large variety of, and often unexpected, user inputs that real dialog systems encounter in practice. Existing data augmentation techniques such as paraphrase generation do not take the dialog context into consideration. In this paper, we develop a novel dialog augmentation model that generates a user turn, conditioning on full dialog context. Additionally, with a new prompt design for language model, and output re-ranking, the dialogs generated from our model can be directly used to train downstream dialog systems. On common benchmark datasets MultiWoZ and SGD, we show that our dialog augmentation model generates high quality dialogs and improves dialog success rate by as much as $8\%$ over baseline.

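Summary
Collecting annotated dialogs for training task-oriented dialog systems has been one of the key bottlenecks in improving current models. Although dialog response generation has been widely studied, it is unclear whether similar generative models can produce the varied and often unexpected user inputs that real dialog systems encounter. Existing data augmentation techniques, such as paraphrase generation, do not take the dialog context into account. In this paper, we develop a dialog augmentation model that generates a user turn conditioned on the dialog context; with a new language-model prompt design and output re-ranking, the generated dialogs can be used directly to train downstream dialog systems. On common datasets such as MultiWoZ and SGD, we show that our dialog augmentation model generates high-quality dialogs and improves dialog success rate by as much as $8\%$.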

Legal NLP Meets MiCAR: Advancing the Analysis of Crypto White Papers

paper_url: http://arxiv.org/abs/2310.10333
repo_url: None
paper_authors: Carolina Camassa
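for: This paper examines the impact of the European Union's Markets in Crypto-Assets Regulation (MiCAR) on crypto asset white papers and the application of textual analysis in this area.
methods: The paper surveys the use of natural language processing (NLP) to analyze crypto asset white papers and explores how NLP can be integrated under MiCAR.
results: The paper uncovers a research gap in applications of textual analysis to crypto asset white papers and analyzes the changes introduced by MiCAR, pointing to research directions of potential benefit to regulators, investors, and crypto asset issuers.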

Abstract
In the rapidly evolving field of crypto assets, white papers are essential documents for investor guidance, and are now subject to unprecedented content requirements under the European Union's Markets in Crypto-Assets Regulation (MiCAR). Natural Language Processing (NLP) can serve as a powerful tool for both analyzing these documents and assisting in regulatory compliance. This paper delivers two contributions to the topic. First, we survey existing applications of textual analysis to unregulated crypto asset white papers, uncovering a research gap that could be bridged with interdisciplinary collaboration. We then conduct an analysis of the changes introduced by MiCAR, highlighting the opportunities and challenges of integrating NLP within the new regulatory framework. The findings set the stage for further research, with the potential to benefit regulators, crypto asset issuers, and investors.

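Summary
In the rapidly evolving field of crypto assets, white papers are essential documents for investor guidance and now face unprecedented content requirements under the European Union's Markets in Crypto-Assets Regulation (MiCAR). Natural language processing (NLP) can serve as a powerful tool for analyzing these documents and assisting with regulatory compliance. This paper makes two contributions to the topic. First, we survey existing applications of textual analysis to unregulated crypto asset white papers, uncovering a research gap that could be bridged through interdisciplinary collaboration. We then analyze the changes introduced by MiCAR, highlighting the opportunities and challenges of integrating NLP within the new regulatory framework. These findings could benefit regulators, crypto asset issuers, and investors.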

Optimized Tokenization for Transcribed Error Correction

paper_url: http://arxiv.org/abs/2310.10704
repo_url: None
paper_authors: Tomer Wullach, Shlomo E. Chazan
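for: Improving the accuracy and reliability of speech recognition systems.
methods: Generating synthetic training data from the error distribution of transcribed data and applying language-specific adjustments to the tokenizer vocabulary.
results: Synthetic data generated from the error distribution, together with language-specific vocabulary adjustments, improves the performance of correction models, and the approach applies across multiple languages and speech recognition systems.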

Abstract
The challenges facing speech recognition systems, such as variations in pronunciations, adverse audio conditions, and the scarcity of labeled data, emphasize the necessity for a post-processing step that corrects recurring errors. Previous research has shown the advantages of employing dedicated error correction models, yet training such models requires large amounts of labeled data which is not easily obtained. To overcome this limitation, synthetic transcribed-like data is often utilized, however, bridging the distribution gap between transcribed errors and synthetic noise is not trivial. In this paper, we demonstrate that the performance of correction models can be significantly increased by training solely using synthetic data. Specifically, we empirically show that: (1) synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations; (2) applying language-specific adjustments to the vocabulary of a BPE tokenizer strike a balance between adapting to unseen distributions and retaining knowledge of transcribed errors. We showcase the benefits of these key observations, and evaluate our approach using multiple languages, speech recognition systems and prominent speech recognition datasets.

Summary
Speech recognition systems face many challenges, such as differences in pronunciation, poor audio quality, and a lack of labeled data. To address these challenges, researchers have found that using dedicated error correction models can be effective, but these models require large amounts of labeled data, which is not easily obtained. To overcome this limitation, synthetic transcribed-like data is often used, but it can be difficult to bridge the gap between the distribution of transcribed errors and the synthetic noise. In this paper, we show that the performance of correction models can be significantly improved by training solely using synthetic data. Specifically, we find that: (1) synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations; (2) applying language-specific adjustments to the vocabulary of a BPE tokenizer can strike a balance between adapting to unseen distributions and retaining knowledge of transcribed errors. We demonstrate the benefits of these key observations using multiple languages, speech recognition systems, and prominent speech recognition datasets.

Untying the Reversal Curse via Bidirectional Language Model Editing

paper_url: http://arxiv.org/abs/2310.10322
repo_url: https://github.com/mjy1111/BAKE
paper_authors: Jun-Yu Ma, Jia-Chen Gu, Zhen-Hua Ling, Quan Liu, Cong Liu
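for: This study examines bidirectional language model editing, aiming to evaluate whether edited models can recall edited knowledge in the reverse direction.
methods: The study proposes a benchmark, Bidirectional Assessment for Knowledge Editing (BAKE), to evaluate the reversibility of edited models. It also proposes a method named Bidirectionally Inversible Relationship moDeling (BIRD) to mitigate the reversal curse.
results: Experiments show that BIRD, by updating model parameters, improves the performance of four LLMs of different sizes on question answering and judgement tasks.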

Abstract
Recent studies have demonstrated that large language models (LLMs) store massive factual knowledge within their parameters. But existing LLMs are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been a growing interest in the concept of model editing. Despite the emergence of benchmarks and approaches, these unidirectional editing and evaluation have failed to explore the reversal curse. Intuitively, if "The capital of France is" is edited to be a counterfact "London" within a model, then it should be able to naturally reason and recall the reverse fact, i.e., "London is the capital of" followed by "France" instead of "England". In this paper, we study bidirectional language model editing, aiming to provide rigorous model editing evaluation to assess if edited LLMs can recall the editing knowledge bidirectionally. A new evaluation metric of reversibility is introduced, and a benchmark dubbed as Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate the reversibility of edited models in recalling knowledge in the reverse direction of editing. We surprisingly observe that while current editing methods and LLMs can effectively recall editing facts in the direction of editing, they suffer serious deficiencies when evaluated in the reverse direction. To mitigate the reversal curse, a method named Bidirectionally Inversible Relationship moDeling (BIRD) is proposed. A set of editing objectives that incorporate bidirectional relationships between subject and object into the updated model weights are designed. Experiments show that BIRD improves the performance of four representative LLMs of different sizes via question answering and judgement.

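Summary
Recent studies have shown that large language models (LLMs) store massive factual knowledge. However, existing LLMs are prone to generating false information because of false or outdated knowledge. Since retraining LLMs is resource intensive, interest in model editing has grown. Despite the emergence of benchmarks and approaches, unidirectional editing and evaluation have failed to explore the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation of whether edited models can recall edited knowledge in the reverse direction. We introduce a new evaluation metric of reversibility and construct a benchmark named Bidirectional Assessment for Knowledge Editing (BAKE) to evaluate how well edited models recall knowledge in the reverse direction of editing. Surprisingly, current editing methods and LLMs recall edited knowledge well in the direction of editing but perform very poorly in the reverse direction. To mitigate the reversal curse, we propose a method named Bidirectionally Inversible Relationship moDeling (BIRD). We design a set of editing objectives that incorporate the bidirectional relationships between subject and object into the updated model weights. Experiments show that BIRD improves the performance of four LLMs of different sizes on question answering and judgement.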

Investigating Bias in Multilingual Language Models: Cross-Lingual Transfer of Debiasing Techniques

paper_url: http://arxiv.org/abs/2310.10310
repo_url: https://github.com/manon-reusens/multilingual_bias
paper_authors: Manon Reusens, Philipp Borchert, Margot Mieskes, Jochen De Weerdt, Bart Baesens
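for: This study investigates the cross-lingual transferability of debiasing techniques in multilingual models, covering English, French, German, and Dutch.
methods: Multilingual BERT (mBERT) is used to test whether debiasing techniques transfer across languages; the techniques prove transferable and yield promising results across languages.
results: Applying these techniques to non-English languages causes no performance disadvantage. Using translations of the CrowS-Pairs dataset, SentenceDebias is identified as the best technique across languages, reducing bias in mBERT by an average of 13%. Techniques with additional pretraining show enhanced cross-lingual effectiveness, particularly in lower-resource languages.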

Abstract
This paper investigates the transferability of debiasing techniques across different languages within multilingual models. We examine the applicability of these techniques in English, French, German, and Dutch. Using multilingual BERT (mBERT), we demonstrate that cross-lingual transfer of debiasing techniques is not only feasible but also yields promising results. Surprisingly, our findings reveal no performance disadvantages when applying these techniques to non-English languages. Using translations of the CrowS-Pairs dataset, our analysis identifies SentenceDebias as the best technique across different languages, reducing bias in mBERT by an average of 13%. We also find that debiasing techniques with additional pretraining exhibit enhanced cross-lingual effectiveness for the languages included in the analyses, particularly in lower-resource languages. These novel insights contribute to a deeper understanding of bias mitigation in multilingual language models and provide practical guidance for debiasing techniques in different language contexts.

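Summary
This paper investigates the transferability of debiasing techniques within multilingual models. We study English, French, German, and Dutch, using multilingual BERT (mBERT) to demonstrate the feasibility and effectiveness of cross-lingual transfer of debiasing techniques. Our results show that applying these techniques to non-English languages causes no performance disadvantage. Using translations of the CrowS-Pairs dataset, our analysis finds that SentenceDebias is the most effective technique across languages, reducing bias in mBERT. We also find that additional pretraining improves cross-lingual effectiveness, particularly for lower-resource languages. These findings contribute to a deeper understanding of bias mitigation in multilingual language models and offer practical guidance.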

Multi-Stage Pre-training Enhanced by ChatGPT for Multi-Scenario Multi-Domain Dialogue Summarization

paper_url: http://arxiv.org/abs/2310.10285
repo_url: https://github.com/zhouweixiao/mp4
paper_authors: Weixiao Zhou, Gengyao Li, Xianfu Cheng, Xinnian Liang, Junnan Zhu, Feifei Zhai, Zhoujun Li
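for: This study designs a new pre-trained model for multi-scenario multi-domain dialogue summarization, to improve the model's adaptability and dialogue summarization ability.
methods: The study adopts a multi-stage pre-training strategy to reduce the gap between the pre-training objective and the fine-tuning objective. It first conducts domain-aware pre-training on large-scale multi-scenario multi-domain dialogue data to enhance the adaptability of the pre-trained model, then conducts task-oriented pre-training on large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance its dialogue summarization ability.
results: Experimental results show that the pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.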

Abstract
Dialogue summarization involves a wide range of scenarios and domains. However, existing methods generally only apply to specific scenarios or domains. In this study, we propose a new pre-trained model specifically designed for multi-scenario multi-domain dialogue summarization. It adopts a multi-stage pre-training strategy to reduce the gap between the pre-training objective and fine-tuning objective. Specifically, we first conduct domain-aware pre-training using large-scale multi-scenario multi-domain dialogue data to enhance the adaptability of our pre-trained model. Then, we conduct task-oriented pre-training using large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance the dialogue summarization ability of our pre-trained model. Experimental results on three dialogue summarization datasets from different scenarios and domains indicate that our pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.

Summary
Dialogue summarization involves a wide range of scenarios and domains. However, existing methods generally only apply to specific scenarios or domains. In this study, we propose a new pre-trained model specifically designed for multi-scenario multi-domain dialogue summarization. It adopts a multi-stage pre-training strategy to reduce the gap between the pre-training objective and fine-tuning objective. Specifically, we first conduct domain-aware pre-training using large-scale multi-scenario multi-domain dialogue data to enhance the adaptability of our pre-trained model. Then, we conduct task-oriented pre-training using large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance the dialogue summarization ability of our pre-trained model. Experimental results on three dialogue summarization datasets from different scenarios and domains indicate that our pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.

Generative Calibration for In-context Learning

paper_url: http://arxiv.org/abs/2310.10266
repo_url: https://github.com/changmenseng/generative_calibration
paper_authors: Zhongtao Jiang, Yuanzhe Zhang, Cao Liu, Jun Zhao, Kang Liu
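for: This study aims to explain the prompt sensitivity of in-context learning, a widely used capability of LLMs, and proposes a generation-based calibration method to improve its performance.
methods: The study uses theoretical analysis and experiments to explain the sensitivity of in-context learning, and proposes a calibration method based on generation.
results: Adjusting the label marginal improves in-context learning performance and remains stable under different prompt configurations. Experiments show that the proposed method substantially outperforms ICL and existing calibration methods, by up to 27% absolute in macro-F1.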

Abstract
As one of the most exciting features of large language models (LLMs), in-context learning is a mixed blessing. While it allows users to fast-prototype a task solver with only a few training examples, the performance is generally sensitive to various configurations of the prompt such as the choice or order of the training examples. In this paper, we for the first time theoretically and empirically identify that such a paradox is mainly due to the label shift of the in-context model to the data distribution, in which LLMs shift the label marginal $p(y)$ while having a good label conditional $p(x|y)$. With this understanding, we can simply calibrate the in-context predictive distribution by adjusting the label marginal, which is estimated via Monte-Carlo sampling over the in-context model, i.e., generation of LLMs. We call our approach as generative calibration. We conduct exhaustive experiments with 12 text classification tasks and 12 LLMs scaling from 774M to 33B, generally find that the proposed method greatly and consistently outperforms the ICL as well as state-of-the-art calibration methods, by up to 27% absolute in macro-F1. Meanwhile, the proposed method is also stable under different prompt configurations.

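Summary
One of the most exciting features of LLMs is in-context learning, which lets users quickly prototype a task solver with only a few training examples. However, it also brings problems, such as sensitivity of performance to the choice and order of examples in the prompt. In this paper, we show for the first time, theoretically and empirically, that this paradox is mainly due to label shift of the in-context model relative to the data distribution: LLMs shift the label marginal $p(y)$ while keeping a good label conditional $p(x|y)$. With this understanding, we can calibrate the in-context predictive distribution simply by adjusting the label marginal, which is estimated via Monte-Carlo sampling over the in-context model, i.e., generation from the LLM. We call this approach generative calibration. We conduct experiments on 12 text classification tasks and 12 LLMs and find that our method substantially outperforms ICL and state-of-the-art calibration methods, by up to 27% absolute. Our method is also stable under different prompt configurations.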
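The label-marginal adjustment described in the abstract can be sketched in a few lines. This is a simplified illustration under assumptions: the label marginal is estimated by averaging the in-context model's predictive distribution over inputs sampled from the model itself, the target marginal is taken to be uniform, and the function names are hypothetical. The paper's exact estimator may differ.

```python
import numpy as np

def estimate_label_marginal(probs_on_generated):
    """Monte-Carlo estimate of p(y): average p(y | x_i) over inputs x_i
    that were generated by the in-context model (assumed given here)."""
    return np.asarray(probs_on_generated, dtype=float).mean(axis=0)

def calibrate(probs, label_marginal):
    """Divide out the estimated marginal and renormalise
    (equivalent to assuming a uniform target label marginal)."""
    weights = np.asarray(probs, dtype=float) / label_marginal
    return weights / weights.sum(axis=-1, keepdims=True)

# The model over-predicts label 0 on its own generations.
marginal = estimate_label_marginal([[0.9, 0.1], [0.7, 0.3]])
calibrated = calibrate([0.6, 0.4], marginal)
print(marginal, calibrated)
```

In this toy case the raw prediction favours label 0, but after the biased marginal is divided out the calibrated distribution favours label 1.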

Enhancing Interpretability using Human Similarity Judgements to Prune Word Embeddings

paper_url: http://arxiv.org/abs/2310.10262
repo_url: None
paper_authors: Natalia Flechas Manrique, Wanqian Bao, Aurelie Herbelot, Uri Hasson
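for: This paper presents an interpretability method for understanding the semantics underlying NLP systems.
methods: The method uses supervised learning to identify, for a given domain (e.g., sports, professions), a subset of model features that improves prediction of human similarity judgments. It keeps only 20-40% of the original features and retains different feature sets across 8 independent semantic domains.
results: The method helps interpret the semantics of NLP systems and explains how humans differentiate items within a domain. For example, humans differentiate sports based on how gender-inclusive and international they are. The retained features can also be used to predict semantic dimensions of words, such as cognitive, emotional, and social dimensions.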

Abstract
Interpretability methods in NLP aim to provide insights into the semantics underlying specific system architectures. Focusing on word embeddings, we present a supervised-learning method that, for a given domain (e.g., sports, professions), identifies a subset of model features that strongly improve prediction of human similarity judgments. We show this method keeps only 20-40% of the original embeddings, for 8 independent semantic domains, and that it retains different feature sets across domains. We then present two approaches for interpreting the semantics of the retained features. The first obtains the scores of the domain words (co-hyponyms) on the first principal component of the retained embeddings, and extracts terms whose co-occurrence with the co-hyponyms tracks these scores' profile. This analysis reveals that humans differentiate e.g. sports based on how gender-inclusive and international they are. The second approach uses the retained sets as variables in a probing task that predicts values along 65 semantically annotated dimensions for a dataset of 535 words. The features retained for professions are best at predicting cognitive, emotional and social dimensions, whereas features retained for fruits or vegetables best predict the gustation (taste) dimension. We discuss implications for alignment between AI systems and human knowledge.

Summary
Purpose: to provide methods for interpreting the semantics underlying specific system architectures in NLP. Method: for a given domain (e.g., sports, professions), a supervised-learning method identifies features that improve prediction of human similarity judgments; it keeps 20-40% of the original embeddings across 8 independent semantic domains. Two approaches then interpret the semantics of the retained features. The first computes the scores of the domain words (co-hyponyms) on the first principal component of the retained embeddings; this analysis reveals how humans differentiate, for example, sports. The second uses the retained sets in a probing task that predicts values along 65 semantically annotated dimensions; features retained for professions are best at predicting cognitive, emotional, and social dimensions, while features retained for fruits or vegetables best predict the gustation (taste) dimension. Conclusion: these results have implications for alignment between AI systems and human knowledge.

Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

paper_url: http://arxiv.org/abs/2310.10226
repo_url: https://github.com/gmftbygmftby/rep-dropout
paper_authors: Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, Yixuan Su
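for: This study seeks the root cause of the neural text degeneration problem, offering a simple explanation from the data perspective.
methods: The study examines repetitions in training data, selectively drops out attention to repetitive words during training, and validates the findings experimentally.
results: Experiments show a strong correlation between repetitions in training data and degeneration; penalizing repetitions in training data greatly reduces degeneration. Moreover, the effectiveness of prior approaches, including those based on high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be explained by their penalizing repetitions in training data.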

Abstract
There are a number of diverging hypotheses about the neural text degeneration problem, i.e., generating repetitive and dull loops, which makes this problem both interesting and confusing. In this work, we aim to advance our understanding by presenting a straightforward and fundamental explanation from the data perspective. Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data. Subsequent experiments also demonstrate that by selectively dropping out the attention to repetitive words in training data, degeneration can be significantly minimized. Furthermore, our empirical analysis illustrates that prior works addressing the degeneration issue from various standpoints, such as the high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be interpreted by one simple explanation. That is, penalizing the repetitions in training data is a common and fundamental factor for their effectiveness. Moreover, our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.

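Summary
There are many diverging hypotheses about the neural text degeneration problem, i.e., generating repetitive and dull loops, which makes the problem both interesting and confusing. In this work, we aim to advance understanding of the problem with a straightforward and fundamental explanation from the data perspective. Our preliminary investigation reveals a strong correlation between degeneration and the presence of repetitions in training data. Subsequent experiments show that selectively dropping out the attention to repetitive words in training data significantly reduces degeneration. Our empirical analysis further shows that prior work addressing degeneration from different standpoints, such as high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be interpreted by one simple explanation: penalizing repetitions in training data. Our experiments also show that penalizing repetitions in training data remains critical even with larger model sizes and instruction tuning.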
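To make "repetitions in training data" concrete, the sketch below computes a common n-gram repetition rate and a mask marking tokens that already appeared earlier in the sequence. Both are illustrative assumptions: the abstract does not define the authors' repetition measure or how they choose which attention entries to drop, and the function names are hypothetical.

```python
def repetition_rate(tokens, n=2):
    """Fraction of n-grams that duplicate an n-gram in the same sequence:
    1 - (unique n-grams / total n-grams)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def repeated_token_mask(tokens):
    """True at positions whose token already occurred earlier; such
    positions are candidates for attention dropout during training."""
    seen, mask = set(), []
    for tok in tokens:
        mask.append(tok in seen)
        seen.add(tok)
    return mask

looping = "a b a b a b".split()
varied = "the cat sat on a mat".split()
print(repetition_rate(looping), repetition_rate(varied))
print(repeated_token_mask("a b a c b".split()))
```

A corpus-level statistic can be obtained by averaging `repetition_rate` over training sequences, which is one way to relate data repetition to the degeneration observed in model outputs.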

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

paper_url: http://arxiv.org/abs/2310.10195
repo_url: https://github.com/openlmlab/lomo
paper_authors: Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu
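for: Lowering the hardware barrier to training large language models.
methods: Non-negative matrix factorization for second-order moment estimation, with grouped update normalization to stabilize convergence.
results: Performance on par with AdamW while reducing training memory usage.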

Abstract
Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.

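Summary
Large language models have achieved remarkable success, but their large parameter count requires substantial memory for training, which sets a high threshold. The recently proposed low-memory optimization (LOMO) reduces the memory footprint, but its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and converges suboptimally, failing to match the performance of AdamW, the prevailing optimizer for large language models. Empirical analysis shows that, compared with momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we propose low-memory optimization with adaptive learning rate (AdaLomo), which provides an adaptive learning rate for each parameter. To maintain memory efficiency, we use non-negative matrix factorization to estimate the second-order moment. We also suggest grouped update normalization to stabilize convergence. Our experiments show that AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, lowering the hardware barrier to training large language models.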
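The memory saving from factorizing the second-order moment can be illustrated with the rank-1 non-negative factorization used by Adafactor-style optimizers: for an m x n weight matrix, only a row vector and a column vector of squared-gradient sums are stored. This is a sketch under assumptions, not AdaLomo's exact update rule: moving averages are omitted, and the RMS-based `normalized_update` is one plausible reading of "grouped update normalization".

```python
import numpy as np

def factored_second_moment(grad):
    """Store only row and column sums of the squared gradient: m + n numbers."""
    sq = grad ** 2
    return sq.sum(axis=1), sq.sum(axis=0)

def reconstruct(row_sums, col_sums):
    """Rank-1 non-negative estimate of the full m x n second moment."""
    return np.outer(row_sums, col_sums) / row_sums.sum()

def normalized_update(grad, row_sums, col_sums, eps=1e-30):
    """Adaptive per-parameter step, rescaled so that the RMS of the update
    for this parameter group is at most 1 (assumed form of normalization)."""
    update = grad / (np.sqrt(reconstruct(row_sums, col_sums)) + eps)
    rms = np.sqrt(np.mean(update ** 2))
    return update / max(1.0, rms)

# A rank-1 gradient, for which the factorization is exact.
g = np.outer(np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0]))
r, c = factored_second_moment(g)
u = normalized_update(g, r, c)
print(r.size + c.size, g.size)  # stored statistics vs. full second moment
```

For a 4096 x 4096 matrix this stores 8,192 statistics instead of about 16.8 million, which is where the memory reduction relative to Adam-style optimizers comes from.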

VIBE: Topic-Driven Temporal Adaptation for Twitter Classification

paper_url: http://arxiv.org/abs/2310.10191
repo_url: https://github.com/CelestineZYJ/VIBE-Temporal-Adaptation
paper_authors: Yuji Zhang, Jing Li, Wenjie Li
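for: Addressing the deterioration of text classification performance on real-world social media caused by language evolution.
methods: Variational Information Bottleneck (IB) regularizers model latent topic evolution for temporal adaptation, combined with multi-task training on timestamp and class label prediction.
results: On three Twitter classification tasks, the model, with only 3% of data, significantly outperforms previous state-of-the-art continued-pretraining methods.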

Abstract
Language features are evolving in real-world social media, resulting in the deteriorating performance of text classification in dynamics. To address this challenge, we study temporal adaptation, where models trained on past data are tested in the future. Most prior work focused on continued pretraining or knowledge updating, which may compromise their performance on noisy social media data. To tackle this issue, we reflect feature change via modeling latent topic evolution and propose a novel model, VIBE: Variational Information Bottleneck for Evolutions. Concretely, we first employ two Information Bottleneck (IB) regularizers to distinguish past and future topics. Then, the distinguished topics work as adaptive features via multi-task training with timestamp and class label prediction. In adaptive learning, VIBE utilizes retrieved unlabeled data from online streams created posterior to training data time. Substantial Twitter experiments on three classification tasks show that our model, with only 3% of data, significantly outperforms previous state-of-the-art continued-pretraining methods.

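Summary
Language features evolve in real-world social media, degrading text classification performance. To address this challenge, we study temporal adaptation, in which models trained on past data are tested on future data. Most prior work concentrates on continued pretraining or knowledge updating, which may compromise performance on noisy social media data. To tackle this, we capture feature change by modeling latent topic evolution and propose a new model, VIBE: Variational Information Bottleneck for Evolutions. Specifically, we first use two Information Bottleneck (IB) regularizers to distinguish past and future topics. The distinguished topics then serve as adaptive features through multi-task training with timestamp and class-label prediction. In adaptive learning, VIBE uses unlabeled data retrieved from online streams created after the training data's time span. Experiments on three Twitter classification tasks show that our model, using only 3% of the data, clearly outperforms previous continued-pretraining methods.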

TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models

paper_url: http://arxiv.org/abs/2310.10180
repo_url: https://github.com/menik1126/TRIGO
paper_authors: Jing Xiong, Jianhao Shen, Ye Yuan, Haiming Wang, Yichun Yin, Zhengying Liu, Lin Li, Zhijiang Guo, Qingxing Cao, Yinya Huang, Chuanyang Zheng, Xiaodan Liang, Ming Zhang, Qun Liu
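for: Examining the reasoning ability, including mathematical reasoning, of advanced generative language models.
methods: Proposes an ATP benchmark built on the Lean formal language system that evaluates a model's reasoning over formulas and its ability to manipulate, group, and factor number terms.
results: Extensive experiments on advanced generative language models show that TRIGO challenges them, GPT-4 included, and offers a new tool for studying their formal and mathematical reasoning.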

Abstract
Automated theorem proving (ATP) has become an appealing domain for exploring the reasoning ability of the recent successful generative language models. However, current ATP benchmarks mainly focus on symbolic inference and rarely involve the understanding of complex number combination reasoning. In this work, we propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proofs but also evaluates a generative LM's reasoning ability on formulas and its capability to manipulate, group, and factor number terms. We gather trigonometric expressions and their reduced forms from the web, annotate the simplification process manually, and translate it into the Lean formal language system. We then automatically generate additional examples from the annotated samples to expand the dataset. Furthermore, we develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulties and distributions in order to thoroughly analyze the model's generalization ability. Our extensive experiments show that TRIGO poses a new challenge for advanced generative LMs, including GPT-4, which is pre-trained on a considerable amount of open-source formal theorem-proving language data, and provides a new tool to study generative LMs' ability in both formal and mathematical reasoning.

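Summary
Automated theorem proving (ATP) has become an appealing domain for exploring the reasoning ability of recent successful generative language models. Current ATP benchmarks, however, focus mainly on symbolic inference and rarely involve understanding complex number combination reasoning. In this work, we propose TRIGO, an ATP benchmark that requires a model to reduce a trigonometric expression with step-by-step proofs and evaluates a generative LM's reasoning ability on formulas, including manipulating, grouping, and factoring number terms. We collect trigonometric expressions and their reduced forms from the web and manually annotate the simplification process. We then translate the samples into the Lean formal language system and automatically generate additional examples to expand the dataset. We also develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulty and distribution, so as to thoroughly analyze the model's generalization ability. Our extensive experiments show that TRIGO poses a new challenge for advanced generative LMs, including GPT-4, and provides a new tool for studying generative LMs' formal and mathematical reasoning.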

Joint Music and Language Attention Models for Zero-shot Music Tagging

paper_url: http://arxiv.org/abs/2310.10159
repo_url: None
paper_authors: Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong
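for: This paper proposes a zero-shot music tagging system for the open-set music tagging problem.
methods: The system uses a joint music and language attention (JMLA) model consisting of an audio encoder based on a pretrained masked autoencoder and a Falcon7B decoder. A perceiver resampler converts arbitrary-length audio into fixed-length representations, and dense attention connections between encoder and decoder layers improve the information flow between them.
results: The JMLA model is trained on a large-scale music and description dataset collected from the internet, with ChatGPT used to convert the raw descriptions into formalized and diverse descriptions. The proposed system achieves 64.82% zero-shot tagging accuracy on the GTZAN dataset, outperforming previous zero-shot systems, and is comparable to previous systems on the FMA and MagnaTagATune datasets.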

Abstract
Music tagging is the task of predicting the tags of music recordings. However, previous music tagging research primarily focuses on closed-set music tagging tasks, which cannot be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by a Falcon7B. We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet. We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and the MagnaTagATune datasets.

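Summary
Music tagging is the task of predicting the tags of music recordings. However, previous music tagging research has concentrated on closed-set tagging tasks, which cannot generalize to new tags. In this work, we propose a zero-shot music tagging system based on a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model comprises a pretrained masked autoencoder as the audio encoder and a Falcon7B decoder. We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings, and dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet and propose using ChatGPT to convert the raw descriptions into formalized and diverse descriptions for training JMLA models. The proposed JMLA system achieves 64.82% zero-shot music tagging accuracy on the GTZAN dataset, outperforming previous zero-shot systems, and obtains results comparable to previous systems on the FMA and MagnaTagATune datasets.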

DNA: Denoised Neighborhood Aggregation for Fine-grained Category Discovery

paper_url: http://arxiv.org/abs/2310.10151
repo_url: https://github.com/Lackel/DNA
paper_authors: Wenbin An, Feng Tian, Wenkai Shi, Yan Chen, Qinghua Zheng, QianYing Wang, Ping Chen
for: bridging the gap between fine-grained analysis and high annotation cost
methods: self-supervised framework that encodes semantic structures of data into the embedding space, with three principles to filter out false neighbors
results: retrieves more accurate neighbors and outperforms state-of-the-art models by a large margin (average 9.96% improvement on three metrics)

Abstract
Discovering fine-grained categories from coarsely labeled data is a practical and challenging task, which can bridge the gap between the demand for fine-grained analysis and the high annotation cost. Previous works mainly focus on instance-level discrimination to learn low-level features, but ignore semantic similarities between data, which may prevent these models from learning compact cluster representations. In this paper, we propose Denoised Neighborhood Aggregation (DNA), a self-supervised framework that encodes semantic structures of data into the embedding space. Specifically, we retrieve k-nearest neighbors of a query as its positive keys to capture semantic similarities between data and then aggregate information from the neighbors to learn compact cluster representations, which can make fine-grained categories more separable. However, the retrieved neighbors can be noisy and contain many false-positive keys, which can degrade the quality of learned embeddings. To cope with this challenge, we propose three principles to filter out these false neighbors for better representation learning. Furthermore, we theoretically justify that the learning objective of our framework is equivalent to a clustering loss, which can capture semantic similarities between data to form compact fine-grained clusters. Extensive experiments on three benchmark datasets show that our method can retrieve more accurate neighbors (21.31% accuracy improvement) and outperform state-of-the-art models by a large margin (average 9.96% improvement on three metrics). Our code and data are available at https://github.com/Lackel/DNA.
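The retrieve, filter, and aggregate steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine similarity, the mean aggregation, and in particular the `filter_neighbors` rule (keep only neighbors that share the query's coarse label) are assumptions chosen for the example, since the abstract does not spell out the three filtering principles.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn(query_idx, embeddings, k):
    """Indices of the k items most similar to the query, excluding the query itself."""
    scored = [(cosine(embeddings[query_idx], e), i)
              for i, e in enumerate(embeddings) if i != query_idx]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]

def filter_neighbors(query_idx, neighbors, coarse_labels):
    """Hypothetical denoising rule (an assumption): drop neighbors whose
    coarse label differs from the query's coarse label."""
    return [i for i in neighbors if coarse_labels[i] == coarse_labels[query_idx]]

def aggregate(neighbors, embeddings):
    """Mean of the neighbor embeddings, used here as the aggregated positive target."""
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in neighbors) / len(neighbors)
            for d in range(dim)]

# Toy data (hypothetical): four 2-d embeddings with coarse labels.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
coarse_labels = ["animal", "animal", "plant", "plant"]

neighbors = knn(0, embeddings, k=2)                   # raw, possibly noisy neighbors
kept = filter_neighbors(0, neighbors, coarse_labels)  # neighbors after filtering
target = aggregate(kept, embeddings)                  # aggregated representation
```

In this toy run the second retrieved neighbor is discarded because its coarse label differs from the query's, which is the kind of false positive the filtering step is meant to remove.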

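Summary
Discovering fine-grained categories from coarsely labeled data is a practical and challenging task that can bridge the gap between the demand for fine-grained analysis and the high cost of annotation. Previous work mainly relies on instance-level discrimination to learn low-level features but ignores semantic similarities between data, which may keep these models from learning compact cluster representations. In this paper, we propose Denoised Neighborhood Aggregation (DNA), a self-supervised framework that encodes the semantic structure of data into the embedding space. Specifically, we retrieve the k nearest neighbors of a query as its positive keys to capture semantic similarities, then aggregate information from the neighbors to learn compact cluster representations. The retrieved neighbors, however, may contain many false-positive keys, which degrades the quality of the learned embeddings. To address this, we propose three principles for filtering out false neighbors. We also show that our learning objective is equivalent to a clustering loss, which captures semantic similarities between data to form compact fine-grained clusters. Extensive experiments on three benchmark datasets show that our method retrieves more accurate neighbors (a 21.31% accuracy improvement) and outperforms state-of-the-art models (an average improvement of 9.96%). Our code and data are available at https://github.com/Lackel/DNA.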

Node-based Knowledge Graph Contrastive Learning for Medical Relationship Prediction

paper_url: http://arxiv.org/abs/2310.10138
repo_url: https://github.com/zhi520/nc-kge
paper_authors: Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen
for: Enhancing the distinctiveness of knowledge graph embeddings (KGEs) and improving the performance of downstream tasks such as predicting drug combinations and reasoning about disease-drug relationships.
methods: The paper proposes a novel node-based contrastive learning method for KGE, called NC-KGE, which constructs appropriate contrastive node pairs on knowledge graphs (KGs) and integrates a relation-aware attention mechanism to focus on semantic relationships and node interactions.
results: The paper shows that NC-KGE performs competitively with state-of-the-art models on public datasets and outperforms all baselines in biomedical relationship prediction tasks, especially in predicting drug combination relationships.

Abstract
The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning about disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL, especially within Knowledge Graphs (KGs), has been challenging. In this paper, we propose a novel node-based contrastive learning method for knowledge graph embedding, NC-KGE. NC-KGE enhances knowledge extraction in embeddings and speeds up training convergence by constructing appropriate contrastive node pairs on KGs. This scheme can be easily integrated with other knowledge graph embedding (KGE) methods. For downstream tasks such as biochemical relationship prediction, we have incorporated a relation-aware attention mechanism into NC-KGE, focusing on the semantic relationships and node interactions. Extensive experiments show that NC-KGE performs competitively with state-of-the-art models on public datasets like FB15k-237 and WN18RR. Particularly in biomedical relationship prediction tasks, NC-KGE outperforms all baselines on datasets such as PharmKG8k-28, DRKG17k-21, and BioKG72k-14, especially in predicting drug combination relationships. We release our code at https://github.com/zhi520/NC-KGE.

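Summary
Embedding Biomedical Knowledge Graphs (BKGs) produces robust representations that benefit many artificial intelligence applications, such as predicting drug combinations and reasoning about disease-drug relationships. Constructing suitable contrastive pairs within Knowledge Graphs (KGs), however, is challenging. In this paper, we propose a new node-based contrastive learning method for knowledge graph embedding, NC-KGE. By constructing appropriate contrastive node pairs on KGs, NC-KGE improves knowledge extraction in embeddings and speeds up training convergence. The method can be easily combined with other knowledge graph embedding (KGE) methods. For downstream tasks, we incorporate a relation-aware attention mechanism into NC-KGE that focuses on semantic relationships and node interactions in the knowledge graph. Extensive experiments on the public datasets FB15k-237 and WN18RR show that NC-KGE is competitive with state-of-the-art models. In biomedical relationship prediction tasks in particular, NC-KGE performs strongly on datasets such as PharmKG8k-28, DRKG17k-21, and BioKG72k-14, especially when predicting drug combination relationships. We release our code at https://github.com/zhi520/NC-KGE.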

Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset

paper_url: http://arxiv.org/abs/2310.10118
repo_url: None
paper_authors: Arthur Amalvy, Vincent Labatut, Richard Dufour
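for: Improving the accuracy of named entity recognition (NER), especially on long documents.
methods: Uses Alpaca, an instruction-tuned large language model (LLM), to generate a synthetic context retrieval training dataset, then trains a BERT-based neural context retriever on it.
results: On an English literary dataset (the first chapter of 40 books), the method outperforms several retrieval baselines and improves NER accuracy.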

Abstract
While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.

Summary
Recent pre-trained transformer-based models perform named entity recognition (NER) with high accuracy, but their limited range is a problem on long documents such as whole novels. One remedy is to retrieve relevant context at the document level. Because no supervision exists for this retrieval task, one would otherwise have to settle for unsupervised approaches; instead, we use Alpaca, an instruction-tuned large language model (LLM), to generate a synthetic context retrieval training dataset, and then train a neural context retriever based on a BERT model to find relevant context for NER. Our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.

End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis

paper_url: http://arxiv.org/abs/2310.10106
repo_url: None
paper_authors: Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent
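for: This work develops an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with a speaker-attributed Transformer-based decoder and efficiently integrates the ASR and speaker identification modules in a multichannel setting.
methods: The system uses a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder.
results: On simulated mixtures of LibriSpeech data, the system reduces the word error rate (WER) by up to 12% and 16% relative, compared with previous single-channel and multichannel approaches, respectively. The work also investigates how different input features affect ASR performance, and experiments confirm the system's effectiveness for real-world multichannel meeting transcription.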

Abstract
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder. To the best of our knowledge, this is the first model that efficiently integrates ASR and speaker identification modules in a multichannel setting. On simulated mixtures of LibriSpeech data, our system reduces the word error rate (WER) by up to 12% and 16% relative compared to previously proposed single-channel and multichannel approaches, respectively. Furthermore, we investigate the impact of different input features, including multichannel magnitude and phase information, on the ASR performance. Finally, our experiments on the AMI corpus confirm the effectiveness of our system for real-world multichannel meeting transcription.
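Since the headline numbers are relative WER reductions, a short sketch of how WER and a relative reduction are commonly computed may help. This is the standard word-level edit-distance definition, not code from the paper, and the baseline and system WER values in the usage lines are hypothetical numbers picked only to illustrate what a 12% relative reduction means.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.
    Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Relative improvement of new_wer over baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

# One deleted word out of six reference words.
example_wer = wer("the cat sat on the mat", "the cat sat on mat")

# Hypothetical values: going from 25% to 22% WER is a 12% relative reduction.
example_gain = relative_reduction(0.25, 0.22)
```

A relative reduction is measured against the baseline's own error rate, so a 12% relative gain is a much smaller change in absolute WER points.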

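Summary
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that uses a Conformer-based encoder and a speaker-attributed Transformer-based decoder. To our knowledge, it is the first model to integrate the ASR and speaker identification modules in a multichannel setting. On simulated mixtures of LibriSpeech data, our system reduces the word error rate (WER) by up to 12% and 16% relative, compared with previously proposed single-channel and multichannel approaches, respectively. We also study how different input features, including multichannel magnitude and phase information, affect ASR performance. Finally, experiments on the AMI corpus confirm the effectiveness of our system for real-world multichannel meeting transcription.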

Decomposed Prompt Tuning via Low-Rank Reparameterization

paper_url: http://arxiv.org/abs/2310.10094
repo_url: https://github.com/xyaoooo/dpt
paper_authors: Yao Xiao, Lu Xu, Jiaxi Li, Wei Lu, Xiaoli Li
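for: Improving the efficiency and accuracy of prompt tuning.
methods: Initializes the soft prompt with low-rank matrices.
results: Experiments in both high-resource and low-resource scenarios demonstrate the effectiveness of the proposed method.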

Abstract
While prompt tuning approaches have achieved competitive performance with high efficiency, we observe that they invariably employ the same initialization process, wherein the soft prompt is either randomly initialized or derived from an existing embedding vocabulary. In contrast to these conventional methods, this study aims to investigate an alternative way to derive soft prompt. Our empirical studies show that the soft prompt typically exhibits a low intrinsic rank characteristic. With such observations, we propose decomposed prompt tuning, a novel approach that utilizes low-rank matrices to initialize the soft prompt. Through the low-rank reparameterization, our method significantly reduces the number of trainable parameters while maintaining effectiveness. Experimental results on the SuperGLUE benchmark in both high-resource and low-resource scenarios demonstrate the effectiveness of the proposed method.
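To make the parameter saving concrete, here is a minimal sketch of a low-rank reparameterization of a soft prompt, where the prompt matrix is composed as the product of two small matrices. The prompt length, hidden size, rank, and Gaussian initialization scale below are hypothetical values chosen for illustration; the abstract does not state the authors' settings.

```python
import random

def init_low_rank_prompt(prompt_len, hidden_dim, rank, seed=0):
    """Create factors A (prompt_len x rank) and B (rank x hidden_dim).
    The Gaussian scale 0.02 is an assumption, not taken from the paper."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(prompt_len)]
    b = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)] for _ in range(rank)]
    return a, b

def compose_prompt(a, b):
    """Soft prompt P = A @ B with shape (prompt_len, hidden_dim)."""
    rank, hidden_dim = len(b), len(b[0])
    prompt = []
    for a_row in a:
        row = [sum(a_row[k] * b[k][j] for k in range(rank)) for j in range(hidden_dim)]
        prompt.append(row)
    return prompt

def trainable_params(prompt_len, hidden_dim, rank=None):
    """Parameter count of a full soft prompt (rank=None) or its low-rank factors."""
    if rank is None:
        return prompt_len * hidden_dim
    return rank * (prompt_len + hidden_dim)

# Small factors so the pure-Python product stays cheap.
a, b = init_low_rank_prompt(prompt_len=4, hidden_dim=8, rank=2)
prompt = compose_prompt(a, b)

# Hypothetical sizes: a 100-token prompt in a 768-d model, rank 10.
full_count = trainable_params(100, 768)
low_rank_count = trainable_params(100, 768, rank=10)
```

With these illustrative sizes the factorized prompt trains 8,680 parameters instead of 76,800, because the count grows with `rank * (prompt_len + hidden_dim)` rather than `prompt_len * hidden_dim`.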

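Summary
Prompt tuning approaches achieve competitive performance with high efficiency, but we observe that they invariably use the same initialization process, in which the soft prompt is either randomly initialized or derived from an existing embedding vocabulary. In contrast, this study investigates an alternative way to derive the soft prompt. Our empirical studies show that the soft prompt typically exhibits a low intrinsic rank. Based on this observation, we propose decomposed prompt tuning, a new approach that initializes the soft prompt with low-rank matrices. Through this low-rank reparameterization, our method reduces the number of trainable parameters while maintaining effectiveness. Experimental results on the SuperGLUE benchmark demonstrate the effectiveness of our method in both high-resource and low-resource scenarios.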

JMedLoRA: Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning

paper_url: http://arxiv.org/abs/2310.10083
repo_url: None
paper_authors: Issey Sukeda, Masahiro Suzuki, Hiroki Sakaji, Satoshi Kodera
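for: This study explores how to adapt large language models (LLMs) to the medical domain and how domain adaptation can improve model performance.
methods: The study uses LoRA-based instruction-tuning to adapt LLMs so that they absorb medical domain-specific knowledge.
results: LoRA-based instruction-tuning can partially incorporate medical domain-specific knowledge into LLMs, with larger models showing more pronounced effects. The results also point to the potential of adapting English-centric models for Japanese applications, while highlighting the limitations of Japanese-centric models. These findings can help medical institutions fine-tune and operate models without relying on external services.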

Abstract
In the ongoing wave of impact driven by large language models (LLMs) like ChatGPT, the adaptation of LLMs to medical domain has emerged as a crucial research frontier. Since mainstream LLMs tend to be designed for general-purpose applications, constructing a medical LLM through domain adaptation is a huge challenge. While instruction-tuning is used to fine-tune some LLMs, its precise roles in domain adaptation remain unknown. Here we show the contribution of LoRA-based instruction-tuning to performance in Japanese medical question-answering tasks. In doing so, we employ a multifaceted evaluation for multiple-choice questions, including scoring based on "Exact match" and "Gestalt distance" in addition to the conventional accuracy. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models demonstrating more pronounced effects. Furthermore, our results underscore the potential of adapting English-centric models for Japanese applications in domain adaptation, while also highlighting the persisting limitations of Japanese-centric models. This initiative represents a pioneering effort in enabling medical institutions to fine-tune and operate models without relying on external services.
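The "Exact match" and "Gestalt distance" scores mentioned above can be approximated with Python's standard library, whose `difflib.SequenceMatcher` implements a gestalt-style pattern matching similarity. The sketch below is only an assumption about how such scoring might look: the abstract does not define the metrics precisely, and both the distance definition (one minus the similarity ratio) and the `closest_choice` helper are hypothetical.

```python
from difflib import SequenceMatcher

def exact_match(prediction, gold):
    """True when the prediction equals the gold answer after trimming whitespace."""
    return prediction.strip() == gold.strip()

def gestalt_similarity(prediction, gold):
    """Similarity in [0, 1] from difflib's gestalt pattern matching."""
    return SequenceMatcher(None, prediction, gold).ratio()

def gestalt_distance(prediction, gold):
    """Assumed definition: one minus the similarity ratio."""
    return 1.0 - gestalt_similarity(prediction, gold)

def closest_choice(prediction, choices):
    """Hypothetical helper: map free-form output to the most similar answer option."""
    return max(choices, key=lambda c: gestalt_similarity(prediction, c))

# Toy multiple-choice example with made-up options.
choices = ["aspirin", "ibuprofen", "warfarin"]
picked = closest_choice("aspirin 100mg", choices)
```

A similarity-based score of this kind gives partial credit when a model's output contains the right option with extra text, which exact match alone would count as wrong.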

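Summary
Amid the wave of impact driven by large language models (LLMs) such as ChatGPT, adapting LLMs to the medical domain has become a crucial research frontier. Because mainstream LLMs are usually designed for general-purpose applications, building a medical LLM through domain adaptation is a major challenge. Instruction-tuning is used to fine-tune some LLMs, but its precise role in domain adaptation remains unknown. Here we show the contribution of LoRA-based instruction-tuning to Japanese medical question-answering tasks. To do so, we use a multifaceted evaluation that scores answers by "Exact match" and "Gestalt distance" in addition to conventional accuracy. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models showing more pronounced effects. Our results also underline the potential of adapting English-centric models for Japanese applications, while highlighting the limitations of Japanese-centric models. This work is a pioneering effort toward enabling medical institutions to fine-tune and operate models without relying on external services.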

Let’s reward step by step: Step-Level reward model as the Navigators for Reasoning

paper_url: http://arxiv.org/abs/2310.10080
repo_url: None
paper_authors: Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, Hongxia Yang
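for: This study examines whether providing feedback or search mechanisms during inference can improve the accuracy of multi-step reasoning with Large Language Models (LLMs).
methods: The study uses the Process-Supervised Reward Model (PRM), which typically gives LLMs step-level feedback during training, akin to Proximal Policy Optimization (PPO) or rejection sampling, and proposes a heuristic greedy search algorithm that uses the PRM's step-level feedback to optimize the reasoning paths LLMs explore in multi-step tasks.
results: The tailored PRM achieves better results on the mathematical benchmarks GSM8K and MATH, with similar improvements on code generation tasks. The authors also develop a method that automatically generates step-level reward data for code generation. The results indicate that the reward-model-based approach is robust across reasoning tasks.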

Abstract
Recent years have seen considerable advancements in multi-step reasoning with Large Language Models (LLMs). Previous studies have elucidated the merits of integrating feedback or search mechanisms during model inference to improve reasoning accuracy. The Process-Supervised Reward Model (PRM) typically furnishes LLMs with step-by-step feedback during the training phase, akin to Proximal Policy Optimization (PPO) or rejection sampling. Our objective is to examine the efficacy of PRM in the inference phase to help discern the optimal solution paths for multi-step tasks such as mathematical reasoning and code generation. To this end, we propose a heuristic greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. This tailored PRM demonstrated enhanced results compared to the Chain of Thought (CoT) on mathematical benchmarks like GSM8K and MATH. Additionally, to explore the versatility of our approach, we develop a novel method to automatically generate step-level reward datasets for coding tasks and observe similarly improved performance in code generation tasks. These results highlight the robustness of our reward-model-based approach to inference for reasoning tasks.
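The idea of using step-level rewards to steer inference can be sketched as a simple greedy loop: propose candidate next steps, score each with the reward model, keep the best, and repeat. This is an illustrative reading of "heuristic greedy search", not the authors' algorithm. The `propose_steps`, `score_step`, and `is_final` callables are hypothetical stand-ins for an LLM sampler, a PRM, and a stopping test, and the toy candidates and scores are made up.

```python
def greedy_prm_search(question, propose_steps, score_step, is_final, max_steps=8):
    """Build a reasoning path one step at a time, always taking the candidate
    step that the step-level reward model scores highest."""
    path = []
    for _ in range(max_steps):
        candidates = propose_steps(question, path)
        if not candidates:
            break
        best = max(candidates, key=lambda step: score_step(question, path, step))
        path.append(best)
        if is_final(best):
            break
    return path

# Toy stand-ins (hypothetical): fixed candidates per depth and a lookup-table "PRM".
TOY_CANDIDATES = [
    ["2 + 3 = 5", "2 + 3 = 6"],
    ["5 * 4 = 20", "5 * 4 = 25"],
    ["Answer: 20", "Answer: 25"],
]
TOY_SCORES = {
    "2 + 3 = 5": 0.9, "2 + 3 = 6": 0.1,
    "5 * 4 = 20": 0.8, "5 * 4 = 25": 0.2,
    "Answer: 20": 0.95, "Answer: 25": 0.3,
}

def toy_propose(question, path):
    """Return the candidate steps for the current depth, or none when exhausted."""
    return TOY_CANDIDATES[len(path)] if len(path) < len(TOY_CANDIDATES) else []

def toy_score(question, path, step):
    """Look up the made-up step reward."""
    return TOY_SCORES[step]

def toy_is_final(step):
    """Stop once a step states the final answer."""
    return step.startswith("Answer:")

solution = greedy_prm_search("What is (2 + 3) * 4?", toy_propose, toy_score, toy_is_final)
```

Greedy selection keeps only one path alive, so it costs one round of scoring per step, but an early mis-scored step cannot be undone later.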

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

paper_url: http://arxiv.org/abs/2310.10077
repo_url: None
paper_authors: Shuyu Jiang, Xingshu Chen, Rui Tang
for: This paper aims to reveal the vulnerability of large language models (LLMs) to compositional instruction attacks that can elicit harmful content, despite current approaches that focus on detecting and training against harmful prompts.
methods: The paper introduces an innovative technique called Compositional Instruction Attacks (CIA), which combines and encapsulates multiple instructions to hide harmful prompts within harmless ones. Two transformation methods, T-CIA and W-CIA, are also proposed to disguise harmful instructions as talking or writing tasks.
results: The paper achieves an attack success rate of 95%+ on safety assessment datasets and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed), and 91%+ for ChatGLM2 on harmful prompt datasets, demonstrating the effectiveness of CIA in eliciting harmful content from LLMs.

Abstract
Recently, large language models (LLMs) with powerful general capabilities have been increasingly integrated into various Web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. Unfortunately, they still risk generating harmful content, such as hate speech and content about criminal activities, in practical applications. Current approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. However, they typically focus on "superficial" harmful prompts with a solitary intent, ignoring composite attack instructions with multiple intentions that can easily elicit harmful content in real-world scenarios. In this paper, we introduce an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which refers to attacking by combination and encapsulation of multiple instructions. CIA hides harmful prompts within instructions of harmless intentions, making it impossible for the model to identify underlying malicious intentions. Furthermore, we implement two transformation methods, known as T-CIA and W-CIA, to automatically disguise harmful instructions as talking or writing tasks, making them appear harmless to LLMs. We evaluated CIA on GPT-4, ChatGPT, and ChatGLM2 with two safety assessment datasets and two harmful prompt datasets. It achieves an attack success rate of 95%+ on safety assessment datasets, and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed) and ChatGLM2-6B on harmful prompt datasets. Our approach reveals the vulnerability of LLMs to such compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to LLM security development. Warning: this paper may contain offensive or upsetting content!

Summary
Large language models (LLMs) with strong overall capabilities have been increasingly integrated into various web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. However, they still face the risk of generating harmful content such as hate speech and criminal activities in practical applications. Existing approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. However, they typically focus on the "superficial" harmful prompts with a single intent, ignoring composite attack instructions with multiple intentions that can easily elicit harmful content in real-world scenarios. In this paper, we propose an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which refers to attacking by combining and encapsulating multiple instructions. CIA hides harmful prompts within instructions of harmless intentions, making it impossible for the model to identify the underlying malicious intentions. Furthermore, we implement two transformation methods, known as T-CIA and W-CIA, to automatically disguise harmful instructions as talking or writing tasks, making them appear harmless to LLMs. We evaluated CIA on GPT-4, ChatGPT, and ChatGLM2 with two safety assessment datasets and two harmful prompt datasets. It achieved an attack success rate of 95%+ on safety assessment datasets and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed), and 91%+ for ChatGLM2 on harmful prompt datasets. Our approach reveals the vulnerability of LLMs to such compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to LLM security development. Warning: this paper may contain offensive or upsetting content!

Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation

paper_url: http://arxiv.org/abs/2310.10698
repo_url: None
paper_authors: Yingwei Ma, Yue Yu, Shanshan Li, Yu Jiang, Yong Guo, Yuanliang Zhang, Yutao Xie, Xiangke Liao
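for: Improving the accuracy of automated code generation by making fuller use of the semantic mapping ability of large language models (LLMs).
methods: Proposes the "Semantic Chain-of-Thought" (SeCoT) approach, in which the LLM itself derives semantic information about source code (such as data flow and control flow) to improve code generation accuracy.
results: Achieves state-of-the-art performance on three DL benchmarks, showing that SeCoT helps large LLMs generate more accurate code.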

Abstract
Large language models (LLMs) have showcased remarkable prowess in code generation. However, automated code generation is still challenging since it requires a high-level semantic mapping between natural language requirements and code. Most existing LLM-based approaches for code generation rely on decoder-only causal language models that often treat code merely as plain text tokens, i.e., feeding the requirements as a prompt input and outputting code as a flat sequence of tokens, potentially missing the rich semantic features inherent in source code. To bridge this gap, this paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code. Our motivation is that the semantic information of the source code (e.g., data flow and control flow) describes more precise program execution behavior, intention, and function. By guiding LLMs to consider and integrate semantic information, we can achieve a more granular understanding and representation of code, enhancing code generation accuracy. Meanwhile, while traditional techniques leveraging such semantic information require complex static or dynamic code analysis to obtain features such as data flow and control flow, SeCoT demonstrates that this process can be fully automated via the intrinsic capabilities of LLMs (i.e., in-context learning), while being generalizable and applicable to challenging domains. While SeCoT can be applied with different LLMs, this paper focuses on the powerful GPT-style models: ChatGPT (a closed-source model) and WizardCoder (an open-source model). The experimental study on three popular DL benchmarks (i.e., HumanEval, HumanEval-ET, and MBPP) shows that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.

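Summary
Large language models (LLMs) have shown strong code generation ability. Automated code generation nevertheless remains challenging because it requires a high-level semantic mapping between natural language requirements and code. Existing LLM-based approaches to code generation rely on causal language models that usually treat code as plain text tokens, taking the requirements as the prompt and emitting code as a flat token sequence, which may miss the rich semantic features of code. To bridge this gap, this paper proposes the "Semantic Chain-of-Thought" approach (SeCoT), which guides LLMs to consider and integrate semantic information and thereby improves code generation accuracy. Our motivation is that the semantic information of source code (for example, data flow and control flow) describes program execution behavior, intention, and function more precisely. Traditional techniques need complex static or dynamic code analysis to obtain features such as data flow and control flow; SeCoT shows that this process can be automated through the intrinsic capabilities of LLMs (that is, in-context learning) and applied broadly to challenging domains. SeCoT can be combined with different LLMs; this paper focuses on powerful GPT-style models: ChatGPT (a closed-source model) and WizardCoder (an open-source model). Experiments on three popular DL benchmarks (HumanEval, HumanEval-ET, and MBPP) show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.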

EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

paper_url: http://arxiv.org/abs/2310.10050
repo_url: None
paper_authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell
for: liberating public domain texts at scale
methods: EffOCR (EfficientOCR), a novel open-source OCR package that uses a character or word-level image retrieval approach, is accurate and sample efficient to train and deploy
results: EffOCR was used to digitize 20 million historical U.S. newspaper scans with high accuracy, and achieved zero-shot performance on randomly selected documents from the U.S. National Archives, as well as accurately digitizing Japanese documents that other OCR solutions failed on.

Abstract
Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character or word-level image retrieval problem. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR also allows for easy, sample efficient customization with a simple model training interface and minimal labeling requirements due to its sample efficiency. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents for which all other OCR solutions failed.
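Treating OCR as character-level image retrieval can be pictured as a nearest-neighbor lookup: each cropped character is embedded by a vision model and matched against an index of reference glyph embeddings. The sketch below illustrates that idea only. It is not the EffOCR API, and the three-dimensional embeddings, the `glyph_index` dictionary, and the `recognize` function are hypothetical stand-ins for what a learned vision encoder would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recognize(crop_embeddings, glyph_index):
    """Map each character-crop embedding to the label of its nearest reference
    glyph, then join the labels in reading order."""
    labels = []
    for emb in crop_embeddings:
        nearest = max(glyph_index, key=lambda label: cosine(emb, glyph_index[label]))
        labels.append(nearest)
    return "".join(labels)

# Hypothetical reference index: one embedding per known character.
glyph_index = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}

# Hypothetical embeddings of three character crops from a scanned line.
crops = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.0, 0.2, 0.7]]
text = recognize(crops, glyph_index)
```

Because recognition here depends only on visual similarity to reference glyphs, no language model is involved in decoding, which matches the abstract's point that the model learns characters' appearance rather than how they form language.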

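Summary
Billions of public domain documents have never been digitized or lack an accurate digitization. Modern natural language processing methods cannot index, retrieve, or summarize these texts or extract information from them for statistical analysis, and the texts cannot be included in language model training. Given the diversity and sheer volume of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap, and sample-efficient. Existing OCR engines, designed primarily for small-scale commercial applications in high-resource languages, often fall short of these requirements. EffOCR (EfficientOCR) is a new open-source OCR package that meets the computational and sample efficiency requirements for liberating texts at scale. Rather than using the usual sequence-to-sequence architecture, EffOCR treats OCR as a character- or word-level image retrieval problem. It is cheap and sample-efficient to train because the model only needs to learn the visual appearance of characters, not how they are sequenced to form language. Models from the EffOCR model zoo can be deployed with a few lines of code, and customization is easy and sample-efficient thanks to a simple model training interface and minimal labeling requirements. We illustrate EffOCR's utility by evaluating it on randomly selected documents and by digitizing Japanese documents on which other OCR solutions failed.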

Improving Large Language Model Fine-tuning for Solving Math Problems

paper_url: http://arxiv.org/abs/2310.10047
repo_url: None
paper_authors: Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, Peter J. Liu
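for: This paper addresses the difficulty large language models (LLMs) have in solving math problems, in particular the large gap between their pass-at-one and pass-at-N performance.
methods: The authors investigate three fine-tuning strategies: (1) solution fine-tuning, (2) solution-cluster re-ranking, and (3) multi-task sequential fine-tuning.
results: In experiments with PaLM 2 models on the MATH dataset, the authors find that (1) the quality and style of the step-by-step solutions used for fine-tuning significantly affect model performance; (2) solution re-ranking and majority voting each improve performance on their own and yield a larger gain when combined; (3) multi-task fine-tuning that sequentially separates solution generation and evaluation outperforms the solution fine-tuning baseline.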

Abstract
Despite their success in many natural language tasks, solving math problems remains a significant challenge for large language models (LLMs). A large gap exists between LLMs' pass-at-one and pass-at-N performance in solving math problems, suggesting LLMs might be close to finding correct solutions, motivating our exploration of fine-tuning methods to unlock LLMs' performance. Using the challenging MATH dataset, we investigate three fine-tuning strategies: (1) solution fine-tuning, where we fine-tune to generate a detailed solution for a given math problem; (2) solution-cluster re-ranking, where the LLM is fine-tuned as a solution verifier/evaluator to choose among generated candidate solution clusters; (3) multi-task sequential fine-tuning, which integrates both solution generation and evaluation tasks together efficiently to enhance the LLM performance. With these methods, we present a thorough empirical study on a series of PaLM 2 models and find: (1) The quality and style of the step-by-step solutions used for fine-tuning can make a significant impact on the model performance; (2) While solution re-ranking and majority voting are both effective for improving the model performance when used separately, they can also be used together for an even greater performance boost; (3) Multi-task fine-tuning that sequentially separates the solution generation and evaluation tasks can offer improved performance compared with the solution fine-tuning baseline. Guided by these insights, we design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the few-shot performance of pre-trained PaLM 2-L model with majority voting.

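Summary
Although large language models (LLMs) succeed at many natural language tasks, solving math problems remains a major challenge for them. The large gap between pass-at-one and pass-at-N performance suggests that LLMs may be close to finding correct solutions, which motivates the exploration of fine-tuning methods. Using the challenging MATH dataset, the authors investigate three fine-tuning strategies: (1) solution fine-tuning, in which the LLM is fine-tuned to generate a detailed solution for a given problem; (2) solution-cluster re-ranking, in which the LLM is fine-tuned as a solution verifier/evaluator that chooses among clusters of generated candidate solutions; (3) multi-task sequential fine-tuning, which integrates the solution generation and evaluation tasks to improve LLM performance. Experiments on a series of PaLM 2 models show that: (1) the quality and style of the step-by-step solutions used for fine-tuning have a significant effect on model performance; (2) solution re-ranking and majority voting are each effective on their own and give an even larger boost when used together; (3) multi-task fine-tuning that sequentially separates solution generation and evaluation improves on the solution fine-tuning baseline. Guided by these findings, the authors design a fine-tuning recipe that reaches approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the few-shot performance of the pre-trained PaLM 2-L model with majority voting.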
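
The interplay between majority voting and solution-cluster re-ranking can be sketched as follows. This is an illustration under stated assumptions, not the paper's recipe: it assumes each sampled solution has been reduced to a final answer plus an evaluator score in [0, 1], that solutions are clustered by identical final answer, and that the combined method re-ranks only the `top_k` most frequent clusters by mean evaluator score. The function names, `top_k`, and the sample values are hypothetical.

```python
from collections import defaultdict

def cluster_by_answer(samples):
    """Group evaluator scores by final answer; samples are (answer, score) pairs."""
    clusters = defaultdict(list)
    for answer, score in samples:
        clusters[answer].append(score)
    return clusters

def majority_vote(samples):
    """Return the answer produced by the largest number of sampled solutions."""
    clusters = cluster_by_answer(samples)
    return max(clusters, key=lambda ans: len(clusters[ans]))

def rerank_top_clusters(samples, top_k=2):
    """Keep the top_k most frequent answer clusters, then pick the one with
    the highest mean evaluator score (the combination rule is an assumption)."""
    clusters = cluster_by_answer(samples)
    frequent = sorted(clusters, key=lambda ans: len(clusters[ans]), reverse=True)[:top_k]
    return max(frequent, key=lambda ans: sum(clusters[ans]) / len(clusters[ans]))

# Six sampled solutions for one problem (made-up answers and scores).
samples = [
    ("42", 0.20), ("42", 0.30), ("42", 0.25),
    ("41", 0.90), ("41", 0.80),
    ("7", 0.95),
]

print(majority_vote(samples))        # prints "42"
print(rerank_top_clusters(samples))  # prints "41"
```

In this toy case, voting alone picks the most common answer, while the combined rule lets the evaluator overturn it in favor of a slightly less frequent answer. The singleton cluster "7" is filtered out by frequency despite its high score, which is the kind of noise restricting re-ranking to frequent clusters is meant to suppress.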

Empirical Study of Zero-Shot NER with ChatGPT

paper_url: http://arxiv.org/abs/2310.10035
repo_url: https://github.com/emma1066/zero-shot-ner-with-chatgpt
paper_authors: Tingyu Xie, Qi Li, Jian Zhang, Yan Zhang, Zuozhu Liu, Hongwei Wang
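for: This work studies how large language models (LLMs) perform on zero-shot information extraction, focusing on ChatGPT and the named entity recognition (NER) task.
methods: Inspired by the strong reasoning ability of LLMs, the authors adapt prevalent reasoning methods to NER. They decompose NER into simpler subproblems by label (decomposed question answering), stimulate the model's intermediate thinking through syntactic prompting and tool augmentation, and adapt self-consistency to NER with a two-stage majority voting strategy.
results: The methods achieve remarkable improvements for zero-shot NER across seven benchmarks, covering Chinese and English datasets and both domain-specific and general-domain scenarios. The authors also provide an error analysis with suggestions for optimization, and verify the methods in the few-shot setting and on other LLMs.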

Abstract
Large language models (LLMs) exhibited powerful capability in various natural language processing tasks. This work focuses on exploring LLM performance on zero-shot information extraction, with a focus on the ChatGPT and named entity recognition (NER) task. Inspired by the remarkable reasoning capability of LLM on symbolic and arithmetic reasoning, we adapt the prevalent reasoning methods to NER and propose reasoning strategies tailored for NER. First, we explore a decomposed question-answering paradigm by breaking down the NER task into simpler subproblems by labels. Second, we propose syntactic augmentation to stimulate the model's intermediate thinking in two ways: syntactic prompting, which encourages the model to analyze the syntactic structure itself, and tool augmentation, which provides the model with the syntactic information generated by a parsing tool. Besides, we adapt self-consistency to NER by proposing a two-stage majority voting strategy, which first votes for the most consistent mentions, then the most consistent types. The proposed methods achieve remarkable improvements for zero-shot NER across seven benchmarks, including Chinese and English datasets, and on both domain-specific and general-domain scenarios. In addition, we present a comprehensive analysis of the error types with suggestions for optimization directions. We also verify the effectiveness of the proposed methods on the few-shot setting and other LLMs.

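Summary
Large language models (LLMs) have shown strong capability across natural language processing tasks. This work examines LLM performance on zero-shot information extraction, focusing on ChatGPT and the named entity recognition (NER) task. Inspired by the remarkable ability of LLMs in symbolic and arithmetic reasoning, the authors adapt prevalent reasoning methods to NER and propose strategies tailored to it. First, they explore a decomposed question-answering paradigm that breaks the NER task into simpler subproblems by label. Second, they propose syntactic augmentation in two forms: syntactic prompting, which encourages the model to analyze the syntactic structure itself, and tool augmentation, which supplies syntactic information generated by a parsing tool. They also adapt self-consistency to NER through a two-stage majority voting strategy that first votes for the most consistent mentions and then for the most consistent types. The proposed methods achieve remarkable improvements for zero-shot NER across seven benchmarks, including Chinese and English datasets and both domain-specific and general-domain scenarios. The authors further present a comprehensive analysis of error types with suggested optimization directions, and verify the methods in the few-shot setting and on other LLMs.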
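
The two-stage majority voting described above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: each sampled response is assumed to be already parsed into a mapping from mention to entity type, and a mention is kept only if it appears in more than `min_ratio` of the responses. The function name, the threshold, and the example data are hypothetical.

```python
from collections import Counter

def two_stage_vote(responses, min_ratio=0.5):
    """Stage 1: keep mentions that appear in more than min_ratio of responses.
    Stage 2: give each kept mention its most frequently predicted type."""
    n = len(responses)
    # Each response is a dict, so a mention counts at most once per response.
    mention_counts = Counter(mention for resp in responses for mention in resp)
    kept = [m for m, count in mention_counts.items() if count / n > min_ratio]

    result = {}
    for mention in kept:
        type_counts = Counter(resp[mention] for resp in responses if mention in resp)
        result[mention] = type_counts.most_common(1)[0][0]
    return result

# Three sampled responses for the same sentence (made-up predictions).
responses = [
    {"Paris": "LOC", "Apple": "ORG"},
    {"Paris": "LOC", "Apple": "MISC"},
    {"Paris": "ORG", "Apple": "ORG", "Monday": "DATE"},
]

print(two_stage_vote(responses))  # prints {'Paris': 'LOC', 'Apple': 'ORG'}
```

Separating the two votes means that disagreement over a mention's type does not reduce its chance of being kept as a mention; "Monday" is dropped in stage 1 because only one of the three responses proposes it.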