cs.CL - 2023-08-17

Contrasting Linguistic Patterns in Human and LLM-Generated Text

  • paper_url: http://arxiv.org/abs/2308.09067
  • repo_url: None
  • paper_authors: Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares
  • for: Contrasting human-written English news text with comparable large language model (LLM) output.
  • methods: Quantitative analysis across several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects (one such measurement is sketched after this entry).
  • results: Human and AI-generated texts differ in many measurable ways: human texts show more scattered sentence-length distributions, a distinct use of dependency and constituent types, shorter constituents, and more aggressive emotions (fear, disgust); LLM outputs use more numbers, symbols, and auxiliaries (suggesting objective language), as well as more pronouns. The sexist bias prevalent in human text is also reproduced by the LLMs.
    Abstract We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from 4 LLMs from the LLaMa family. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Among others, human texts exhibit more scattered sentence length distributions, a distinct use of dependency and constituent types, shorter constituents, and more aggressive emotions (fear, disgust) than LLM-generated texts. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs.
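
The paper's analysis code is not described in this digest; a minimal sketch of one of its measurements (how scattered the sentence-length distributions of a human corpus and an LLM corpus are) might look like the following, assuming plain-text corpora and NLTK tokenizers, which are tooling assumptions rather than the authors' pipeline.

```python
# Minimal sketch: compare sentence-length dispersion between a human-written
# and an LLM-generated corpus. NLTK tokenizers (punkt) are an assumption here;
# the paper's actual measurement pipeline is not specified in this digest.
from statistics import mean, stdev
from nltk.tokenize import sent_tokenize, word_tokenize

def sentence_lengths(text: str) -> list[int]:
    """Token count of every sentence in a raw text."""
    return [len(word_tokenize(s)) for s in sent_tokenize(text)]

def dispersion(text: str) -> tuple[float, float]:
    """Mean and standard deviation of sentence length (higher std = more scattered)."""
    lengths = sentence_lengths(text)
    return mean(lengths), stdev(lengths)

human_mean, human_std = dispersion(open("human_news.txt").read())   # hypothetical file
llm_mean, llm_std = dispersion(open("llama_output.txt").read())     # hypothetical file
print(f"human: mean={human_mean:.1f} std={human_std:.1f}")
print(f"llm:   mean={llm_mean:.1f} std={llm_std:.1f}")
```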

Don’t lose the message while paraphrasing: A study on content preserving style transfer

  • paper_url: http://arxiv.org/abs/2308.09055
  • repo_url: https://github.com/s-nlp/lewit-informal
  • paper_authors: Nikolay Babakov, David Dale, Ilya Gusev, Irina Krotova, Alexander Panchenko
  • for: Studying text style transfer in NLP, which paraphrases text into a required form (e.g., from toxic to neutral, from formal to informal, from archaic to modern English), with a focus on keeping the original content intact.
  • methods: A comparison of several style transfer models on the formality transfer domain, using a newly created parallel dataset of formal vs. informal task-oriented dialogues with predefined semantic slots that must be preserved during paraphrasing (a slot-preservation check is sketched after this entry).
  • results: Content preservation is essential but is often violated by existing style transfer models; a modification of the unsupervised LEWIS method yields a substantial improvement over the original method and all evaluated baselines on the proposed task.
    Abstract Text style transfer techniques are gaining popularity in natural language processing, allowing paraphrasing text in the required form: from toxic to neutral, from formal to informal, from old to modern English, etc. To solve the task it is not enough to generate some neutral/informal/modern text; it is important to preserve the original content unchanged. This requirement becomes even more critical in some applications such as style transfer of goal-oriented dialogues where the factual information shall be kept to preserve the original message, e.g. ordering a certain type of pizza to a certain address at a certain time. The aspect of content preservation is critical for real-world applications of style transfer studies, but it has received little attention. To bridge this gap we perform a comparison of various style transfer models on the example of the formality transfer domain. To perform a study of the content preservation abilities of various style transfer methods we create a parallel dataset of formal vs. informal task-oriented dialogues. The key difference between our dataset and the existing ones like GYAFC [17] is the presence of goal-oriented dialogues with predefined semantic slots essential to be kept during paraphrasing, e.g. named entities. This additional annotation allowed us to conduct a precise comparative study of several state-of-the-art techniques for style transfer. Another result of our study is a modification of the unsupervised method LEWIS [19] which yields a substantial improvement over the original method and all evaluated baselines on the proposed task.
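
Because the dataset's distinguishing feature is predefined semantic slots (e.g., named entities) that must survive paraphrasing, a simple content-preservation check can be sketched as below. The slot representation and string matching here are illustrative assumptions, not the paper's evaluation code.

```python
# Hypothetical slot-preservation check: the fraction of annotated slot values
# (pizza type, address, time, ...) that survive the style transfer verbatim.
def slot_preservation(original_slots: list[str], transferred: str) -> float:
    if not original_slots:
        return 1.0
    kept = sum(1 for slot in original_slots if slot.lower() in transferred.lower())
    return kept / len(original_slots)

slots = ["large pepperoni pizza", "5 Main St", "7 pm"]
formal = "Could you please deliver a large pepperoni pizza to 5 Main St by 7 pm?"
print(slot_preservation(slots, formal))                              # 1.0, every slot kept
print(slot_preservation(slots, "Could you deliver a pizza, please?"))  # 0.0, slots lost
```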

Reinforced Self-Training (ReST) for Language Modeling

  • paper_url: http://arxiv.org/abs/2308.08998
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas
  • for: Improving the output quality of large language models (LLMs) by aligning them with human preferences.
  • methods: Reinforced Self-Training (ReST), inspired by growing-batch reinforcement learning: a dataset is generated by sampling from the current LLM policy and then used to improve the policy with offline RL algorithms, allowing data reuse (the outer loop is sketched after this entry).
  • results: Substantially improves translation quality in a compute- and sample-efficient manner, as measured by automated metrics and human evaluation on machine translation benchmarks.
    Abstract Reinforcement learning from human feedback (RLHF) can improve the quality of large language model's (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
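
As described in the abstract, ReST alternates an offline data-generation step with policy-improvement steps. The sketch below is a hedged reading of that loop; `policy`, `reward_fn`, and `finetune_offline` are hypothetical interfaces, and the thresholding schedule is an assumption rather than the paper's exact recipe.

```python
# Hedged sketch of the ReST outer loop (Grow, then repeated Improve), not the
# authors' implementation. `policy`, `reward_fn`, and `finetune_offline` are
# hypothetical interfaces standing in for an LLM, a reward model, and offline RL.
def rest(policy, prompts, reward_fn, finetune_offline,
         grow_steps=3, improve_steps=4, samples_per_prompt=8):
    for _ in range(grow_steps):
        # Grow: build an offline dataset by sampling the current policy.
        dataset = [(p, y, reward_fn(p, y))
                   for p in prompts
                   for y in policy.sample(p, n=samples_per_prompt)]
        threshold = 0.0
        for _ in range(improve_steps):
            # Improve: keep samples whose reward clears a (rising) threshold and
            # fine-tune the policy offline on the filtered data, reusing the dataset.
            filtered = [(p, y) for p, y, r in dataset if r >= threshold]
            policy = finetune_offline(policy, filtered)
            threshold += 0.2  # assumed schedule; raises the quality bar each step
    return policy
```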

Evaluation of really good grammatical error correction

  • paper_url: http://arxiv.org/abs/2308.08982
  • repo_url: https://github.com/robertostling/gec-evaluation
  • paper_authors: Robert Östling, Katarina Gillholm, Murathan Kurfalı, Marie Mattson, Mats Wirén
  • for: Evaluating how well different grammatical error correction (GEC) systems perform, and how well existing evaluation methods capture that performance.
  • methods: Established automatic evaluation metrics combined with human judgments, including human post-editing of GEC system outputs, applied to a recently published dataset of Swedish learner texts; the evaluated systems include a few-shot-prompted large language model (GPT-3), for which a hypothetical prompt is sketched after this entry.
  • results: GPT-3 in a few-shot setting far outperforms previous GEC systems for Swedish, a language comprising only 0.11% of its training data; current automatic evaluation methods contain undesirable biases that human evaluation is able to reveal.
    Abstract Although rarely stated, in practice, Grammatical Error Correction (GEC) encompasses various models with distinct objectives, ranging from grammatical error detection to improving fluency. Traditional evaluation methods fail to capture the full range of system capabilities and objectives. Reference-based evaluations suffer from limitations in capturing the wide variety of possible corrections and the biases introduced during reference creation, and are prone to favor fixing local errors over overall text improvement. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only 0.11% of its training data. We also found that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.
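
A minimal sketch of few-shot GEC prompting of the kind evaluated in the paper follows. The prompt wording, the Swedish examples, and the chat-completions call are assumptions for illustration; the authors' actual GPT-3 setup is not reproduced here.

```python
# Hypothetical few-shot GEC prompt for Swedish learner text; the examples and
# wording are illustrative, not the paper's prompt. Uses an OpenAI-style client.
from openai import OpenAI

FEW_SHOT = [
    ("jag gå till skolan igår", "Jag gick till skolan igår."),
    ("hon har många vänner som bor i stockholm", "Hon har många vänner som bor i Stockholm."),
]

def build_prompt(source: str) -> str:
    lines = ["Correct the grammar of the following Swedish sentences."]
    for src, tgt in FEW_SHOT:
        lines += [f"Input: {src}", f"Corrected: {tgt}"]
    lines += [f"Input: {source}", "Corrected:"]
    return "\n".join(lines)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # stand-in model; the paper prompted GPT-3
    messages=[{"role": "user", "content": build_prompt("dom gick hem för att det regnade")}],
)
print(response.choices[0].message.content.strip())
```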

Beam Retrieval: General End-to-End Retrieval for Multi-Hop Question Answering

  • paper_url: http://arxiv.org/abs/2308.08973
  • repo_url: https://github.com/canghongjian/beam_retriever
  • paper_authors: Jiahao Zhang, Haiyang Zhang, Dongmei Zhang, Yong Liu, Shen Huang
  • for: A retrieval framework for multi-hop question answering (QA), which requires finding multiple relevant passages and reasoning over them step by step to answer complex questions.
  • methods: Beam Retrieval, a general end-to-end retrieval framework that maintains multiple partial hypotheses of relevant passages at each hop, expanding the search space and reducing the risk of missing relevant passages; an encoder and two classification heads are jointly optimized by minimizing the combined loss across all hops (the beam search is sketched after this entry).
  • results: Achieves a nearly 50% improvement over baselines on the challenging MuSiQue-Ans benchmark and surpasses all previous retrievers on HotpotQA and 2WikiMultiHopQA; by providing high-quality context, it helps a supervised reader reach new state-of-the-art performance and substantially improves (by up to 28.8 points) the QA performance of zero-shot GPT-3.5.
    Abstract Multi-hop QA involves finding multiple relevant passages and step-by-step reasoning to answer complex questions. While previous approaches have developed retrieval modules for selecting relevant passages, they face challenges in scenarios beyond two hops, owing to the limited performance of one-step methods and the failure of two-step methods when selecting irrelevant passages in earlier stages. In this work, we introduce Beam Retrieval, a general end-to-end retrieval framework for multi-hop QA. This approach maintains multiple partial hypotheses of relevant passages at each step, expanding the search space and reducing the risk of missing relevant passages. Moreover, Beam Retrieval jointly optimizes an encoder and two classification heads by minimizing the combined loss across all hops. To establish a complete QA system, we incorporate a supervised reader or a zero-shot GPT-3.5. Experimental results demonstrate that Beam Retrieval achieves a nearly 50% improvement compared with baselines on challenging MuSiQue-Ans, and it also surpasses all previous retrievers on HotpotQA and 2WikiMultiHopQA. Providing high-quality context, Beam Retrieval helps our supervised reader achieve new state-of-the-art performance and substantially improves (up to 28.8 points) the QA performance of zero-shot GPT-3.5.
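
The central mechanism is a beam search over chains of passages instead of committing to a single passage at each hop. A minimal sketch is below; the `score(question, chain_passages, passage)` function is a hypothetical stand-in for the paper's jointly trained encoder and classification heads.

```python
# Hedged sketch of beam search over passage chains for multi-hop retrieval.
# `score` is a hypothetical scorer standing in for the trained encoder + heads.
import heapq

def beam_retrieve(question, passages, score, hops=3, beam_size=4):
    beams = [((), 0.0)]  # (chain of passage indices, cumulative score)
    for _ in range(hops):
        candidates = []
        for chain, chain_score in beams:
            chain_passages = [passages[j] for j in chain]
            for i, passage in enumerate(passages):
                if i in chain:
                    continue
                candidates.append((chain + (i,),
                                   chain_score + score(question, chain_passages, passage)))
        # Keep the top-k partial hypotheses instead of committing to one passage per hop.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
    return beams[0][0]  # indices of the best passage chain
```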

Factuality Detection using Machine Translation – a Use Case for German Clinical Text

  • paper_url: http://arxiv.org/abs/2308.08827
  • repo_url: None
  • paper_authors: Mohammed Bin Sumait, Aleksandra Gabryszak, Leonhard Hennig, Roland Roller
  • for: Detecting factuality in clinical text, i.e., whether particular symptoms are explicitly absent, possibly present, not mentioned, or affirmed.
  • methods: Machine-translate English factuality data into German and use the translated data to train a transformer-based factuality detection model, sidestepping the difficulty of sharing sensitive clinical data (a translation pipeline is sketched after this entry).
  • results: The approach can accurately detect factuality in German clinical text.
    Abstract Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed. In most cases, a sufficient number of examples is necessary to handle such phenomena in a supervised machine learning setting. However, as clinical text might contain sensitive information, data cannot be easily shared. In the context of factuality detection, this work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.
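
A minimal sketch of the translate-then-train step is shown below, using the off-the-shelf Helsinki-NLP MarianMT English-to-German model as one possible translator; the model choice, the example data, and the label set are assumptions, not the paper's actual setup.

```python
# Hedged sketch: translate English factuality training data into German.
# Helsinki-NLP/opus-mt-en-de is one possible translator, not necessarily the
# system used in the paper. The translated (text, label) pairs would then be
# used to fine-tune a German transformer classifier for factuality detection.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
mt_model = MarianMTModel.from_pretrained(model_name)

def translate(sentences: list[str]) -> list[str]:
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = mt_model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

english_examples = [
    ("The patient denies chest pain.", "negated"),
    ("Fever is possibly present.", "possible"),
]
german_texts = translate([text for text, _ in english_examples])
german_examples = list(zip(german_texts, [label for _, label in english_examples]))
print(german_examples)
```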

Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

  • paper_url: http://arxiv.org/abs/2308.08807
  • repo_url: None
  • paper_authors: Jivnesh Sandhan
  • for: Making Sanskrit manuscripts more accessible to end users through natural language technologies.
  • methods: Linguistically-informed neural architectures for four fundamental Sanskrit NLP tasks (word segmentation, dependency parsing, compound type identification, and poetry analysis), addressing the language's morphological richness, free word order, and low-resource nature.
  • results: The proposed systems report state-of-the-art performance, support interpretability and multilingual extension, and are released together with SanskritShala, a web-based neural toolkit that provides real-time analysis of input for various NLP tasks.
    Abstract The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and web-based toolkit.

Chinese Spelling Correction as Rephrasing Language Model

  • paper_url: http://arxiv.org/abs/2308.08796
  • repo_url: https://github.com/gingasan/lemon
  • paper_authors: Linfeng Liu, Hongqiu Wu, Hai Zhao
  • for: Improving the accuracy and transferability of Chinese Spelling Correction (CSC), which detects and corrects potential spelling errors in a given sentence.
  • methods: A new training paradigm, Rephrasing Language Model (ReLM), which trains the model to rephrase the entire sentence by infilling additional slots instead of character-to-character tagging (a training-pair sketch follows this entry).
  • results: ReLM achieves new state-of-the-art results on fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin, and learns transferable language representations when CSC is jointly trained with other tasks.
    Abstract This paper studies Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence. Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we note a critical flaw in the process of tagging one character to another, that the correction is excessively conditioned on the error. This is opposite from human mindset, where individuals rephrase the complete sentence based on its semantics, rather than solely on the error patterns memorized before. Such a counter-intuitive learning process results in the bottleneck of generalizability and transferability of machine spelling correction. To address this, we propose Rephrasing Language Modeling (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging. This novel training paradigm achieves the new state-of-the-art results across fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin. Our method also learns transferable language representation when CSC is jointly trained with other tasks.
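
The abstract describes training the model to rephrase the whole sentence by infilling additional slots rather than tagging characters. The sketch below shows one plausible way such training pairs could be constructed; the slot and separator conventions are assumptions for illustration, not the paper's data format.

```python
# Hypothetical construction of rephrasing-style training pairs for CSC.
# The input concatenates the possibly-misspelled sentence with one [MASK] slot
# per target character; the label fills those slots with the corrected sentence,
# so the model rephrases instead of tagging character-to-character.
MASK, SEP = "[MASK]", "[SEP]"

def build_rephrasing_pair(source: str, target: str) -> tuple[str, str]:
    slots = " ".join([MASK] * len(target))
    model_input = f"{source} {SEP} {slots}"
    model_label = f"{source} {SEP} " + " ".join(target)
    return model_input, model_label

src = "我喜欢吃平果"   # "平" is a homophone error for "苹"
tgt = "我喜欢吃苹果"
x, y = build_rephrasing_pair(src, tgt)
print(x)  # 我喜欢吃平果 [SEP] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
print(y)  # 我喜欢吃平果 [SEP] 我 喜 欢 吃 苹 果
```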

Task Relation Distillation and Prototypical Pseudo Label for Incremental Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2308.08793
  • repo_url: https://github.com/bladedancer957/iner_rdp
  • paper_authors: Duzhen Zhang, Hongliu Li, Wei Cong, Rongtao Xu, Jiahua Dong, Xiuyi Chen
  • for: Addressing catastrophic forgetting in Incremental Named Entity Recognition (INER), which is further aggravated by background shift (old and future entity types are labeled as the non-entity type in the current task).
  • methods: Task Relation Distillation and Prototypical pseudo label (RDP): a task relation distillation scheme tackles catastrophic forgetting, while a prototypical pseudo-label strategy distinguishes old entity types from the current non-entity type using the old model (sketched after this entry).
  • results: Across ten INER settings on three benchmark datasets, the method improves over previous state-of-the-art methods by an average of 6.08% in Micro F1 score and 7.71% in Macro F1 score.
    Abstract Incremental Named Entity Recognition (INER) involves the sequential learning of new entity types without accessing the training data of previously learned types. However, INER faces the challenge of catastrophic forgetting specific for incremental learning, further aggravated by background shift (i.e., old and future entity types are labeled as the non-entity type in the current task). To address these challenges, we propose a method called task Relation Distillation and Prototypical pseudo label (RDP) for INER. Specifically, to tackle catastrophic forgetting, we introduce a task relation distillation scheme that serves two purposes: 1) ensuring inter-task semantic consistency across different incremental learning tasks by minimizing inter-task relation distillation loss, and 2) enhancing the model's prediction confidence by minimizing intra-task self-entropy loss. Simultaneously, to mitigate background shift, we develop a prototypical pseudo label strategy that distinguishes old entity types from the current non-entity type using the old model. This strategy generates high-quality pseudo labels by measuring the distances between token embeddings and type-wise prototypes. We conducted extensive experiments on ten INER settings of three benchmark datasets (i.e., CoNLL2003, I2B2, and OntoNotes5). The results demonstrate that our method achieves significant improvements over the previous state-of-the-art methods, with an average increase of 6.08% in Micro F1 score and 7.71% in Macro F1 score.
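
The prototypical pseudo-label step, as described, measures distances between token embeddings and type-wise prototypes computed with the old model. A minimal sketch follows; the distance threshold and mean-pooled prototypes are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of prototypical pseudo labeling for INER. Tokens labeled "O"
# in the current task are re-labeled with an old entity type when their
# old-model embedding lies close enough to that type's prototype.
import torch

def type_prototypes(old_embeddings: torch.Tensor, old_labels: list[str]) -> dict[str, torch.Tensor]:
    """Mean old-model embedding per old entity type (type-wise prototypes)."""
    protos = {}
    for t in set(old_labels) - {"O"}:
        idx = [i for i, label in enumerate(old_labels) if label == t]
        protos[t] = old_embeddings[idx].mean(dim=0)
    return protos

def pseudo_label(token_emb: torch.Tensor, current_label: str,
                 protos: dict[str, torch.Tensor], threshold: float = 5.0) -> str:
    if current_label != "O":
        return current_label  # keep labels of the entity types in the current task
    dists = {t: torch.dist(token_emb, p).item() for t, p in protos.items()}
    best_type, best_dist = min(dists.items(), key=lambda kv: kv[1])
    return best_type if best_dist < threshold else "O"  # assumed thresholding rule
```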

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

  • paper_url: http://arxiv.org/abs/2308.08747
  • repo_url: https://github.com/luoxiaoheics/continual-tune
  • paper_authors: Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang
  • for: Investigating whether catastrophic forgetting (CF) occurs in large language models (LLMs) during continual fine-tuning.
  • methods: Empirical evaluation of knowledge forgetting from the perspectives of domain knowledge, reasoning, and reading comprehension.
  • results: CF is generally observed in LLMs ranging from 1B to 7B parameters, and its severity intensifies as scale increases; the decoder-only BLOOMZ forgets less and retains more knowledge than the encoder-decoder mT0; LLMs can also mitigate language bias (e.g., gender bias) during continual fine-tuning; ALPACA retains more knowledge and capacity than LLAMA, suggesting that general instruction tuning helps mitigate forgetting during further fine-tuning.
    Abstract Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information as it learns new information. As large language models (LLMs) have shown excellent performance, it is interesting to uncover whether CF exists in the continual fine-tuning of LLMs. In this study, we empirically evaluate the forgetting phenomenon in LLMs' knowledge, from the perspectives of domain knowledge, reasoning, and reading comprehension. The experiments demonstrate that catastrophic forgetting is generally observed in LLMs ranging from 1b to 7b. Furthermore, as the scale increases, the severity of forgetting also intensifies. Comparing the decoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ suffers less forgetting and maintains more knowledge. We also observe that LLMs can mitigate language bias (e.g. gender bias) during continual fine-tuning. Moreover, we find that ALPACA can maintain more knowledge and capacity compared with LLAMA during the continual fine-tuning, which implies that general instruction tuning can help mitigate the forgetting phenomenon of LLMs in the further fine-tuning process.

Enhancing Phrase Representation by Information Bottleneck Guided Text Diffusion Process for Keyphrase Extraction

  • paper_url: http://arxiv.org/abs/2308.08739
  • repo_url: None
  • paper_authors: Yuanzhen Luo, Qingyu Zhou, Feng Zhou
  • for: Proposing a supervised text diffusion process guided by a Variational Information Bottleneck (VIB) to improve keyphrase extraction (KPE).
  • methods: Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and injects them into each phrase representation; a ranking network and the VIB are then optimized jointly with a ranking loss and a classification loss, so that each candidate phrase is ranked using information from both the keyphrases and the document.
  • results: Diff-KPE outperforms existing KPE methods on a large open-domain keyphrase extraction benchmark (OpenKP) and a scientific-domain dataset (KP20K).
    Abstract Keyphrase extraction (KPE) is an important task in Natural Language Processing for many scenarios, which aims to extract keyphrases that are present in a given document. Many existing supervised methods treat KPE as sequential labeling, span-level classification, or generative tasks. However, these methods lack the ability to utilize keyphrase information, which may result in biased results. In this study, we propose Diff-KPE, which leverages the supervised Variational Information Bottleneck (VIB) to guide the text diffusion process for generating enhanced keyphrase representations. Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and then injects the generated keyphrase embeddings into each phrase representation. A ranking network and VIB are then optimized together with rank loss and classification loss, respectively. This design of Diff-KPE allows us to rank each candidate phrase by utilizing both the information of keyphrases and the document. Experiments show that Diff-KPE outperforms existing KPE methods on a large open domain keyphrase extraction benchmark, OpenKP, and a scientific domain dataset, KP20K.

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.08713
  • repo_url: https://github.com/95anantsingh/decoding-emotions
  • paper_authors: Anant Singh, Akshat Gupta
  • for: Benchmarking speech representation models for speech emotion recognition (SER) across multiple languages and examining their internal representations.
  • methods: A comparison of eight speech representation models across six languages, with probing experiments to gain insight into the models' inner workings for SER (single-layer feature extraction is sketched after this entry).
  • results: Using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets compared with using features from all layers, and yields state-of-the-art results for German and Persian; the probing results indicate that the middle layers of speech models capture the most important emotional information.
    Abstract Recent advancements in transformer-based speech representation models have greatly transformed speech processing. However, there has been limited research conducted on evaluating these models for speech emotion recognition (SER) across multiple languages and examining their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conducted probing experiments to gain insights into inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32\% on average across seven datasets when compared to systems where features from all layers of speech models are used. We also achieve state-of-the-art results for German and Persian languages. Our probing results indicate that the middle layers of speech models capture the most important emotional information for speech emotion recognition.
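
A minimal sketch of single-layer feature extraction of the kind probed in the paper is below, using wav2vec 2.0 from Hugging Face as one example speech representation model; the specific model and layer index are assumptions, not the paper's full benchmark setup.

```python
# Hedged sketch: mean-pooled features from a single hidden layer of a speech
# model, to be fed to a downstream emotion classifier. wav2vec 2.0 is just one
# example model; in practice the best layer would be selected per dataset.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
speech_model = Wav2Vec2Model.from_pretrained(model_name, output_hidden_states=True)

def layer_features(waveform: torch.Tensor, layer: int) -> torch.Tensor:
    """Mean-pool the hidden states of one chosen layer into a fixed-size vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden_states = speech_model(**inputs).hidden_states  # tuple of (1, time, dim)
    return hidden_states[layer].mean(dim=1).squeeze(0)

# probe a middle layer, which the paper finds most informative for emotion
features = layer_features(torch.randn(16_000), layer=6)
print(features.shape)  # torch.Size([768])
```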

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

  • paper_url: http://arxiv.org/abs/2308.09723
  • repo_url: None
  • paper_authors: Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla
  • for: Making large language models (LLMs) more practical to deploy: they require substantial memory, and the latest generative models suffer from a memory-bandwidth bottleneck during auto-regressive decoding.
  • methods: An efficient weight-only quantization method that reduces memory consumption and accelerates inference without additional fine-tuning, plus a simple and effective heuristic that adaptively chooses the quantization granularity using only the weights of the pre-trained model; efficient GPU GEMMs multiply fp16/bf16 activations with int8 or int4 weights using on-the-fly dequantization (a group-wise quantization sketch follows this entry).
  • results: On large-scale open-source models such as OPT-175B and internal MoE models, the method incurs minimal accuracy loss while achieving up to 3.65x higher throughput on the same number of GPUs.
    Abstract Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM quantization. Subsequently, we present our heuristic approach, which adaptively finds the granularity of quantization, effectively addressing these problems. Furthermore, we implement highly efficient GPU GEMMs that perform on-the-fly matrix multiplication and dequantization, supporting the multiplication of fp16 or bf16 activations with int8 or int4 weights. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput on the same number of GPUs.
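
A minimal sketch of fine-grained (group-wise) weight-only int8 quantization is shown below; the group size, symmetric scaling, and int8 target are illustrative assumptions rather than the paper's exact scheme, which also covers int4 and adapts the granularity per weight matrix.

```python
# Hedged sketch of group-wise, weight-only int8 quantization. Each group of
# `group_size` consecutive weights in a row gets its own fp16 scale (the
# "fine-grained" part); activations stay in fp16/bf16 and the weights are
# dequantized on the fly inside the matmul at inference time.
import torch

def quantize_weight(w: torch.Tensor, group_size: int = 128):
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(groups / scales), -127, 127).to(torch.int8)
    return q, scales.to(torch.float16)

def dequantize_weight(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales.float()).reshape(q.shape[0], -1)

w = torch.randn(4096, 4096)
q, s = quantize_weight(w)
reconstruction_error = (dequantize_weight(q, s) - w).abs().mean()
print(f"mean abs reconstruction error: {reconstruction_error:.4f}")
```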

Large Language Models for Granularized Barrett’s Esophagus Diagnosis Classification

  • paper_url: http://arxiv.org/abs/2308.08660
  • repo_url: None
  • paper_authors: Jenna Kefeli, Ali Soroush, Courtney J. Diamond, Haley M. Zylberberg, Benjamin May, Julian A. Abrams, Chunhua Weng, Nicholas Tatonetti
  • for: Improving the granularity and precision of diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, which are insufficient for many research and clinical use cases.
  • methods: A generalizable transformer-based method that automatically extracts key diagnostic phenotypes from BE pathology reports, built on two clinically pre-trained large language models and compared against manual chart review and a tailored rule-based system.
  • results: Using pathology reports from Columbia University Irving Medical Center with gastroenterologist-annotated targets, binary dysplasia classification reaches a 0.964 F1-score and granularized multi-class BE-related diagnosis classification reaches 0.911, with the best model performance comparable to a highly tailored rule-based system while being faster to implement and more generalizable.
    Abstract Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, lack granularity and precision for many research or clinical use cases. Laborious manual chart review is required to extract key diagnostic phenotypes from BE pathology reports. We developed a generalizable transformer-based method to automate data extraction. Using pathology reports from Columbia University Irving Medical Center with gastroenterologist-annotated targets, we performed binary dysplasia classification as well as granularized multi-class BE-related diagnosis classification. We utilized two clinically pre-trained large language models, with best model performance comparable to a highly tailored rule-based system developed using the same data. Binary dysplasia extraction achieves 0.964 F1-score, while the multi-class model achieves 0.911 F1-score. Our method is generalizable and faster to implement as compared to a tailored rule-based approach.

Learning the meanings of function words from grounded language using a visual question answering model

  • paper_url: http://arxiv.org/abs/2308.08628
  • repo_url: https://github.com/evaportelance/vqa-function-word-learning
  • paper_authors: Eva Portelance, Michael C. Frank, Dan Jurafsky
  • for: Investigating how the meanings of function words (e.g., "or", "behind", "more") can be learned, by studying what visual question answering models learn about them.
  • methods: Recurrent neural visual question answering models trained on visually grounded language, probed for what they learn about function words.
  • results: The models learn gradient semantics for function words requiring spatial and numerical reasoning, acquire the meanings of the logical connectives "and" and "or" without any prior knowledge of logical reasoning, show early evidence of reasoning about alternative expressions when interpreting language, and learn words more easily the more frequent they are in the input.
    Abstract Interpreting a seemingly-simple function word like "or", "behind", or "more" can require logical, numerical, and relational reasoning. How are such words learned by children? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural-network based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learnt by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spacial and numerical reasoning. Furthermore, we find that these models can learn the meanings of logical connectives "and" and "or" without any prior knowledge of logical reasoning, as well as early evidence that they can develop the ability to reason about alternative expressions when interpreting language. Finally, we show that word learning difficulty is dependent on frequency in models' input. Our findings offer evidence that it is possible to learn the meanings of function words in visually grounded context by using non-symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning.

BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2308.08625
  • repo_url: https://github.com/rttl-ai/bioptimus
  • paper_authors: Pavlova Vera, Mohammed Makhlouf
  • for: Investigating different pre-training methods for biomedical language models and comparing their performance on Named Entity Recognition (NER) tasks.
  • methods: Pre-training the biomedical language model from scratch, pre-training it in a continued fashion, and a curriculum learning approach with contextualized weight distillation for initializing the embeddings of new tokens (sketched after this entry).
  • results: A new biomedical language model (BIOptimus) that sets new states of the art on several biomedical NER tasks, together with an analysis of how masking rate, corruption strategy, and masking strategies affect performance.
    Abstract Using language models (LMs) pre-trained in a self-supervised setting on large corpora and then fine-tuning for a downstream task has helped to deal with the problem of limited label data for supervised learning tasks such as Named Entity Recognition (NER). Recent research in biomedical language processing has offered a number of biomedical LMs pre-trained using different methods and techniques that advance results on many BioNLP tasks, including NER. However, there is still a lack of a comprehensive comparison of pre-training approaches that would work more optimally in the biomedical domain. This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion. We compare existing methods with our proposed pre-training method of initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found. The method helps to speed up the pre-training stage and improve performance on NER. In addition, we compare how masking rate, corruption strategy, and masking strategies impact the performance of the biomedical LM. Finally, using the insights from our experiments, we introduce a new biomedical LM (BIOptimus), which is pre-trained using Curriculum Learning (CL) and contextualized weight distillation method. Our model sets new states of the art on several biomedical Named Entity Recognition (NER) tasks. We release our code and all pre-trained models
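
The abstract describes initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found. The sketch below is one plausible reading, averaging BERT's contextual representations of a new biomedical term over contexts in which it occurs; it is an interpretation for illustration, not the authors' exact procedure.

```python
# Hedged sketch: initialize a new vocabulary token's input embedding from BERT's
# contextual representations of the term in sampled contexts. This is one
# plausible reading of "contextualized weight distillation", not the paper's code.
import torch
from transformers import BertModel, BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def contextual_init(term: str, contexts: list[str]) -> torch.Tensor:
    term_ids = tok(term, add_special_tokens=False)["input_ids"]
    vecs = []
    for ctx in contexts:
        enc = tok(ctx, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]          # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        # average the subword positions that spell out the term in this context
        for i in range(len(ids) - len(term_ids) + 1):
            if ids[i:i + len(term_ids)] == term_ids:
                vecs.append(hidden[i:i + len(term_ids)].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)  # candidate embedding for the new token

emb = contextual_init("nephropathy", ["Diabetic nephropathy is a kidney complication."])
print(emb.shape)  # torch.Size([768])
```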

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.08493
  • repo_url: None
  • paper_authors: Shahriar Golchin, Mihai Surdeanu
  • for: Detecting data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs).
  • methods: A simple yet effective method that first identifies potential contamination in individual instances drawn from a small random sample, using a "guided instruction" prompt (dataset name, partition type, and the initial segment of a reference instance that the LLM is asked to complete), and then assesses whether an entire dataset partition is contaminated (the instance-level check is sketched after this entry).
  • results: The best method achieves an accuracy between 92% and 100% in detecting whether an LLM is contaminated with seven datasets (containing train and test/validation partitions), when contrasted with manual evaluation by a human expert; the findings also indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
    Abstract Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in understanding LLMs' effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances that are drawn from a small random sample; using this information, our approach then assesses if an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or closely matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better with the guided instruction vs. a general instruction that does not include the dataset and partition name. The second idea marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting marks multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human expert. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.
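
A minimal sketch of the instance-level check, a guided-instruction prompt followed by an overlap score against the reference completion, is shown below. The prompt wording, the use of ROUGE-L from the rouge-score package, the flagging threshold, and the OpenAI-style client are assumptions standing in for the paper's exact configuration (which also uses BLEURT and a GPT-4 classifier).

```python
# Hedged sketch of instance-level contamination detection via guided instruction.
# The prompt wording, the 0.75 flagging threshold, and ROUGE-L as the overlap
# metric are illustrative assumptions, not the paper's exact configuration.
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def guided_completion(dataset: str, split: str, first_piece: str, model: str = "gpt-4") -> str:
    prompt = (f"Instruction: You are provided with the first piece of an instance from "
              f"the {split} split of the {dataset} dataset. Finish the second piece of "
              f"the instance as exactly appeared in the dataset.\n"
              f"First piece: {first_piece}\nSecond piece:")
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content.strip()

def is_contaminated(dataset: str, split: str, first_piece: str, second_piece: str) -> bool:
    completion = guided_completion(dataset, split, first_piece)
    overlap = scorer.score(second_piece, completion)["rougeL"].fmeasure
    return overlap > 0.75  # flag near-exact reproductions of the reference segment
```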