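results: LLMs can respond appropriately in certain situations, but they do not fully align with human emotional behavior and cannot establish connections between similar situations.
Abstract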
Recently, the community has witnessed the advancement of Large Language Models (LLMs), which have shown remarkable performance on various downstream tasks. Led by powerful models like ChatGPT and Claude, LLMs are revolutionizing how users engage with software, serving not as mere tools but as intelligent assistants. Consequently, evaluating LLMs' anthropomorphic capabilities becomes increasingly important in contemporary discourse. Utilizing the emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, our evaluation includes five LLMs, covering both commercial and open-source models, including variations in model sizes, featuring the latest iterations, such as GPT-4 and LLaMA 2. A conclusion can be drawn from the results that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short in alignment with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, dubbed EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench. We aspire to contribute to the advancement of LLMs regarding better alignment with the emotional behaviors of human beings, thereby enhancing their utility and applicability as intelligent assistants.
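Summary
The community has recently witnessed the progress of large language models (LLMs), which perform strongly across downstream tasks. Led by powerful models such as ChatGPT and Claude, LLMs are no longer mere software tools but intelligent assistants, so evaluating their anthropomorphic capabilities is increasingly important. Drawing on emotion appraisal theory from psychology, we propose to evaluate the empathy of LLMs, that is, how their feelings change in specific situations. After a careful and comprehensive survey, we collect a dataset of more than 400 situations shown to elicit the eight emotions central to the study. We group the situations into 36 factors and run a human evaluation with more than 1,200 subjects worldwide. Using the human results as a reference, we evaluate five LLMs covering commercial and open-source models, different model sizes, and recent iterations such as GPT-4 and LLaMA 2. The results show that although LLMs can respond appropriately in certain situations, they fall short of alignment with human emotional behavior and cannot connect similar situations. The situation dataset, the human evaluation results, and the code of the testing framework, EmotionBench, are public at https://github.com/CUHK-ARISE/EmotionBench. We hope to advance the alignment of LLMs with human emotional behavior and thereby improve their usefulness as intelligent assistants.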
KITLM: Domain-Specific Knowledge InTegration into Language Models for Question Answering
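results: KITLM outperforms both SKILL and GPT-3.5-turbo, achieving over 1.5 times improvement in exact match scores on MetaQA, with a similar performance boost on AeroQA in the aviation domain.
Abstract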
Large language models (LLMs) have demonstrated remarkable performance in a wide range of natural language tasks. However, as these models continue to grow in size, they face significant challenges in terms of computational costs. Additionally, LLMs often lack efficient domain-specific understanding, which is particularly crucial in specialized fields such as aviation and healthcare. To boost domain-specific understanding, we propose KITLM, a novel approach that integrates a knowledge base into a language model through relevant information infusion. By integrating pertinent knowledge, not only is the performance of the language model greatly enhanced, but the model size requirement is also significantly reduced while achieving comparable performance. Our proposed knowledge-infused model surpasses the performance of both GPT-3.5-turbo and the state-of-the-art knowledge infusion method, SKILL, achieving over 1.5 times improvement in exact match scores on MetaQA. KITLM showed a similar performance boost in the aviation domain with AeroQA. The drastic performance improvement of KITLM over the existing methods can be attributed to the infusion of relevant knowledge while mitigating noise. In addition, we release two curated datasets to accelerate knowledge infusion research in specialized fields: a) AeroQA, a new benchmark dataset designed for multi-hop question-answering within the aviation domain, and b) Aviation Corpus, a dataset constructed from unstructured text extracted from the National Transportation Safety Board reports. Our research contributes to advancing the field of domain-specific language understanding and showcases the potential of knowledge infusion techniques in improving the performance of language models on question-answering.
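Summary
Large language models (LLMs) perform well across natural language tasks, but their computational cost rises as they grow. LLMs also often lack efficient domain-specific understanding, which matters most in specialized fields such as aviation and healthcare. To improve domain-specific understanding, we propose KITLM, an approach that integrates a knowledge base into a language model. Integrating relevant knowledge both improves the model's performance and reduces the model size needed to reach comparable performance. On MetaQA, our knowledge-infused model exceeds GPT-3.5-turbo and the state-of-the-art knowledge infusion method SKILL, with over 1.5 times improvement in exact match scores. KITLM shows a similar boost on AeroQA in the aviation domain. We release two curated datasets to support research on domain-specific understanding: a) AeroQA, a new multi-hop question-answering benchmark, and b) Aviation Corpus, built from unstructured text extracted from National Transportation Safety Board reports. Our work advances domain-specific language understanding and shows the potential of knowledge infusion for question answering.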
Negative Lexical Constraints in Neural Machine Translation
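results: The authors propose mitigating the issue by training with stemmed negative constraints, which counters the model's ability to produce varied surface forms of a word that bypass the constraint. The method improves constraining, although the problem persists in many cases.
Abstract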
This paper explores negative lexical constraining in English to Czech neural machine translation. Negative lexical constraining is used to prohibit certain words or expressions in the translation produced by the neural translation model. We compared various methods based on modifying either the decoding process or the training data. The comparison was performed on two tasks: paraphrasing and feedback-based translation refinement. We also studied to what extent these methods "evade" the constraints presented to the model (usually in the dictionary form) by generating a different surface form of a given constraint. We propose a way to mitigate the issue through training with stemmed negative constraints to counter the model's ability to induce a variety of the surface forms of a word that can result in bypassing the constraint. We demonstrate that our method improves the constraining, although the problem still persists in many cases.
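Summary
This paper studies negative lexical constraints in English-to-Czech neural machine translation. Negative lexical constraints prohibit certain words or expressions in the output of the translation model. We compare methods that modify either the decoding process or the training data, on two tasks: paraphrasing and feedback-based translation refinement. We also study how far these methods "evade" the constraints given to the model (usually in dictionary form) by generating a different surface form. We propose mitigating the issue by training with stemmed negative constraints, which counters the model's ability to vary a word's surface form. Our method improves constraining, although the problem persists in many cases.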
WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset
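results: The results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models, which also show promising results on a number of downstream tasks.
Abstract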
A fundamental challenge in the current NLP context, dominated by language models, comes from the inflexibility of current architectures to 'learn' new information. While model-centric solutions like continual learning or parameter-efficient fine tuning are available, the question still remains of how to reliably identify changes in language or in the world. In this paper, we propose WikiTiDe, a dataset derived from pairs of timestamped definitions extracted from Wikipedia. We argue that such resource can be helpful for accelerating diachronic NLP, specifically, for training models able to scan knowledge resources for core updates concerning a concept, an event, or a named entity. Our proposed end-to-end method is fully automatic, and leverages a bootstrapping algorithm for gradually creating a high-quality dataset. Our results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models. We also leverage fine-tuned models in a number of downstream tasks, showing promising results with respect to competitive baselines.
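Summary
A fundamental challenge in current NLP is the inflexibility of models: existing architectures cannot "learn" new information. Model-centric solutions such as continual learning and parameter-efficient fine-tuning exist, but the question remains of how to reliably identify changes in language or in the world. In this paper we propose WikiTiDe, a dataset built from pairs of timestamped definitions extracted from Wikipedia. We argue that such a resource can help accelerate diachronic NLP, specifically by training models that scan knowledge resources for core updates concerning a concept, an event, or a named entity. Our end-to-end method is fully automatic and uses a bootstrapping algorithm to gradually create a high-quality dataset. Our results suggest that bootstrapping the seed version of WikiTiDe yields better fine-tuned models. We also apply the fine-tuned models to several downstream tasks, with promising results against competitive baselines.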
Towards Controllable Natural Language Inference through Lexical Inference Types
paper_authors: Yingji Zhang, Danilo S. Carvalho, Ian Pratt-Hartmann, Andre Freitas
for: This paper aims to provide a mechanism for producing explanatory (abductive) inference chains that ground claims to their supporting premises.
methods: Prior work employs the T5 model to generate an entailment tree directly, which explains how the answer is inferred but cannot explain or control the generation of intermediate steps, which is crucial for multi-hop inference. The paper defines lexical inference types based on Abstract Meaning Representation (AMR) graphs and modifies the architecture of T5 to learn a latent sentence representation conditioned on that type information.
results: The paper proposes a controlled natural language inference architecture for multi-premise explanatory inference and delivers a dataset of approximately 5000 annotated explanatory inference steps, with well-grounded lexical-symbolic operations. Experimental results indicate that the inference typing induced at the T5 bottleneck can help T5 to generate a conclusion under explicit control.
Abstract
Explainable natural language inference aims to provide a mechanism to produce explanatory (abductive) inference chains which ground claims to their supporting premises. A recent corpus called EntailmentBank strives to advance this task by explaining the answer to a question using an entailment tree \cite{dalvi2021explaining}. They employ the T5 model to directly generate the tree, which can explain how the answer is inferred. However, it lacks the ability to explain and control the generation of intermediate steps, which is crucial for the multi-hop inference process. In this work, we focus on proposing a controlled natural language inference architecture for multi-premise explanatory inference. To improve control and enable explanatory analysis over the generation, we define lexical inference types based on Abstract Meaning Representation (AMR) graph and modify the architecture of T5 to learn a latent sentence representation (T5 bottleneck) conditioned on said type information. We also deliver a dataset of approximately 5000 annotated explanatory inference steps, with well-grounded lexical-symbolic operations. Experimental results indicate that the inference typing induced at the T5 bottleneck can help T5 to generate a conclusion under explicit control.
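Summary
Explainable natural language inference aims to provide a mechanism for producing explanatory inference chains that ground claims in their supporting premises. A recent corpus, EntailmentBank, advances this task by explaining the answer to a question with an entailment tree \cite{dalvi2021explaining}. It uses the T5 model to generate the tree directly, which explains how the answer is inferred. However, it cannot explain or control the generation of intermediate steps, which is crucial for multi-hop inference. In this work we propose a controlled natural language inference architecture for multi-premise explanatory inference. To improve control and enable explanatory analysis, we define lexical inference types based on Abstract Meaning Representation (AMR) graphs and modify the T5 architecture to learn a latent sentence representation (T5 bottleneck) conditioned on the type information. We also provide a dataset of approximately 5000 annotated explanatory inference steps with well-grounded lexical-symbolic operations. Experiments indicate that the inference typing induced at the T5 bottleneck helps T5 generate a conclusion under explicit control.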
for: investigate a consistent method for deriving the correlation between sentence vector and semantic meaning of a sentence
methods: use three state-of-the-art word/sentence embedding methods (GPT-3, Word2Vec, and Sentence-BERT) to embed plain text sentence strings into high dimensional spaces, and compute the pairwise distance between any possible combination of two sentence vectors in an embedding space
results: observe correlations of the same sentence in different embedding spaces and correlations of different sentences in the same embedding space, which are consistent with the hypothesis and provide a foundation for further research
Abstract
This is an experiential study of investigating a consistent method for deriving the correlation between sentence vector and semantic meaning of a sentence. We first used three state-of-the-art word/sentence embedding methods including GPT-3, Word2Vec, and Sentence-BERT, to embed plain text sentence strings into high dimensional spaces. Then we compute the pairwise distance between any possible combination of two sentence vectors in an embedding space and map them into a matrix. Based on each distance matrix, we compute the correlation of distances of a sentence vector with respect to the other sentence vectors in an embedding space. Then we compute the correlation of each pair of the distance matrices. We observed correlations of the same sentence in different embedding spaces and correlations of different sentences in the same embedding space. These observations are consistent with our hypothesis and take us to the next stage.
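Summary
This is an experimental study that looks for a consistent method of deriving the correlation between a sentence vector and the semantic meaning of a sentence. We first use three state-of-the-art word/sentence embedding methods, GPT-3, Word2Vec, and Sentence-BERT, to embed plain-text sentence strings into high-dimensional spaces. We then compute the distance between every pair of sentence vectors in an embedding space and map the distances into a matrix. From each distance matrix, we compute the correlation of a sentence vector's distances to the other sentence vectors in that space. We then compute the correlation between each pair of distance matrices. We observe correlations of the same sentence across different embedding spaces and of different sentences within the same embedding space. These observations are consistent with our hypothesis and lay the groundwork for the next stage.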
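The distance-matrix procedure in this entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy vectors stand in for real GPT-3, Word2Vec, or Sentence-BERT embeddings, Euclidean distance and Pearson correlation are assumed because the abstract does not name the metrics, and every function name is hypothetical.

```python
import math


def pairwise_distances(vectors):
    """Matrix of Euclidean distances between every pair of vectors."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def sentence_correlation(dist_a, dist_b, i):
    """Correlate sentence i's distances to all other sentences across two spaces."""
    row_a = [d for j, d in enumerate(dist_a[i]) if j != i]
    row_b = [d for j, d in enumerate(dist_b[i]) if j != i]
    return pearson(row_a, row_b)


def matrix_correlation(dist_a, dist_b):
    """Correlate two distance matrices using their upper triangles."""
    n = len(dist_a)
    upper_a = [dist_a[i][j] for i in range(n) for j in range(i + 1, n)]
    upper_b = [dist_b[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(upper_a, upper_b)


# Assumed toy data: the same four "sentences" embedded in two different spaces.
space_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
space_b = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [6.0, 6.0, 0.0]]

dist_a = pairwise_distances(space_a)
dist_b = pairwise_distances(space_b)
print(sentence_correlation(dist_a, dist_b, 0), matrix_correlation(dist_a, dist_b))
```

Because the second toy space is a scaled copy of the first, both correlations come out at essentially 1; with real embeddings the values would show how far two spaces agree on which sentences are close.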
Mondrian: Prompt Abstraction Attack Against Large Language Models for Cheaper API Pricing
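results: Mondrian reduces the token length of user queries by 13% to 23%, and the abstracted queries do not significantly affect the utility of task-specific and general language models such as ChatGPT. Mondrian also reduces the token length of instruction prompts by at least 11% without compromising output quality. The attack therefore lets the adversary profit without bearing the cost of API development and deployment.
Abstract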
The Machine Learning as a Service (MLaaS) market is rapidly expanding and becoming more mature. For example, OpenAI's ChatGPT is an advanced large language model (LLM) that generates responses for various queries with associated fees. Although these models can deliver satisfactory performance, they are far from perfect. Researchers have long studied the vulnerabilities and limitations of LLMs, such as adversarial attacks and model toxicity. Inevitably, commercial ML models are also not exempt from such issues, which can be problematic as MLaaS continues to grow. In this paper, we discover a new attack strategy against LLM APIs, namely the prompt abstraction attack. Specifically, we propose Mondrian, a simple and straightforward method that abstracts sentences, which can lower the cost of using LLM APIs. In this approach, the adversary first creates a pseudo API (with a lower established price) to serve as the proxy of the target API (with a higher established price). Next, the pseudo API leverages Mondrian to modify the user query, obtain the abstracted response from the target API, and forward it back to the end user. Our results show that Mondrian successfully reduces user queries' token length ranging from 13% to 23% across various tasks, including text classification, generation, and question answering. Meanwhile, these abstracted queries do not significantly affect the utility of task-specific and general language models like ChatGPT. Mondrian also reduces instruction prompts' token length by at least 11% without compromising output quality. As a result, the prompt abstraction attack enables the adversary to profit without bearing the cost of API development and deployment.
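Summary
The Machine Learning as a Service (MLaaS) market is expanding rapidly and maturing. For example, OpenAI's ChatGPT is an advanced large language model (LLM) that generates responses to various queries, yet such models are far from perfect. Researchers have long studied the vulnerabilities and limitations of LLMs, such as adversarial attacks and model toxicity. Commercial ML models are not exempt from these issues, which may become a problem as MLaaS grows. In this paper we discover a new attack strategy against LLM APIs, the prompt abstraction attack. Specifically, we propose Mondrian, a simple and straightforward method that abstracts sentences into shorter ones. The adversary first creates a pseudo API (with a lower price) as a proxy for the target API (with a higher price). The pseudo API then uses Mondrian to modify the user query, obtains the abstracted response from the target API, and returns it to the end user. Our results show that Mondrian reduces the token length of user queries by 13% to 23%, and that the abstracted queries do not significantly affect task-specific and general language models such as ChatGPT. Mondrian also reduces the token length of instruction prompts by at least 11% without compromising output quality. The prompt abstraction attack therefore lets the adversary profit without bearing the cost of API development and deployment.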
Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue
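results: The model outperforms baselines in various capacities and matches the previous best model and ChatGPT in some abilities. RLHF further improves the model's instruction-following ability and safety.
Abstract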
Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to user intents. However, their performance lags behind general use cases in some expertise domains, such as Chinese medicine. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue data. These models lack the ability for doctor-like proactive inquiry and multi-turn comprehension and cannot always align responses with experts in safety and professionalism. In this work, we introduce Zhongjing, the first Chinese medical LLaMA-based LLM that implements an entire training pipeline from pre-training to reinforcement learning with human feedback (RLHF). Additionally, we introduce a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation. We define a refined annotation rule and evaluation criteria given the biomedical domain's unique characteristics. Results show that our model outperforms baselines in various capacities and matches the performance of ChatGPT in a few abilities, despite having 50x training data with previous best model and 100x parameters with ChatGPT. RLHF further improves the model's instruction-following ability and safety. We also release our code, datasets and model for further research.
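Summary
Recent progress in large language models (LLMs) has brought major breakthroughs in understanding and responding to user intents. In some expert domains such as Chinese medicine, however, their performance still lags behind general use cases. Existing attempts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn dialogue data. These models lack doctor-like proactive inquiry and multi-turn comprehension, and cannot always keep responses consistent with safety and professionalism. In this work we introduce Zhongjing, the first LLaMA-based Chinese medical language model, which implements an entire training pipeline from pre-training to reinforcement learning with human feedback (RLHF). We also introduce CMtMedQA, a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, which significantly improves the model's capability for complex dialogue and proactive inquiry initiation. We adopt annotation rules and evaluation criteria specific to the domain. Results show that our model outperforms baselines in various capacities and matches the previous best model and ChatGPT in some abilities, despite the 50x gap in training data relative to the previous best model and the 100x gap in parameters relative to ChatGPT. RLHF further improves the model's instruction-following ability and safety. We also release our code, datasets, and model for further research.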
Knowledge-preserving Pruning for Pre-trained Language Models without Retraining
results: On the SQuAD benchmark at a high compression rate of 80%, K-pruning achieves up to 58.02%p higher F1 score than existing retraining-free pruning algorithms.
Abstract
Given a pre-trained language model, how can we efficiently compress it without retraining? Retraining-free structured pruning algorithms are crucial in pre-trained language model compression due to their significantly reduced pruning cost and capability to prune large language models. However, existing retraining-free algorithms encounter severe accuracy degradation, as they fail to preserve the useful knowledge of pre-trained models. In this paper, we propose K-pruning (Knowledge-preserving pruning), an accurate retraining-free structured pruning algorithm for pre-trained language models. K-pruning identifies and prunes attention heads and neurons deemed to be superfluous, based on the amount of their inherent knowledge. K-pruning applies an iterative process of pruning followed by knowledge reconstruction for each sub-layer to preserve the knowledge of the pre-trained models. Consequently, K-pruning shows up to 58.02%p higher F1 score than existing retraining-free pruning algorithms under a high compression rate of 80% on the SQuAD benchmark.
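Summary
Given a pre-trained language model, how can we compress it efficiently without retraining? Retraining-free structured pruning algorithms are crucial for compressing pre-trained language models because they greatly reduce pruning cost and can prune large language models. However, existing retraining-free algorithms suffer severe accuracy degradation because they fail to preserve the useful knowledge of the pre-trained model. In this paper we propose K-pruning (Knowledge-preserving pruning), an accurate retraining-free structured pruning algorithm for pre-trained language models. K-pruning prunes attention heads and neurons based on the amount of knowledge they hold, and applies an iterative process of pruning followed by knowledge reconstruction in each sub-layer to preserve the pre-trained model's knowledge. As a result, K-pruning achieves up to 58.02%p higher F1 score than existing retraining-free pruning algorithms at an 80% compression rate on the SQuAD benchmark.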
Improving Few-shot and Zero-shot Entity Linking with Coarse-to-Fine Lexicon-based Retriever
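paper_authors: Shijue Huang, Bingbing Wang, Libo Qin, Qin Zhao, Ruifeng Xu
for: This paper targets Chinese few-shot and zero-shot entity linking, in particular linking tail and emerging entities more accurately.
methods: The paper proposes a coarse-to-fine lexicon-based retriever that retrieves entity candidates in two layers. The first layer retrieves by entity name; the second uses entity descriptions to narrow the search and disambiguate new entities.
results: Experiments show strong performance without extensive finetuning, and the approach ranks 1st in NLPCC 2023 Shared Task 6 on Chinese Few-shot and Zero-shot Entity Linking.
Abstract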
Few-shot and zero-shot entity linking focus on the tail and emerging entities, which are more challenging but closer to real-world scenarios. The mainstream method is the "retrieve and rerank" two-stage framework. In this paper, we propose a coarse-to-fine lexicon-based retriever to retrieve entity candidates in an effective manner, which operates in two layers. The first layer retrieves coarse-grained candidates by leveraging entity names, while the second layer narrows down the search to fine-grained candidates within the coarse-grained ones. In addition, this second layer utilizes entity descriptions to effectively disambiguate tail or new entities that share names with existing popular entities. Experimental results indicate that our approach can obtain superior performance without requiring extensive finetuning in the retrieval stage. Notably, our approach ranks 1st in NLPCC 2023 Shared Task 6 on Chinese Few-shot and Zero-shot Entity Linking.
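Summary
Few-shot and zero-shot entity linking focuses on tail and emerging entities, which are more challenging but closer to real-world scenarios. The mainstream method is the two-stage "retrieve and rerank" framework. In this paper we propose a coarse-to-fine lexicon-based retriever that retrieves entity candidates effectively and operates in two layers. The first layer retrieves coarse-grained candidates by entity name; the second narrows the search to fine-grained candidates among them. The second layer also uses entity descriptions to disambiguate tail or new entities that share names with existing popular entities. Experiments show that our approach performs strongly without extensive finetuning in the retrieval stage. Notably, it ranks 1st in NLPCC 2023 Shared Task 6 on Chinese Few-shot and Zero-shot Entity Linking.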
Coupling Symbolic Reasoning with Language Modeling for Efficient Longitudinal Understanding of Unstructured Electronic Medical Records
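results: The study finds that coupling symbolic reasoning with language modeling improves the extraction of several medical variables from unstructured medical records. It also finds that open-source LLMs have retrieval performance comparable to commercial LLMs. Finally, it stresses the need to steer LLMs with symbolic reasoning, since using LLMs alone gives the lowest performance.
Abstract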
The application of Artificial Intelligence (AI) in healthcare has been revolutionary, especially with the recent advancements in transformer-based Large Language Models (LLMs). However, the task of understanding unstructured electronic medical records remains a challenge given the nature of the records (e.g., disorganization, inconsistency, and redundancy) and the inability of LLMs to derive reasoning paradigms that allow for comprehensive understanding of medical variables. In this work, we examine the power of coupling symbolic reasoning with language modeling toward improved understanding of unstructured clinical texts. We show that such a combination improves the extraction of several medical variables from unstructured records. In addition, we show that the state-of-the-art commercially-free LLMs enjoy retrieval capabilities comparable to those provided by their commercial counterparts. Finally, we elaborate on the need for LLM steering through the application of symbolic reasoning as the exclusive use of LLMs results in the lowest performance.
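Summary
The application of Artificial Intelligence (AI) in healthcare has been revolutionary, especially with recent advances in transformer-based large language models (LLMs). However, understanding unstructured electronic medical records remains a challenge, both because of the nature of the records (disorganization, inconsistency, and redundancy) and because LLMs cannot derive reasoning paradigms that allow a comprehensive understanding of medical variables. In this work we examine coupling symbolic reasoning with language modeling to improve the understanding of unstructured clinical texts. We find that the combination improves the extraction of several medical variables. We also find that state-of-the-art commercially-free LLMs have retrieval capabilities comparable to their commercial counterparts. Finally, we discuss the need to steer LLMs through symbolic reasoning, since using LLMs alone gives the lowest performance.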
LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning
paper_authors: Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li
for: fine-tuning large language models (LLMs) with the low-rank adaptation (LoRA) method
methods: LoRA-FA freezes the projection-down weight $A$ and updates the projection-up weight $B$ in each LoRA layer, which eliminates the requirement to store full-rank input activations and reduces activation memory without performance degradation or expensive recomputation
results: LoRA-FA achieves fine-tuning accuracy close to full-parameter fine-tuning and LoRA across different tasks, and reduces the overall memory cost by up to 1.4$\times$ compared to LoRA
Abstract
The low-rank adaptation (LoRA) method can largely reduce the amount of trainable parameters for fine-tuning large language models (LLMs). However, it still requires expensive activation memory to update low-rank weights. Reducing the number of LoRA layers or using activation recomputation could harm the fine-tuning performance or increase the computational overhead. In this work, we present LoRA-FA, a memory-efficient fine-tuning method that reduces the activation memory without performance degradation and expensive recomputation. LoRA-FA chooses to freeze the projection-down weight of $A$ and update the projection-up weight of $B$ in each LoRA layer. It ensures that the change of model weight resides in a low-rank space during LLM fine-tuning, while eliminating the requirement to store full-rank input activations. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA can always achieve close fine-tuning accuracy across different tasks compared to full parameter fine-tuning and LoRA. Furthermore, LoRA-FA can reduce the overall memory cost by up to 1.4$\times$ compared to LoRA.
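Summary
The LoRA method greatly reduces the number of trainable parameters when fine-tuning large language models (LLMs), but it still needs expensive activation memory to update the low-rank weights. Reducing the number of LoRA layers or using activation recomputation may harm fine-tuning performance or increase computational overhead. In this work we present LoRA-FA, a memory-efficient fine-tuning method that reduces activation memory without performance degradation or expensive recomputation. In each LoRA layer, LoRA-FA freezes the projection-down weight $A$ and updates the projection-up weight $B$. This keeps the change in model weights in a low-rank space during fine-tuning while removing the need to store full-rank input activations. We run extensive experiments across model types (RoBERTa, T5, LLaMA) and model scales. The results show that LoRA-FA reaches accuracy close to full-parameter fine-tuning and LoRA across tasks, and reduces the overall memory cost by up to 1.4$\times$ compared to LoRA.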
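To see why freezing $A$ removes the need to keep the full-rank input, consider one adapted layer $h = Wx + B(Ax)$. The gradient with respect to $B$ is the outer product of the output gradient with $z = Ax$, so only the rank-$r$ vector $z$ has to be saved for the backward pass. The sketch below illustrates this on toy numbers; it is not the authors' implementation, and the sizes, values, squared-error loss, and function names are all assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def forward(W, A, B, x):
    """Compute h = W x + B (A x); return h and the rank-r activation z = A x."""
    z = matvec(A, x)
    h = [base + delta for base, delta in zip(matvec(W, x), matvec(B, z))]
    return h, z


def loss_and_grad_B(W, A, B, x, target):
    """Squared-error loss and its gradient with respect to B only."""
    h, z = forward(W, A, B, x)
    grad_h = [hi - ti for hi, ti in zip(h, target)]
    loss = 0.5 * sum(g * g for g in grad_h)
    # dL/dB = grad_h z^T: this needs only z (length r), never the full input x.
    grad_B = [[g * zj for zj in z] for g in grad_h]
    return loss, grad_B, z


# Assumed toy sizes: d_in = 4, d_out = 3, rank r = 2.
W = [[0.1] * 4 for _ in range(3)]                   # frozen pre-trained weight
A = [[0.5, -1.0, 0.0, 2.0], [1.0, 0.0, 1.0, -0.5]]  # frozen projection-down weight
B = [[0.0, 0.0] for _ in range(3)]                  # trainable projection-up weight
x = [1.0, 2.0, 3.0, 4.0]
target = [0.0, 1.0, 2.0]

loss_before, grad_B, saved = loss_and_grad_B(W, A, B, x, target)
learning_rate = 0.01
B = [[b - learning_rate * g for b, g in zip(b_row, g_row)]
     for b_row, g_row in zip(B, grad_B)]
loss_after, _, _ = loss_and_grad_B(W, A, B, x, target)
print(len(saved), len(x), loss_before, loss_after)
```

In plain LoRA, $A$ is also trained, and its gradient $B^\top (\partial L / \partial h)\, x^\top$ depends on the full input $x$, which is why the full-rank activation must be stored there.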
Studying Large Language Model Generalization with Influence Functions
paper_authors: Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman
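methods: Influence functions answer a counterfactual: how would the model's parameters and outputs change if a given sequence were added to the training set? They are difficult to scale to large language models (LLMs) because computing an inverse-Hessian-vector product (IHVP) is hard.
results: We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions to LLMs, reaching accuracy similar to traditional influence function estimators while computing the IHVP orders of magnitude faster. We also investigate two algorithmic techniques for reducing the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. With influence functions we study the generalization patterns of LLMs, including the sparsity of influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many sophisticated forms of generalization, we find a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions provide a powerful new tool for studying the generalization properties of LLMs.
Abstract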
When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
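Summary
When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a valuable source of evidence is: which training examples contribute most to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? Influence functions have produced insights for small models, but they are difficult to scale to large language models (LLMs) because computing an inverse-Hessian-vector product (IHVP) is hard. We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions to LLMs with up to 52 billion parameters, with accuracy similar to traditional estimators. We also investigate two algorithmic techniques for reducing the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to study the generalization patterns of LLMs, including the sparsity of influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we find a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.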
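The counterfactual that influence functions answer can be checked by hand on a model small enough to solve exactly. The sketch below is a toy illustration, not the paper's method: it uses a one-parameter ridge regression with an exact Hessian instead of an LLM with the EK-FAC approximation, and the data, regularization strength, and function names are assumptions. It compares the first-order prediction, minus the inverse Hessian times the candidate's loss gradient, against actually re-solving the problem with the candidate example upweighted by a small amount.

```python
def fit(xs, ys, lam, candidate=None, eps=0.0):
    """Closed-form minimizer of
    (1/N) * sum 0.5*(theta*x - y)^2 + 0.5*lam*theta^2 + eps * 0.5*(theta*xm - ym)^2.
    """
    n = len(xs)
    numerator = sum(x * y for x, y in zip(xs, ys)) / n
    denominator = sum(x * x for x in xs) / n + lam
    if candidate is not None:
        xm, ym = candidate
        numerator += eps * xm * ym
        denominator += eps * xm * xm
    return numerator / denominator


def influence_on_theta(xs, ys, lam, candidate):
    """First-order change in theta per unit of upweighting: -H^{-1} * grad L(candidate)."""
    theta = fit(xs, ys, lam)
    hessian = sum(x * x for x in xs) / len(xs) + lam  # second derivative of the objective
    xm, ym = candidate
    grad = xm * (theta * xm - ym)                     # gradient of the candidate's loss
    return -grad / hessian


# Assumed toy data and candidate training example.
xs, ys, lam = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2], 0.1
candidate = (2.0, 5.0)

predicted = influence_on_theta(xs, ys, lam, candidate)
eps = 1e-6
measured = (fit(xs, ys, lam, candidate, eps) - fit(xs, ys, lam)) / eps
print(predicted, measured)
```

In this scalar case the inverse Hessian is a single division. For an LLM the same quantity requires an inverse-Hessian-vector product over billions of parameters, which is the step the entry above says EK-FAC is used to approximate.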
Dialogue Systems Can Generate Appropriate Responses without the Use of Question Marks? – Investigation of the Effects of Question Marks on Dialogue Systems
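results: The study finds that question marks have a significant impact on dialogue systems, and analyzes specific examples to determine which types of utterances are affected.
Abstract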
When individuals engage in spoken discourse, various phenomena can be observed that differ from those that are apparent in text-based conversation. While written communication commonly uses a question mark to denote a query, in spoken discourse, queries are frequently indicated by a rising intonation at the end of a sentence. However, numerous speech recognition engines do not append a question mark to recognized queries, presenting a challenge when creating a spoken dialogue system. Specifically, the absence of a question mark at the end of a sentence can impede the generation of appropriate responses to queries in spoken dialogue systems. Hence, we investigate the impact of question marks on dialogue systems, with the results showing that they have a significant impact. Moreover, we analyze specific examples in an effort to determine which types of utterances have the impact on dialogue systems.
Summary
When people speak, phenomena appear that differ from those in text-based conversation. Written communication usually marks a query with a question mark, whereas in speech a query is often indicated by rising intonation at the end of the sentence. However, many speech recognition engines do not append a question mark to recognized queries, which poses a challenge for building spoken dialogue systems. Specifically, the absence of a question mark at the end of a sentence can hinder the generation of appropriate responses to queries. We therefore investigate the impact of question marks on dialogue systems, and the results show that the impact is significant. We also analyze specific examples to determine which types of utterances have the greatest impact on dialogue systems.
Towards General Text Embeddings with Multi-stage Contrastive Learning
results: The model performs strongly on text embedding benchmarks, outperforming earlier and larger models, and also excels at code retrieval without additional fine-tuning.
Abstract
We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.
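Summary
We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. Following the recent trend of unifying NLP tasks into a single format, we train a unified text embedding model with contrastive learning over a diverse mixture of datasets from multiple sources, and greatly increase the amount of training data. This gives GTE substantial gains over existing embedding models. Notably, even with only 110M parameters, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and text embedding models 10x larger on the massive text embedding benchmark. In addition, without fine-tuning on each programming language individually, our model outperforms the previous best code retrievers of similar size. In summary, by effectively using multi-stage contrastive learning, our model offers a powerful and efficient text embedding model applicable across NLP and code-related tasks.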
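Contrastive training of text embeddings is commonly built on an InfoNCE-style loss with in-batch negatives: each query should score its own paired document higher than every other document in the batch. The sketch below shows that generic loss only. It is an assumption that this resembles GTE's objective, since the abstract does not spell it out, and the temperature value, toy vectors, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_partition - logits[i])
    return sum(losses) / len(losses)


# Assumed toy batch of two query/document embedding pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]      # each query matches its own document
mismatched_docs = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped

aligned_loss = info_nce(queries, aligned_docs)
mismatched_loss = info_nce(queries, mismatched_docs)
print(aligned_loss, mismatched_loss)
```

The loss is near zero when every query is closest to its own document and large when the pairs are swapped, which is the signal a contrastive trainer minimizes.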
UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition
paper_authors: Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon
for: The paper aims to use mission-focused instruction tuning to train more cost-efficient models that excel across a broad class of applications.
methods: Using named entity recognition (NER) as a case study, the paper distills ChatGPT into much smaller UniversalNER models that reach high accuracy on open NER.
results: Without any direct supervision, UniversalNER attains remarkable NER accuracy across many domains and datasets, outperforming Alpaca and Vicuna on average and also surpassing the InstructUIE system.
Abstract
Large language models (LLMs) have demonstrated remarkable generalizability, such as understanding arbitrary entities and relations. Instruction tuning has proven effective for distilling LLMs into more cost-efficient models such as Alpaca and Vicuna. Yet such student models still trail the original LLMs by large margins in downstream applications. In this paper, we explore targeted distillation with mission-focused instruction tuning to train student models that can excel in a broad application class such as open information extraction. Using named entity recognition (NER) for case study, we show how ChatGPT can be distilled into much smaller UniversalNER models for open NER. For evaluation, we assemble the largest NER benchmark to date, comprising 43 datasets across 9 diverse domains such as biomedicine, programming, social media, law, finance. Without using any direct supervision, UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points in average. With a tiny fraction of parameters, UniversalNER not only acquires ChatGPT's capability in recognizing arbitrary entity types, but also outperforms its NER accuracy by 7-9 absolute F1 points in average. Remarkably, UniversalNER even outperforms by a large margin state-of-the-art multi-task instruction-tuned systems such as InstructUIE, which uses supervised NER examples. We also conduct thorough ablation studies to assess the impact of various components in our distillation approach. We will release the distillation recipe, data, and UniversalNER models to facilitate future research on targeted distillation.
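Summary
Large language models (LLMs) have shown strong generalizability, such as understanding arbitrary entities and relations. Instruction tuning has proven effective for distilling LLMs into more cost-efficient models such as Alpaca and Vicuna. However, these student models still trail the original LLMs by large margins in downstream applications. In this paper, we explore targeted distillation with mission-focused instruction tuning to train student models that excel in a broad application class. Using named entity recognition (NER) as a case study, we show how ChatGPT can be distilled into much smaller UniversalNER models for open NER. For evaluation, we assemble the largest NER benchmark to date, comprising 43 datasets across 9 diverse domains such as biomedicine, programming, social media, law, and finance. Without any direct supervision, UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points on average. With a tiny fraction of the parameters, UniversalNER not only acquires ChatGPT's ability to recognize arbitrary entity types, but also exceeds its NER accuracy by 7-9 absolute F1 points on average. UniversalNER even outperforms, by a large margin, state-of-the-art multi-task instruction-tuned systems such as InstructUIE, which uses supervised NER examples. We also conduct thorough ablation studies to assess the impact of the various components of our distillation approach. We will release the distillation recipe, data, and UniversalNER models to facilitate future research on targeted distillation.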
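The results above are reported in absolute F1 points. As a rough illustration of how an entity-level F1 score is commonly computed, here is a minimal sketch. It assumes exact-match scoring over `(start, end, type)` tuples, which is an assumption for illustration; the abstract does not state the exact matching rule used in the benchmark, and the function name `entity_f1` is hypothetical.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 under exact match (assumed scoring rule).

    Each entity is a (start, end, type) tuple; a prediction counts as
    correct only if all three fields match a gold entity.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Toy example: the second prediction has the right span but the wrong type.
gold = [(0, 2, "person"), (5, 6, "organization"), (9, 10, "location")]
predicted = [(0, 2, "person"), (5, 6, "location"), (9, 10, "location")]
score = entity_f1(gold, predicted)  # precision = recall = 2/3
```

A gap of "30 absolute F1 points" then means a difference of 0.30 on this 0-to-1 scale, not a 30% relative improvement.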
From Ambiguity to Explicitness: NLP-Assisted 5G Specification Abstraction for Formal Analysis
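results: The study implements three models based on different dependency relationships; the best model reaches a valid accuracy of 39% for identifier extraction and 42% for formal property prediction. These results demonstrate the feasibility of the approach and point toward an efficient procedure for analyzing large-scale, complex specifications and protocols.
Abstract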
Formal method-based analysis of the 5G wireless communication protocol is crucial for identifying logical vulnerabilities and facilitating an all-encompassing security assessment, especially in the design phase. Natural Language Processing (NLP) assisted techniques and most of the associated tools are not widely adopted by industry or the research community. Traditional formal verification through a mathematical approach relies heavily on manual logical abstraction, which is time-consuming and error-prone. NLP-assisted methods may have seen little use in industrial research because the ambiguity inherent in the natural language of protocol designs conflicts with the explicitness that formal verification requires. To address the challenge of adopting formal methods in protocol design, and targeting 3GPP protocols that are written in natural language, we propose a hybrid approach to streamline the analysis of protocols. We introduce a two-step pipeline that first uses NLP tools to construct data and then uses the constructed data to extract identifiers and formal properties with an NLP model. The identifiers and formal properties are further used for formal analysis. We implemented three models that take different dependencies between identifiers and formal properties as criteria. The optimal model reaches a valid accuracy of 39% for identifier extraction and 42% for formal property prediction. Our work is a proof of concept for an efficient procedure for formal analysis of large-scale, complicated specifications and protocols, especially for 5G and nextG communications.
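Summary
Formal method-based analysis of the 5G wireless communication protocol is crucial, especially in the design phase. Natural Language Processing (NLP) assisted techniques and tools are not widely adopted by industry or the research community. Traditional formal verification through a mathematical approach relies on manual logical abstraction, which is time-consuming and error-prone. Because the ambiguity of natural-language protocol designs conflicts with the explicitness of formal verification, NLP-assisted methods have not been widely used in industrial research. To address the challenge of adopting formal methods in protocol design, we propose a hybrid approach. We introduce a two-step pipeline that first uses NLP tools to construct data and then uses the constructed data to extract identifiers and formal properties with an NLP model. The identifiers and formal properties are then used for formal analysis. We implemented three models that take different dependencies between identifiers and formal properties as criteria. The optimal model reaches a valid accuracy of 39% for identifier extraction and 42% for formal property prediction. Our work is a proof of concept for an efficient procedure for formal analysis of large-scale, complicated specifications and protocols, especially for 5G and nextG communications.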
Adapter-based Selective Knowledge Distillation for Federated Multi-domain Meeting Summarization
results: Experiments show that AdaFedSelecKD achieves performance comparable to centralized training methods on the QMSum dataset, and that it is stable and robust.
Abstract
Meeting summarization has emerged as a promising technique for providing users with condensed summaries. However, existing work has focused on training models on centralized data, neglecting real-world scenarios where meeting data are infeasible to collect centrally, due to their sensitive nature. This gap motivates us to explore federated learning for meeting summarization. Two critical challenges impede progress. First, state-of-the-art summarizers are based on parameter-heavy pre-trained models. Exchanging such a model's parameters across clients imposes large bandwidth costs. Second, as real-world meeting data belong to various domains and are distributed across clients, they are instances of non-identically and independently distributed (non-IID). IID assumptions do not hold, which changes which forms of learning algorithms best apply. To address this, we propose Adapter-based Federated Selective Knowledge Distillation (AdaFedSelecKD) for training performant client models. Specifically, we develop an adapter-based summarization model where two adapters cooperatively facilitate learning using fewer parameters to reduce communication costs. Then, we devise a selective knowledge distillation strategy, assisting clients in robustly handling domain-focused modelling on their own data, while leveraging global parameters based on non-IID data. Extensive experiments on the QMSum benchmark demonstrate AdaFedSelecKD can achieve comparable performance with powerful centralized training methods, and shows its generalizability and robustness.
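The knowledge distillation component can be illustrated with the standard temperature-scaled distillation loss. This is a generic sketch only: the abstract does not specify AdaFedSelecKD's selection rule or which model acts as teacher, so the selective part is not reproduced here, and the function names and the temperature of 2.0 are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )

# Toy logits over a three-token vocabulary (illustrative values).
teacher = [2.0, 1.0, 0.1]
close_student = [1.9, 1.1, 0.2]
far_student = [0.1, 1.0, 2.0]
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A selective strategy would apply a loss like this only to some predictions or some rounds; how that choice is made is specific to the paper and is not shown here.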
SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability
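methods: The paper proposes a new NAR model that combines the accuracy of AED-based models with the efficiency of NAR models and delivers strong contextualization performance.
results: In experiments on 50,000 hours of industrial data, the proposed model outperforms strong baselines on both customization and general ASR tasks; the paper also proposes an efficient hotword filtering method.
Abstract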
Hotword customization is one of the important issues remaining in the ASR field: it is valuable to enable users of ASR systems to customize names of entities, persons, and other phrases. The past few years have seen both implicit and explicit modeling strategies developed for ASR contextualization. While these approaches have performed adequately, they still exhibit certain shortcomings, such as instability in effectiveness. In this paper we propose Semantic-augmented Contextual-Paraformer (SeACo-Paraformer), a novel NAR-based ASR system with flexible and effective hotword customization ability. It combines the accuracy of the AED-based model, the efficiency of the NAR model, and excellent contextualization performance. In experiments on 50,000 hours of industrial big data, our proposed model outperforms strong baselines in customization and general ASR tasks. In addition, we explore an efficient way to filter large-scale incoming hotwords for further improvement. The source code and the industrial models proposed and compared are all open-sourced, along with two hotword test sets.
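Summary
Hotword customization is an important open issue in the ASR field: it is valuable to let users of ASR systems customize names of entities, persons, and other phrases. Over the past few years, both implicit and explicit modeling strategies have been developed for ASR contextualization. Although these approaches have performed adequately, they still exhibit certain shortcomings, such as instability in effectiveness. In this paper, we propose Semantic-augmented Contextual-Paraformer (SeACo-Paraformer), a novel NAR-based ASR system with flexible and effective hotword customization ability. It combines the accuracy of the AED-based model, the efficiency of the NAR model, and excellent contextualization performance. In experiments on 50,000 hours of industrial big data, the proposed model outperforms strong baselines on customization and general ASR tasks. We also explore an efficient way to filter large-scale incoming hotwords for further improvement. The source code and industrial models have been released, together with two hotword test sets.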
Exploring Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning
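results: Experiments show that the quality of automatically generated distractors and feedback messages still leaves considerable room for improvement. The paper also outlines directions for future research.
Abstract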
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer and grade, and are a reliable format for both assessment and practice. An important aspect of MCQs is the distractors, i.e., incorrect options that are designed to target specific misconceptions or insufficient knowledge among students. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers, which has limited scalability. In this work, we explore the task of automated distractor and corresponding feedback message generation in math MCQs using large language models. We establish a formulation of these two tasks and propose a simple, in-context learning-based solution. Moreover, we explore using two non-standard metrics to evaluate the quality of the generated distractors and feedback messages. We conduct extensive experiments on these tasks using a real-world MCQ dataset that contains student response information. Our findings suggest that there is a lot of room for improvement in automated distractor and feedback generation. We also outline several directions for future work.
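Summary
Multiple-choice questions (MCQs) are very common in education because they are easy to administer and grade and work well for both assessment and practice. Distractors, the incorrect options in an MCQ, are an important feature: they need to be designed to target students' misconceptions or gaps in knowledge. However, crafting high-quality distractors remains a labor-intensive task, which limits scalability. In this work, we explore automatically generating distractors and the corresponding feedback messages for math MCQs using large language models. We formulate these two tasks and propose a simple solution based on in-context learning. We also use two non-standard metrics to evaluate the quality of the generated distractors and feedback messages. We run extensive experiments on a real-world MCQ dataset that contains student response information, and our findings show that automated distractor and feedback generation still has a lot of room for improvement. We also outline several directions for future work.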
Average-Hard Attention Transformers are Constant-Depth Uniform Threshold Circuits
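results: Prior work showed that average-hard attention transformers recognize languages in TC0 (Merrill et al., 2022) and that log-precision transformers recognize languages in uniform TC0 (Merrill and Sabharwal, 2023), so both transformer models can be simulated by constant-depth threshold circuits. This paper shows that the first result can be extended to yield uniform circuits as well.
Abstract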
Transformers have emerged as a widely used neural network model for various natural language processing tasks. Previous research explored their relationship with constant-depth threshold circuits, making two assumptions: average-hard attention and logarithmic precision for internal computations relative to input length. Merrill et al. (2022) prove that average-hard attention transformers recognize languages that fall within the complexity class TC0, denoting the set of languages that can be recognized by constant-depth polynomial-size threshold circuits. Likewise, Merrill and Sabharwal (2023) show that log-precision transformers recognize languages within the class of uniform TC0. This shows that both transformer models can be simulated by constant-depth threshold circuits, with the latter being more robust due to generating a uniform circuit family. Our paper shows that the first result can be extended to yield uniform circuits as well.
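Summary
Transformers have become a widely used neural network model for natural language processing tasks. Previous research explored their relationship with constant-depth threshold circuits under two assumptions: average-hard attention, and logarithmic precision for internal computations relative to input length. Merrill et al. (2022) prove that average-hard attention transformers recognize languages that fall within the complexity class TC0, the set of languages that can be recognized by constant-depth, polynomial-size threshold circuits. Likewise, Merrill and Sabharwal (2023) show that log-precision transformers recognize languages within uniform TC0. Both transformer models can therefore be simulated by constant-depth threshold circuits, with the latter result being more robust because it produces a uniform circuit family. Our paper shows that the first result can be extended to yield uniform circuits as well.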
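Two notions from this entry can be made concrete with a small sketch: average-hard attention, where attention weight is split uniformly over the positions that attain the maximum score, and a threshold gate, the basic building block of TC0 circuits. The function names below are hypothetical and the code only illustrates the definitions; it is not the paper's construction.

```python
def average_hard_attention(scores, values):
    """Average the value vectors at all positions that tie for the top score."""
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    dim = len(values[0])
    return [sum(values[i][d] for i in winners) / len(winners) for d in range(dim)]

def threshold_gate(bits, k):
    """Output 1 if at least k of the input bits are 1, else 0."""
    return 1 if sum(bits) >= k else 0

def majority(bits):
    """Strict majority expressed as a single threshold gate."""
    return threshold_gate(bits, len(bits) // 2 + 1)

# Positions 1 and 2 tie for the maximum score, so their values are averaged.
attended = average_hard_attention([1, 3, 3], [[0.0], [2.0], [4.0]])
```

Because a threshold gate counts its inputs, it can compute functions such as MAJORITY in a single layer, which is the kind of counting that averaging over tied attention positions also relies on.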