results: Through rigorous training and evaluation on multiple unseen datasets, the paper shows that REMARK-LLM can insert 2 times more signature bits into text while maintaining semantic integrity, and exhibits better resilience against a range of watermark detection and removal attacks.
Abstract
We present REMARK-LLM, a novel, efficient, and robust watermarking framework designed for texts generated by large language models (LLMs). Synthesizing human-like content using LLMs necessitates vast computational resources and extensive datasets, encapsulating critical intellectual property (IP). However, the generated content is prone to malicious exploitation, including spamming and plagiarism. To address these challenges, REMARK-LLM proposes three new components: (i) a learning-based message encoding module to infuse binary signatures into LLM-generated texts; (ii) a reparameterization module to transform the dense distributions from the message encoding into the sparse distribution of the watermarked textual tokens; (iii) a decoding module dedicated to signature extraction. Furthermore, we introduce an optimized beam search algorithm to guarantee the coherence and consistency of the generated content. REMARK-LLM is rigorously trained to encourage the preservation of semantic integrity in watermarked content while ensuring effective watermark retrieval. Extensive evaluations on multiple unseen datasets highlight REMARK-LLM's proficiency and transferability in inserting 2 times more signature bits into the same texts than prior art, all while maintaining semantic integrity. Furthermore, REMARK-LLM exhibits better resilience against a spectrum of watermark detection and removal attacks.
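The reparameterization module lends itself to a short illustration. The sketch below is a hedged approximation rather than the paper's actual code: it uses a straight-through Gumbel-softmax to map the dense distributions produced by message encoding to sparse, near one-hot token distributions while staying differentiable. All shapes and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def reparameterize(dense_logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """(batch, seq_len, vocab) dense logits -> sparse one-hot-like distributions."""
    return F.gumbel_softmax(dense_logits, tau=tau, hard=True, dim=-1)

dense = torch.randn(2, 16, 32000)        # stand-in output of message encoding
sparse = reparameterize(dense)           # near one-hot at each position
watermarked_ids = sparse.argmax(dim=-1)  # token ids passed on for decoding
```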
GRI: Graph-based Relative Isomorphism of Word Embedding Spaces
paper_authors: Muhammad Asif Ali, Yan Hu, Jianbin Qin, Di Wang
for: automatic construction of bilingual dictionaries using monolingual embedding spaces
methods: combines distributional training objectives with attentive graph convolutions to consider the impact of semantically similar words
results: outperforms existing research by improving the average P@1 by up to 63.6%
Abstract
Automated construction of bilingual dictionaries using monolingual embedding spaces is a core challenge in machine translation. The end performance of these dictionaries relies upon the geometric similarity of individual spaces, i.e., their degree of isomorphism. Existing attempts aimed at controlling the relative isomorphism of different spaces fail to incorporate the impact of semantically related words in the training objective. To address this, we propose GRI that combines the distributional training objectives with attentive graph convolutions to unanimously consider the impact of semantically similar words required to define/compute the relative isomorphism of multiple spaces. Experimental evaluation shows that GRI outperforms the existing research by improving the average P@1 by a relative score of up to 63.6%. We release the codes for GRI at https://github.com/asif6827/GRI.
results: implements an efficient kNN-MT framework that can quickly construct large-scale datastores, achieving a comparable gain on the WMT'19 German-to-English translation task.
Abstract
k-nearest-neighbor machine translation (kNN-MT) boosts the translation quality of a pre-trained neural machine translation (NMT) model by utilizing translation examples during decoding. Translation examples are stored in a vector database, called a datastore, which contains one entry for each target token from the parallel data it is made from. Due to its size, it is computationally expensive both to construct and to retrieve examples from the datastore. In this paper, we present an efficient and extensible kNN-MT framework, knn-seq, for researchers and developers that is carefully designed to run efficiently, even with a billion-scale large datastore. knn-seq is developed as a plug-in on fairseq and makes it easy to switch models and kNN indexes. Experimental results show that our implemented kNN-MT achieves a comparable gain to the original kNN-MT, and the billion-scale datastore construction took 2.21 hours in the WMT'19 German-to-English translation task. We publish our knn-seq as an MIT-licensed open-source project and the code is available on https://github.com/naist-nlp/knn-seq . The demo video is available on https://youtu.be/zTDzEOq80m0 .
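To make the datastore lookup concrete, here is a minimal sketch of the kNN-MT retrieval step under stated assumptions: a flat L2 faiss index over decoder hidden states, a softmax over negative distances, and an interpolation weight lmbda. None of these are knn-seq's actual defaults.

```python
import numpy as np
import faiss

d = 1024                                             # decoder hidden dimension
keys = np.random.rand(100_000, d).astype("float32")  # stand-in datastore keys
values = np.random.randint(0, 32_000, size=100_000)  # target token id per key

index = faiss.IndexFlatL2(d)   # exact search; real datastores use quantized indexes
index.add(keys)

def knn_weights(query: np.ndarray, k: int = 8, temperature: float = 10.0):
    """Retrieve k neighbors of one query and weight them by distance."""
    dists, ids = index.search(query[None, :].astype("float32"), k)
    w = np.exp(-dists[0] / temperature)
    return values[ids[0]], w / w.sum()               # neighbor tokens, weights

# The final kNN-MT distribution interpolates retrieval with the NMT model:
# p(y) = lmbda * p_knn(y) + (1 - lmbda) * p_nmt(y)
```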
LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following
paper_authors: Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, Kai-Wei Chang
for: improving generalization in Embodied Instruction Following so that agents can better execute tasks in unseen environments.
methods: uses contrastive learning and meta-actions to address the generalization problem in Embodied Instruction Following.
results: compared to a strong multi-modal Transformer baseline, the method achieved a significant 4.5% absolute gain in success rate in unseen environments of ALFRED Embodied Instruction Following.
Abstract
End-to-end Transformers have demonstrated an impressive success rate for Embodied Instruction Following when the environment has been seen in training. However, they tend to struggle when deployed in an unseen environment. This lack of generalizability is due to the agent's insensitivity to subtle changes in natural language instructions. To mitigate this issue, we propose explicitly aligning the agent's hidden states with the instructions via contrastive learning. Nevertheless, the semantic gap between high-level language instructions and the agent's low-level action space remains an obstacle. Therefore, we further introduce a novel concept of meta-actions to bridge the gap. Meta-actions are ubiquitous action patterns that can be parsed from the original action sequence. These patterns represent higher-level semantics that are intuitively aligned closer to the instructions. When meta-actions are applied as additional training signals, the agent generalizes better to unseen environments. Compared to a strong multi-modal Transformer baseline, we achieve a significant 4.5% absolute gain in success rate in unseen environments of ALFRED Embodied Instruction Following. Additional analysis shows that the contrastive objective and meta-actions are complementary in achieving the best results, and the resulting agent better aligns its states with corresponding instructions, making it more suitable for real-world embodied agents. The code is available at: https://github.com/joeyy5588/LACMA.
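As a rough illustration of the contrastive objective described above, the following sketch aligns agent hidden states with instruction embeddings using a symmetric InfoNCE loss over in-batch negatives; the shapes, temperature, and batching scheme are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(states, instructions, temperature=0.07):
    """states, instructions: (batch, dim); row i of each is a matching pair."""
    states = F.normalize(states, dim=-1)
    instructions = F.normalize(instructions, dim=-1)
    logits = states @ instructions.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(states.size(0))
    # Symmetric InfoNCE: states -> instructions and instructions -> states.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```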
Measuring Pointwise $\mathcal{V}$-Usable Information In-Context-ly
results: We conduct a comprehensive empirical analysis to evaluate the reliability of in-context PVI. Our findings indicate that in-context PVI estimates exhibit characteristics similar to the original PVI; in particular, they remain stable across different exemplar selections and numbers of shots. We also show how in-context PVI can be used to identify challenging instances. The paper highlights the potential of in-context PVI and the capabilities of ICL.
Abstract
In-context learning (ICL) is a new learning paradigm that has gained popularity along with the development of large language models. In this work, we adapt a recently proposed hardness metric, pointwise $\mathcal{V}$-usable information (PVI), to an in-context version (in-context PVI). Compared to the original PVI, in-context PVI is more efficient in that it requires only a few exemplars and does not require fine-tuning. We conducted a comprehensive empirical analysis to evaluate the reliability of in-context PVI. Our findings indicate that in-context PVI estimates exhibit similar characteristics to the original PVI. Specific to the in-context setting, we show that in-context PVI estimates remain consistent across different exemplar selections and numbers of shots. The variance of in-context PVI estimates across different exemplar selections is insignificant, which suggests that in-context PVI are stable. Furthermore, we demonstrate how in-context PVI can be employed to identify challenging instances. Our work highlights the potential of in-context PVI and provides new insights into the capabilities of ICL.
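A hedged sketch of the in-context PVI computation follows: the gold label is scored with and without the input, each time conditioned on the same exemplars, and the difference in log-probability (in bits) is the estimate. The prompt templates and the lm_logprob helper are hypothetical.

```python
import math

def in_context_pvi(lm_logprob, exemplars: str, x: str, y: str) -> float:
    """PVI(x -> y) = log2 p(y | exemplars, x) - log2 p(y | exemplars, null input)."""
    with_input = lm_logprob(prompt=f"{exemplars}\nInput: {x}\nLabel:", label=y)
    null_input = lm_logprob(prompt=f"{exemplars}\nInput: N/A\nLabel:", label=y)
    return (with_input - null_input) / math.log(2)   # natural log -> bits

# Low or negative in-context PVI flags instances that are hard for the model
# given the exemplars; high PVI means the input makes the label easy.
```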
Direct Neural Machine Translation with Task-level Mixture of Experts models
methods: The paper discusses approaches proposed to address the limitations of direct NMT systems, including multilingual NMT and pivot NMT (translation via English), and examines Task-level Mixture of Experts models (Task-level MoE), an inference-efficient variation of Transformer-based models.
results: The paper shows that Task-level MoE-based direct NMT systems perform strongly on a large number of low- and high-resource direct pairs and translation directions, outperforming bilingual and pivot NMT models for 7 language pairs.
Abstract
Direct neural machine translation (direct NMT) is a type of NMT system that translates text between two non-English languages. Direct NMT systems often face limitations due to the scarcity of parallel data between non-English language pairs. Several approaches have been proposed to address this limitation, such as multilingual NMT and pivot NMT (translation between two languages via English). Task-level Mixture of expert models (Task-level MoE), an inference-efficient variation of Transformer-based models, has shown promising NMT performance for a large number of language pairs. In Task-level MoE, different language groups can use different routing strategies to optimize cross-lingual learning and inference speed. In this work, we examine Task-level MoE's applicability in direct NMT and propose a series of high-performing training and evaluation configurations, through which Task-level MoE-based direct NMT systems outperform bilingual and pivot-based models for a large number of low- and high-resource direct pairs and translation directions. Our Task-level MoE with 16 experts outperforms bilingual NMT and pivot NMT models for 7 language pairs, while pivot-based models still performed better in 9 pairs and directions.
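The routing idea can be sketched as below, under the assumption that routing happens once per task (e.g., per language pair) rather than per token, which is what makes the scheme inference-efficient; module sizes and the argmax routing rule are illustrative.

```python
import torch
import torch.nn as nn

class TaskLevelMoE(nn.Module):
    """All tokens of a task share one expert, so no per-token dispatch is needed."""
    def __init__(self, dim=512, num_experts=16, num_tasks=32):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))
        self.router = nn.Embedding(num_tasks, num_experts)  # learned task->expert logits

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        expert = self.experts[int(self.router.weight[task_id].argmax())]
        return expert(x)

layer = TaskLevelMoE()
out = layer(torch.randn(4, 10, 512), task_id=3)   # one expert for the whole batch
```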
Understanding Retrieval Augmentation for Long-Form Question Answering
results: The study finds that the quality of the retrieved document set affects the quality of the answers an LM generates. It also identifies attribution patterns in long-form generation and the main culprits of the LM's attribution errors.
Abstract
We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs, by comparing answers generated from models while using the same evidence documents, and how differing quality of retrieval document set impacts the answers generated from the same LM. We study various attributes of generated answers (e.g., fluency, length, variance) with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. Our study provides new insights on how retrieval augmentation impacts long, knowledge-rich text generation of LMs. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provide directions for future work.
Simple Mechanisms for Representing, Indexing and Manipulating Concepts
results: The method can find common themes across related concepts and can be used to build a dictionary of concepts so that inputs are correctly routed to the set of concepts involved in their (latent) generation.
Abstract
Deep networks typically learn concepts via classifiers, which involves setting up a model and training it via gradient descent to fit the concept-labeled data. We will argue instead that learning a concept could be done by looking at its moment statistics matrix to generate a concrete representation or signature of that concept. These signatures can be used to discover structure across the set of concepts and could recursively produce higher-level concepts by learning this structure from those signatures. When the concepts are `intersected', signatures of the concepts can be used to find a common theme across a number of related `intersected' concepts. This process could be used to keep a dictionary of concepts so that inputs could correctly identify and be routed to the set of concepts involved in the (latent) generation of the input.
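As a loose illustration of the moment-statistics idea above, the sketch below represents a concept by the second-moment matrix of its example embeddings and scores a "common theme" between two concepts by the overlap of their top eigenspaces; both choices are assumptions made for illustration, not the paper's construction.

```python
import numpy as np

def concept_signature(embeddings: np.ndarray) -> np.ndarray:
    """embeddings: (n_examples, dim) -> (dim, dim) second-moment matrix."""
    return embeddings.T @ embeddings / len(embeddings)

def common_theme(sig_a: np.ndarray, sig_b: np.ndarray, k: int = 5) -> float:
    """Overlap of the top-k eigenspaces; 1.0 means identical subspaces."""
    _, va = np.linalg.eigh(sig_a)                     # eigenvectors, ascending order
    _, vb = np.linalg.eigh(sig_b)
    overlap = va[:, -k:].T @ vb[:, -k:]
    return float(np.linalg.norm(overlap) ** 2 / k)

sig = concept_signature(np.random.randn(200, 64))     # signature of one concept
```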
Pseudointelligence: A Unifying Framework for Language Model Evaluation
results: can be used to reason about two case studies in language model evaluation and to analyze existing evaluation methods.
Abstract
With large language models surpassing human performance on an increasing number of benchmarks, we must take a principled approach for targeted evaluation of model capabilities. Inspired by pseudorandomness, we propose pseudointelligence, which captures the maxim that "(perceived) intelligence lies in the eye of the beholder". That is, that claims of intelligence are meaningful only when their evaluator is taken into account. Concretely, we propose a complexity-theoretic framework of model evaluation cast as a dynamic interaction between a model and a learned evaluator. We demonstrate that this framework can be used to reason about two case studies in language model evaluation, as well as analyze existing evaluation methods.
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation
results: The study finds that IFT models default to male-inflected translations, even disregarding female occupational stereotypes, and that the models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Based on these findings, the authors propose a simple and effective bias mitigation solution via few-shot learning that leads to significantly fairer translations.
Abstract
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers
results: achieves accuracy improvements of up to 10% on the CFQ and COGS datasets without hyperparameter tuning.
Abstract
Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harnessing the power of dataset cartography (Swayamdipta et al., 2020). By strategically identifying a subset of compositional generalization data using this approach, we achieve a remarkable improvement in model accuracy, yielding enhancements of up to 10% on CFQ and COGS datasets. Notably, our technique incorporates dataset cartography as a curriculum learning criterion, eliminating the need for hyperparameter tuning while consistently achieving superior performance. Our findings highlight the untapped potential of dataset cartography in unleashing the full capabilities of compositional generalization within Transformer models. Our code is available at https://github.com/cyberiada/cartography-for-compositionality.
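The cartography criterion can be sketched in a few lines: record the model's probability of the gold label for every example at each epoch, then characterize each example by the mean (confidence) and standard deviation (variability) of that trajectory. Using the high-variability region as a curriculum signal, as below, is an illustrative reading of the approach, not the paper's exact selection rule.

```python
import numpy as np

def cartography(gold_probs: np.ndarray):
    """gold_probs: (n_epochs, n_examples) -> per-example (confidence, variability)."""
    return gold_probs.mean(axis=0), gold_probs.std(axis=0)

probs = np.random.rand(5, 1000)              # stand-in training dynamics
confidence, variability = cartography(probs)
ambiguous = np.argsort(-variability)[:330]   # high-variability ("ambiguous") subset
```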
On the Benefit of Generative Foundation Models for Human Activity Recognition
paper_authors: Zikang Leng, Hyeokhyen Kwon, Thomas Plötz
for: solves the problem of limited annotated data in human activity recognition (HAR) by using generative AI to autonomously generate virtual IMU data from text descriptions.
methods: uses Large Language Models (LLMs) and motion synthesis models to generate virtual IMU data.
results: identifies several promising research pathways that could benefit from generative AI in HAR, including generating benchmark datasets, developing foundational models specific to HAR, exploring hierarchical structures within HAR, breaking down complex activities, and applications in health sensing and activity summarization.
Abstract
In human activity recognition (HAR), the limited availability of annotated data presents a significant challenge. Drawing inspiration from the latest advancements in generative AI, including Large Language Models (LLMs) and motion synthesis models, we believe that generative AI can address this data scarcity by autonomously generating virtual IMU data from text descriptions. Beyond this, we spotlight several promising research pathways that could benefit from generative AI for the community, including the generation of benchmark datasets, the development of foundational models specific to HAR, the exploration of hierarchical structures within HAR, breaking down complex activities, and applications in health sensing and activity summarization.
Towards Safer Operations: An Expert-involved Dataset of High-Pressure Gas Incidents for Preventing Future Failures
results: Preliminary results show that NLP techniques can effectively analyze incident reports to prevent future failures. The dataset facilitates future research in the NLP and incident management communities, and access to it is provided (the IncidentAI dataset is available at: https://github.com/Cinnamon/incident-ai-dataset).
Abstract
This paper introduces a new IncidentAI dataset for safety prevention. Different from prior corpora that usually contain a single task, our dataset comprises three tasks: named entity recognition, cause-effect extraction, and information retrieval. The dataset is annotated by domain experts who have at least six years of practical experience as high-pressure gas conservation managers. We validate the contribution of the dataset in the scenario of safety prevention. Preliminary results on the three tasks show that NLP techniques are beneficial for analyzing incident reports to prevent future failures. The dataset facilitates future research in NLP and incident management communities. The access to the dataset is also provided (the IncidentAI dataset is available at: https://github.com/Cinnamon/incident-ai-dataset).
SPEED: Speculative Pipelined Execution for Efficient Decoding
results: exploits parameter sharing in Transformer decoders to amortize memory operations across tokens executed in parallel, improving the efficiency of generative LLM inference while balancing model accuracy against latency.
Abstract
Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
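In the spirit of the speculative execution described above, the hedged sketch below has a cheap draft model propose a few future tokens, verifies them with one parallel pass of the full model, and keeps the longest agreeing prefix. Note SPEED itself predicts from early-layer hidden states with parameter sharing; using a separate draft callable here is a simplification, and both draft and model are hypothetical callables returning per-position logits.

```python
import torch

@torch.no_grad()
def speculative_step(model, draft, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    proposal = ids
    for _ in range(k):                                  # cheap sequential drafting
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=1)
    verified = model(proposal)[:, :-1].argmax(-1)       # one parallel verification
    n = ids.size(1)
    drafted, checked = proposal[0, n:], verified[0, n - 1:]
    keep = int((drafted == checked).long().cumprod(0).sum())  # agreeing prefix
    if keep < k:                                        # on mismatch, take the
        fix = checked[keep].view(1, 1)                  # full model's own token
        return torch.cat([proposal[:, : n + keep], fix], dim=1)
    return proposal
```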
Code Book for the Annotation of Diverse Cross-Document Coreference of Entities in News Articles
results: The main contribution is a multi-layered annotation methodology that can be applied to the analysis of media bias through word choice and labeling.
Abstract
This paper presents a scheme for annotating coreference across news articles, extending beyond traditional identity relations by also considering near-identity and bridging relations. It includes a precise description of how to set up Inception, a respective annotation tool, how to annotate entities in news articles, connect them with diverse coreferential relations, and link them across documents to Wikidata's global knowledge graph. This multi-layered annotation approach is discussed in the context of the problem of media bias. Our main contribution lies in providing a methodology for creating a diverse cross-document coreference corpus which can be applied to the analysis of media bias by word-choice and labelling.
Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education
results: The study finds that these LLMs exhibit solid MCSB ability on Vietnamese MCQA tasks, particularly in the zero-shot and one-shot settings.
Abstract
In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, with fewer challenging MCQA datasets than in English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has focused on the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly focused on how ChatGPT solves the VNHSGE step by step. We aim to create a novel and high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict LaTeX style. We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models
results: Using CGCoT with a large language model (LLM), the paper scales affective speech on Twitter and shows that the resulting measures correlate strongly with human judgments, without requiring large amounts of labeled data.
Abstract
Existing text scaling methods often require a large corpus, struggle with short texts, or require labeled data. We develop a text scaling method that leverages the pattern recognition capabilities of generative large language models (LLMs). Specifically, we propose concept-guided chain-of-thought (CGCoT), which uses prompts designed to summarize ideas and identify target parties in texts to generate concept-specific breakdowns, in many ways similar to guidance for human coder content analysis. CGCoT effectively shifts pairwise text comparisons from a reasoning problem to a pattern recognition problem. We then pairwise compare concept-specific breakdowns using an LLM. We use the results of these pairwise comparisons to estimate a scale using the Bradley-Terry model. We use this approach to scale affective speech on Twitter. Our measures correlate more strongly with human judgments than alternative approaches like Wordfish. Besides a small set of pilot data to develop the CGCoT prompts, our measures require no additional labeled data and produce binary predictions comparable to a RoBERTa-Large model fine-tuned on thousands of human-labeled tweets. We demonstrate how combining substantive knowledge with LLMs can create state-of-the-art measures of abstract concepts.
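The final scaling step lends itself to a small worked example: given a matrix of pairwise wins from the LLM comparisons, Bradley-Terry scores can be fit by gradient ascent on the log-likelihood. The learning rate, step count, and plain gradient ascent (rather than, say, an MM algorithm) are illustrative choices.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, lr: float = 0.5, steps: int = 2000) -> np.ndarray:
    """wins[i, j] = number of comparisons in which text i beat text j."""
    theta = np.zeros(wins.shape[0])                    # latent scale values
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))  # P(i beats j)
        grad = (wins * (1 - p)).sum(1) - (wins.T * p).sum(1)
        theta += lr * grad / max(wins.sum(), 1)        # normalized ascent step
        theta -= theta.mean()                          # the scale is only relative
    return theta

scores = bradley_terry(np.random.randint(0, 5, size=(20, 20)))  # toy win matrix
```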
CORE: A Few-Shot Company Relation Classification Dataset for Robust Domain Adaptation
results: Experiments reveal substantial performance gaps for current RC models on the CORE dataset, confirming that models trained on other domains struggle to adapt to it. However, models trained on CORE show improved out-of-domain performance, which highlights the importance of high-quality data for robust domain adaptation.
Abstract
We introduce CORE, a dataset for few-shot relation classification (RC) focused on company relations and business entities. CORE includes 4,708 instances of 12 relation types with corresponding textual evidence extracted from company Wikipedia pages. Company names and business entities pose a challenge for few-shot RC models due to the rich and diverse information associated with them. For example, a company name may represent the legal entity, products, people, or business divisions depending on the context. Therefore, deriving the relation type between entities is highly dependent on textual context. To evaluate the performance of state-of-the-art RC models on the CORE dataset, we conduct experiments in the few-shot domain adaptation setting. Our results reveal substantial performance gaps, confirming that models trained on different domains struggle to adapt to CORE. Interestingly, we find that models trained on CORE showcase improved out-of-domain performance, which highlights the importance of high-quality data for robust domain adaptation. Specifically, the information richness embedded in business entities allows models to focus on contextual nuances, reducing their reliance on superficial clues such as relation-specific verbs. In addition to the dataset, we provide relevant code snippets to facilitate reproducibility and encourage further research in the field.
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation
results: Experiments show that both baseline methods struggle on some tasks, indicating that long-horizon tabletop reasoning remains challenging for current popular models.
Abstract
The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, \textit{LoHoRavens}, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. Furthermore, there is a key modality bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate the observation feedback during robot execution for the LLM's closed-loop planning, which is however less studied by prior work. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating long-horizon manipulation tasks are still challenging for current popular models. We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.
Gold: A Global and Local-aware Denoising Framework for Commonsense Knowledge Graph Noise Detection
methods: incorporates entity semantic information, global rules, and local structural information from the CSKG
results: outperforms all baseline methods in noise detection tasks on synthetic noisy CSKG benchmarks, and benefits the downstream zero-shot commonsense question-answering task on a real-world CSKG
Abstract
Commonsense Knowledge Graphs (CSKGs) are crucial for commonsense reasoning, yet constructing them through human annotations can be costly. As a result, various automatic methods have been proposed to construct CSKG with larger semantic coverage. However, these unsupervised approaches introduce spurious noise that can lower the quality of the resulting CSKG, which cannot be tackled easily by existing denoising algorithms due to the unique characteristics of nodes and structures in CSKGs. To address this issue, we propose Gold (Global and Local-aware Denoising), a denoising framework for CSKGs that incorporates entity semantic information, global rules, and local structural information from the CSKG. Experiment results demonstrate that Gold outperforms all baseline methods in noise detection tasks on synthetic noisy CSKG benchmarks. Furthermore, we show that denoising a real-world CSKG is effective and even benefits the downstream zero-shot commonsense question-answering task.
From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers
for: investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and multiplication.
methods: through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. we show that transformer models are able to generalize to long lengths with the help of targeted attention biasing.
results: we demonstrate that using ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.
Abstract
Since its introduction, the transformer model has demonstrated outstanding performance across various tasks. However, there are still unresolved issues regarding length generalization, particularly in algorithmic tasks. In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and multiplication. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we link to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.
Filling in the Gaps: Efficient Event Coreference Resolution using Graph Autoencoder Networks
results: The method significantly outperforms classical mention-pair approaches on a large Dutch event coreference corpus in terms of overall score, efficiency, and training speed. The models are also better at classifying more difficult coreference links and are far more robust in low-data settings.
Abstract
We introduce a novel and efficient method for Event Coreference Resolution (ECR) applied to a lower-resourced language domain. By framing ECR as a graph reconstruction task, we are able to combine deep semantic embeddings with structural coreference chain knowledge to create a parameter-efficient family of Graph Autoencoder models (GAE). Our method significantly outperforms classical mention-pair methods on a large Dutch event coreference corpus in terms of overall score, efficiency and training speed. Additionally, we show that our models are consistently able to classify more difficult coreference links and are far more robust in low-data settings when compared to transformer-based mention-pair coreference algorithms.
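To illustrate the graph-reconstruction framing, here is a hedged sketch of a graph autoencoder over mention embeddings with an inner-product decoder scoring candidate coreference links; the MLP encoder, dimensions, and toy training target are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MentionGAE(nn.Module):
    """Encode mentions, then reconstruct the coreference adjacency matrix."""
    def __init__(self, in_dim=768, hid=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid))

    def forward(self, mentions: torch.Tensor) -> torch.Tensor:
        z = self.encoder(mentions)                    # (n_mentions, hid)
        return torch.sigmoid(z @ z.t())               # pairwise link probabilities

model = MentionGAE()
mentions = torch.randn(10, 768)                       # stand-in mention embeddings
gold_adj = torch.eye(10)                              # toy coreference-chain target
loss = nn.functional.binary_cross_entropy(model(mentions), gold_adj)
```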
AMR Parsing with Causal Hierarchical Attention and Pointers
results: Experiments show that, with no additional data, our model outperforms baseline models on four of five benchmarks.
Abstract
Translation-based AMR parsers have recently gained popularity due to their simplicity and effectiveness. They predict linearized graphs as free texts, avoiding explicit structure modeling. However, this simplicity neglects structural locality in AMR graphs and introduces unnecessary tokens to represent coreferences. In this paper, we introduce new target forms of AMR parsing and a novel model, CHAP, which is equipped with causal hierarchical attention and the pointer mechanism, enabling the integration of structures into the Transformer decoder. We empirically explore various alternative modeling options. Experiments show that our model outperforms baseline models on four out of five benchmarks in the setting of no additional data.
Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences
results: performs much better than other efficient attention variants on medium-size datasets in terms of memory size and accuracy.
Abstract
Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $O(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}( \log n)$ levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$, depending on whether the queries are down-sampled or not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The Fast Multipole Attention mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
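The divide-and-conquer mechanism can be caricatured with two levels: exact attention inside a local window plus attention to average-pooled summaries of all keys for long-range interactions. The window size, pooling factor, and two-level restriction are simplifications of the multi-level scheme (and of its learned group weights).

```python
import torch
import torch.nn.functional as F

def two_level_attention(q, k, v, window: int = 64, pool: int = 8):
    """q, k, v: (n, d). Fine-grained local attention + coarse global attention."""
    n, d = q.shape
    k_coarse = F.avg_pool1d(k.t()[None], pool).squeeze(0).t()   # (n // pool, d)
    v_coarse = F.avg_pool1d(v.t()[None], pool).squeeze(0).t()
    out = torch.empty_like(v)
    for i in range(0, n, window):                               # exact local blocks
        sl = slice(i, min(i + window, n))
        w = sl.stop - sl.start
        scores = torch.cat([q[sl] @ k[sl].t(), q[sl] @ k_coarse.t()], -1) / d ** 0.5
        attn = scores.softmax(-1)
        out[sl] = attn[:, :w] @ v[sl] + attn[:, w:] @ v_coarse
    return out

y = two_level_attention(torch.randn(256, 32), torch.randn(256, 32), torch.randn(256, 32))
```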
Emptying the Ocean with a Spoon: Should We Edit Models?
results: The study argues that direct model editing cannot be trusted as a systematic remedy for the shortcomings of LLMs and, in some cases, may increase risk rather than reduce it.
Abstract
We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations. We contrast model editing with three similar but distinct approaches that pursue better defined objectives: (1) retrieval-based architectures, which decouple factual memory from inference and linguistic capabilities embodied in LLMs; (2) concept erasure methods, which aim at preventing systemic bias in generated text; and (3) attribution methods, which aim at grounding generations into identified textual sources. We argue that direct model editing cannot be trusted as a systematic remedy for the disadvantages inherent to LLMs, and while it has proven potential in improving model explainability, it opens risks by reinforcing the notion that models can be trusted for factuality. We call for cautious promotion and application of model editing as part of the LLM deployment process, and for responsibly limiting the use cases of LLMs to those not relying on editing as a critical component.
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
results: provides an automated music-processing system that lets users quickly find tools suited to their needs, freeing them from the intricacies of AI-music tooling so they can concentrate on the creative aspect.
Abstract
AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data and the model applicability across platforms among various tasks. Consequently, it is necessary to build a system to organize and integrate these tasks, and thus help practitioners to automatically analyze their demand and call suitable tools as solutions to fulfill their requirements. Inspired by the recent success of large language models (LLMs) in task automation, we develop a system, named MusicAgent, which integrates numerous music-related tools and an autonomous workflow to address user requirements. More specifically, we build 1) toolset that collects tools from diverse sources, including Hugging Face, GitHub, and Web API, etc. 2) an autonomous workflow empowered by LLMs (e.g., ChatGPT) to organize these tools and automatically decompose user requests into multiple sub-tasks and invoke corresponding music tools. The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect. By granting users the freedom to effortlessly combine tools, the system offers a seamless and enriching music experience.
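The agent architecture can be pictured with a toy dispatcher: an LLM decomposes the request into sub-tasks and each one is routed to a registered tool. The tool names, the llm callable, and the JSON plan format are hypothetical placeholders, not MusicAgent's actual interface.

```python
import json

TOOLS = {
    "music_classification": lambda x: f"[genre prediction for {x!r}]",
    "timbre_synthesis": lambda x: f"[synthesized audio for {x!r}]",
}

def handle_request(llm, request: str):
    """Ask the LLM for a plan, then invoke the chosen tools in order."""
    plan = json.loads(llm(
        f"Decompose the request into a JSON list of steps, each "
        f"{{'tool': one of {list(TOOLS)}, 'input': str}}.\nRequest: {request}"))
    return [TOOLS[step["tool"]](step["input"]) for step in plan]
```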
Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding
results: Experiments reveal qualitative differences in behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, measurable both at the global dataset level and for specific word representations.
Abstract
Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying what the effects are - if any - of providing models with richer input sources than text-only. The crux of it lies in the construction of comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performances. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level as well as for specific word representations, depending on how concrete their semantics is.
Investigating semantic subspaces of Transformer sentence embeddings through linear structural probing
results: finds that model families differ substantially in their performance and layer dynamics, but that the results are largely model-size invariant.
Abstract
The question of what kinds of linguistic information are encoded in different layers of Transformer-based language models is of considerable interest for the NLP community. Existing work, however, has overwhelmingly focused on word-level representations and encoder-only language models with the masked-token training objective. In this paper, we present experiments with semantic structural probing, a method for studying sentence-level representations via finding a subspace of the embedding space that provides suitable task-specific pairwise distances between data-points. We apply our method to language models from different families (encoder-only, decoder-only, encoder-decoder) and of different sizes in the context of two tasks, semantic textual similarity and natural-language inference. We find that model families differ substantially in their performance and layer dynamics, but that the results are largely model-size invariant.
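A minimal version of the probing method might look as follows: learn a low-rank linear map so that squared distances between projected sentence embeddings match task-specific target distances. The rank, loss, and optimizer are illustrative assumptions in the spirit of structural probes.

```python
import torch

def train_probe(emb, target_dist, rank=64, steps=300, lr=1e-2):
    """emb: (n, d) sentence embeddings; target_dist: (n, n) gold task distances."""
    B = torch.randn(emb.size(1), rank, requires_grad=True)
    opt = torch.optim.Adam([B], lr=lr)
    for _ in range(steps):
        pred = torch.cdist(emb @ B, emb @ B) ** 2     # distances in the subspace
        loss = (pred - target_dist).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return B.detach()

emb = torch.randn(100, 768)
B = train_probe(emb, torch.rand(100, 100))            # stand-in target distances
```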
Rather a Nurse than a Physician – Contrastive Explanations under Investigation
paper_authors: Oliver Eberle, Ilias Chalkidis, Laura Cabello, Stephanie Brandl
for: The paper investigates the claim that contrastive explanations are closer to human explanations than non-contrastive explanations.
methods: The paper uses four English text-classification datasets and fine-tunes three different models (RoBERTa, GPT-2, and T5), each in three different sizes, applying three post-hoc explainability methods (LRP, GradientxInput, and GradNorm) to extract explanations.
results: The paper finds high agreement between model-based rationales and human annotations in both contrastive and non-contrastive settings. Model-based explanations computed in both settings align equally well with human rationales, indicating that humans do not necessarily explain in a contrastive manner.
Abstract
Contrastive explanations, where one decision is explained in contrast to another, are supposed to be closer to how humans explain a decision than non-contrastive explanations, where the decision is not necessarily referenced to an alternative. This claim has never been empirically validated. We analyze four English text-classification datasets (SST2, DynaSent, BIOS and DBpedia-Animals). We fine-tune and extract explanations from three different models (RoBERTa, GPT-2, and T5), each in three different sizes, and apply three post-hoc explainability methods (LRP, GradientxInput, GradNorm). We furthermore collect and release human rationale annotations for a subset of 100 samples from the BIOS dataset for contrastive and non-contrastive settings. A cross-comparison between model-based rationales and human annotations, both in contrastive and non-contrastive settings, yields a high agreement between the two settings for models as well as for humans. Moreover, model-based explanations computed in both settings align equally well with human rationales. Thus, we empirically find that humans do not necessarily explain in a contrastive manner.
9 pages, long paper at ACL 2022 proceedings.
From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification
methods: We collect a novel dataset, RAVE: Rationale Variation in ECHR, annotated by two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level, task-independent taxonomy supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP to focus on human label variation.
results: Quantitative assessment of the taxonomy categories shows that disagreements mainly stem from underspecification of the legal context, which is challenging given the typically limited granularity of and noise in COC metadata. We further assess the explainability of state-of-the-art COC models on RAVE and observe limited agreement between models and experts. Overall, the case study reveals underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying the aspects of a case's facts supposedly relevant to its outcome.
Abstract
In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RAVE: Rationale Variation in ECHR1, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in the legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainablility of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.
The Curious Case of Hallucinatory Unanswerablity: Finding Truths in the Hidden States of Over-Confident Large Language Models
for: investigate the behavior of LLMs when presented with unanswerable queries
methods: use a combination of human evaluation and automated metrics to study the representation of answerability in LLMs’ latent spaces
results: find strong indications that LLMs encode the answerability of input queries, with the representation of the first decoded token often being a strong indicator; this can be used to develop improved decoding techniques for factual generation, particularly where query unanswerability is a concern.
Abstract
Large language models (LLMs) have been shown to possess impressive capabilities, while also raising crucial concerns about the faithfulness of their responses. A primary issue arising in this context is the management of unanswerable queries by LLMs, which often results in hallucinatory behavior, due to overconfidence. In this paper, we explore the behavior of LLMs when presented with unanswerable queries. We ask: do models \textbf{represent} the fact that the question is unanswerable when generating a hallucinatory answer? Our results show strong indications that such models encode the answerability of an input query, with the representation of the first decoded token often being a strong indicator. These findings shed new light on the spatial organization within the latent representations of LLMs, unveiling previously unexplored facets of these models. Moreover, they pave the way for the development of improved decoding techniques with better adherence to factual generation, particularly in scenarios where query unanswerability is a concern.
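One way to picture the finding is a linear probe over the hidden state of the first decoded token, as sketched below with stand-in data; extracting those states from an actual LLM, and the logistic-regression probe itself, are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

first_token_states = np.random.randn(500, 4096)   # stand-in hidden states
answerable = np.random.randint(0, 2, size=500)    # stand-in gold labels

probe = LogisticRegression(max_iter=1000).fit(first_token_states, answerable)
print("probe accuracy:", probe.score(first_token_states, answerable))
# High accuracy on real states would mean answerability is linearly encoded,
# e.g. enabling a decoder to abstain when a query is predicted unanswerable.
```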
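The paper's core finding, that answerability is linearly recoverable from the representation of the first decoded token, suggests a simple probing experiment. A hedged sketch using synthetic features in place of real hidden states (in practice these would be extracted from the LLM during generation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for hidden states of the first decoded token, one row per query.
# With a real model these would come from generation with output_hidden_states=True.
hidden_dim = 64
X = rng.normal(size=(500, hidden_dim))
y = rng.integers(0, 2, size=500)  # 1 = answerable, 0 = unanswerable (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# ~0.5 on this random data; well above chance if answerability is actually encoded.
print("probe accuracy:", probe.score(X_te, y_te))
```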
Text Annotation Handbook: A Practical Guide for Machine Learning Projects
paper_authors: Felix Stollenwerk, Joey Öhman, Danila Petrelli, Emma Wallerö, Fredrik Olsson, Camilla Bengtsson, Andreas Horndahl, Gabriela Zarzar Gandler
for: This handbook is a practical guide to text annotation tasks, introducing the basic concepts and offering hands-on advice.
methods: The topics covered are mostly technical, but business, ethical and regulatory issues are also touched upon.
results: The focus lies on readability and conciseness rather than completeness and scientific rigor. The handbook may serve a wide range of professions, such as team leaders, project managers, IT architects, software developers and machine learning engineers.Abstract
This handbook is a hands-on guide on how to approach text annotation tasks. It provides a gentle introduction to the topic, an overview of theoretical concepts as well as practical advice. The topics covered are mostly technical, but business, ethical and regulatory issues are also touched upon. The focus lies on readability and conciseness rather than completeness and scientific rigor. Experience with annotation and knowledge of machine learning are useful but not required. The document may serve as a primer or reference book for a wide range of professions such as team leaders, project managers, IT architects, software developers and machine learning engineers.
Language Agents for Detecting Implicit Stereotypes in Text-to-image Models at Scale
results: The study finds that several commercial products and popular open-source text-to-image models often display serious stereotypes for certain prompts, across social dimensions such as gender, race and religion.Abstract
The recent surge in the research of diffusion models has accelerated the adoption of text-to-image models in various Artificial Intelligence Generated Content (AIGC) commercial products. While these exceptional AIGC products are gaining increasing recognition and sparking enthusiasm among consumers, the questions regarding whether, when, and how these models might unintentionally reinforce existing societal stereotypes remain largely unaddressed. Motivated by recent advancements in language agents, here we introduce a novel agent architecture tailored for stereotype detection in text-to-image models. This versatile agent architecture is capable of accommodating free-form detection tasks and can autonomously invoke various tools to facilitate the entire process, from generating corresponding instructions and images, to detecting stereotypes. We build the stereotype-relevant benchmark based on multiple open-text datasets, and apply this architecture to commercial products and popular open source text-to-image models. We find that these models often display serious stereotypes when it comes to certain prompts about personal characteristics, social cultural context and crime-related aspects. In summary, these empirical findings underscore the pervasive existence of stereotypes across social dimensions, including gender, race, and religion, which not only validate the effectiveness of our proposed approach, but also emphasize the critical necessity of addressing potential ethical risks in the burgeoning realm of AIGC. As AIGC continues its rapid expansion trajectory, with new models and plugins emerging daily in staggering numbers, the challenge lies in the timely detection and mitigation of potential biases within these models.
Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling
results: Compared with previous state-of-the-art methods, Longformer equipped with the proposed approach significantly improves $F_1$ (73.74 -> 77.16) and reduces $P_k$ (15.0 -> 13.89) on WIKI-727K, and achieves an average relative $P_k$ reduction of 4.3% on WikiSection.Abstract
Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Due to its ability of automatically exploring clues of topic shift from abundant labeled data, recent supervised neural models have greatly promoted the development of long document topic segmentation, while leaving the deeper relationship between coherence and topic segmentation underexplored. Therefore, this paper enhances the ability of supervised models to capture coherence from both logical structure and semantic similarity perspectives to further improve the topic segmentation performance, proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). Specifically, the TSSP task is proposed to force the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at topic and sentence levels. Moreover, we utilize inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that the sentence representations in the same topic have higher similarity, while those in different topics are less similar. Extensive experiments show that the Longformer with our approach significantly outperforms the old state-of-the-art (SOTA) methods. Our approach improves the $F_1$ of the old SOTA by 3.42 (73.74 -> 77.16) and reduces $P_k$ by 1.11 points (15.0 -> 13.89) on WIKI-727K, and achieves an average relative reduction of 4.3% in $P_k$ on WikiSection. The average relative $P_k$ drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.
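The CSSL objective can be pictured as an InfoNCE-style contrastive loss that pulls same-topic sentence representations together and pushes different-topic ones apart. A minimal sketch, assuming precomputed sentence embeddings (the paper's pair construction from inter- and intra-topic information is more involved):

```python
import torch
import torch.nn.functional as F

def contrastive_topic_loss(anchor, positive, negatives, temperature=0.1):
    """anchor/positive: (d,) same-topic sentence embeddings;
    negatives: (k, d) embeddings drawn from other topics."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = anchor @ positive / temperature            # scalar similarity
    neg_sim = negatives @ anchor / temperature           # (k,) similarities
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])  # positive at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random vectors in place of Longformer sentence states.
d, k = 32, 8
loss = contrastive_topic_loss(torch.randn(d), torch.randn(d), torch.randn(k, d))
print(loss.item())
```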
results: We report on the performance of the resulting model, demonstrating its reliability and efficiency.Abstract
We have trained a named entity recognition (NER) model that screens Swedish job ads for different kinds of useful information (e.g. skills required from a job seeker). It was obtained by fine-tuning KB-BERT. The biggest challenge we faced was the creation of a labelled dataset, which required manual annotation. This paper gives an overview of the methods we employed to make the annotation process more efficient and to ensure high quality data. We also report on the performance of the resulting model.
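Fine-tuning KB-BERT for token classification follows the standard Hugging Face recipe. A sketch under stated assumptions: the checkpoint name is the publicly released KB-BERT, while the label set below is hypothetical, since the authors' actual annotation schema is not reproduced here:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "KB/bert-base-swedish-cased"  # the released KB-BERT checkpoint

# Hypothetical label set for job-ad information; the authors' schema differs.
labels = ["O", "B-SKILL", "I-SKILL", "B-OCCUPATION", "I-OCCUPATION"]

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# A toy Swedish job-ad snippet; real training would run a standard
# token-classification loop (e.g. transformers.Trainer) over the labelled ads.
enc = tokenizer("Vi söker en utvecklare med erfarenhet av Python.", return_tensors="pt")
print(model(**enc).logits.shape)  # (1, sequence_length, num_labels)
```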
A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction
results: The study shows that providing similar cases and multiple-choice options helps LLMs recall domain knowledge, but when the IR system is sufficiently strong, the LLM's contribution becomes redundant.Abstract
Large language models (LLMs) have demonstrated great potential for domain-specific applications, such as the law domain. However, recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks. To systematically investigate their competency in the law, we design practical baseline solutions based on LLMs and test on the task of legal judgment prediction. In our solutions, LLMs can work alone to answer open questions or coordinate with an information retrieval (IR) system to learn from similar cases or solve simplified multi-choice questions. We show that similar cases and multi-choice options, namely label candidates, included in prompts can help LLMs recall domain knowledge that is critical for expert legal reasoning. We additionally present an intriguing paradox wherein an IR system surpasses the performance of LLM+IR due to limited gains acquired by weaker LLMs from powerful IR systems. In such cases, the role of LLMs becomes redundant. Our evaluation pipeline can be easily extended into other tasks to facilitate evaluations in other domains. Code is available at https://github.com/srhthu/LM-CompEval-Legal
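The paper's LLM+IR baseline amounts to packing retrieved similar cases and label candidates into the prompt. A rough illustration of such prompt assembly (templates and field names are invented here, not the paper's exact format):

```python
def build_judgment_prompt(case_facts, similar_cases, label_candidates):
    """Assemble a multi-choice legal judgment prompt from retrieved context."""
    context = "\n\n".join(
        f"Similar case {i + 1}: {c['facts']}\nJudgment: {c['judgment']}"
        for i, c in enumerate(similar_cases)
    )
    options = "\n".join(f"({chr(65 + i)}) {lab}" for i, lab in enumerate(label_candidates))
    return (
        f"{context}\n\n"
        f"Current case facts: {case_facts}\n"
        f"Choose the most likely judgment:\n{options}\nAnswer:"
    )

prompt = build_judgment_prompt(
    "The defendant was found in possession of stolen goods...",
    [{"facts": "Possession of stolen property worth 5,000 yuan...", "judgment": "theft"}],
    ["theft", "fraud", "robbery"],
)
print(prompt)  # this string would then be sent to the LLM
```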
results: The study finds that fine-tuning can improve ChatGPT's emotion recognition performance, but performance varies across emotion labels and datasets, indicating inherent instability and possible bias.Abstract
This technical report explores the ability of ChatGPT in recognizing emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition is not yet explored. Here, we conducted experiments to evaluate its performance of emotion recognition across different datasets and emotion labels. Our findings indicate a reasonable level of reproducibility in its performance, with noticeable improvement through fine-tuning. However, the performance varies with different emotion labels and datasets, highlighting an inherent instability and possible bias. The choice of dataset and emotion labels significantly impacts ChatGPT's emotion recognition performance. This paper sheds light on the importance of dataset and label selection, and the potential of fine-tuning in enhancing ChatGPT's emotion recognition capabilities, providing a groundwork for better integration of emotion analysis in applications using ChatGPT.
Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting
results: Experimental results show that aligned LMs carry two distinct kinds of uncertainty, responsible for answer decisions and format preference respectively. The authors further investigate how these two uncertainties affect aligned LMs' calibration via simple synthetic alignment schemes, and propose an easy-to-implement method to calibrate aligned LMs.Abstract
Despite the significant progress made in practical applications of aligned language models (LMs), they tend to be overconfident in output answers compared to the corresponding pre-trained LMs. In this work, we systematically evaluate the impact of the alignment process on logit-based uncertainty calibration of LMs under the multiple-choice setting. We first conduct a thoughtful empirical study on how aligned LMs differ in calibration from their pre-trained counterparts. Experimental results reveal that there are two distinct uncertainties in LMs under the multiple-choice setting, which are responsible for the answer decision and the format preference of the LMs, respectively. Then, we investigate the role of these two uncertainties on aligned LM's calibration through fine-tuning in simple synthetic alignment schemes and conclude that one reason for aligned LMs' overconfidence is the conflation of these two types of uncertainty. Furthermore, we examine the utility of common post-hoc calibration methods for aligned LMs and propose an easy-to-implement and sample-efficient method to calibrate aligned LMs. We hope our findings could provide insights into the design of more reliable alignment processes for LMs.
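Logit-based calibration in the multiple-choice setting is typically measured with expected calibration error (ECE) over the softmax of option logits, and temperature scaling is a common post-hoc fix. A self-contained sketch on synthetic logits (not the paper's data or its proposed method):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE over the model's top-choice confidence."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy option logits for 1,000 four-way multiple-choice questions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 4)) * 3.0  # deliberately overconfident scale
answers = rng.integers(0, 4, size=1000)

for T in (1.0, 2.0):  # T > 1 softens an overconfident model's probabilities
    probs = softmax(logits / T)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == answers).astype(float)
    print(f"T={T}: ECE={expected_calibration_error(conf, correct):.3f}")
```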
Chain-of-Thought Tuning: Masked Language Models can also Think Step By Step in Natural Language Understanding
for: This paper aims to improve the performance of Masked Language Models (MLMs) on Natural Language Understanding (NLU) tasks by extending the success of the Chain-of-Thought (CoT) technique from LLMs to MLMs.
methods: The proposed method, Chain-of-Thought Tuning (CoTT), is a two-step reasoning framework based on prompt tuning that enables MLMs to implement step-by-step thinking for NLU tasks.
results: The experiments on two NLU tasks, hierarchical classification and relation extraction, show that CoTT outperforms baselines and achieves state-of-the-art performance.Abstract
Chain-of-Thought (CoT) is a technique that guides Large Language Models (LLMs) to decompose complex tasks into multi-step reasoning through intermediate steps in natural language form. Briefly, CoT enables LLMs to think step by step. However, although many Natural Language Understanding (NLU) tasks also require thinking step by step, LLMs perform less well than small-scale Masked Language Models (MLMs). To migrate CoT from LLMs to MLMs, we propose Chain-of-Thought Tuning (CoTT), a two-step reasoning framework based on prompt tuning, to implement step-by-step thinking for MLMs on NLU tasks. From the perspective of CoT, CoTT's two-step framework enables MLMs to implement task decomposition; CoTT's prompt tuning allows intermediate steps to be used in natural language form. Thereby, the success of CoT can be extended to NLU tasks through MLMs. To verify the effectiveness of CoTT, we conduct experiments on two NLU tasks: hierarchical classification and relation extraction, and the results show that CoTT outperforms baselines and achieves state-of-the-art performance.
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
results: LLMs trained with reflection-tuned (recycled) data outperform those trained with the original datasets across various evaluation benchmarks.Abstract
Recent advancements in Large Language Models (LLMs) have expanded the horizons of natural language understanding and generation. Notably, the output control and alignment with the input of LLMs can be refined through instruction tuning. However, as highlighted in several studies, low-quality data in the training set are usually detrimental to instruction tuning, resulting in inconsistent or even misleading LLM outputs. We propose a novel method, termed "reflection-tuning," which addresses the problem by self-improvement and judging capabilities of LLMs. This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data. Extensive experiments on widely used evaluation benchmarks show that LLMs trained with our recycled data outperform those trained with existing datasets in various benchmarks.
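Reflection-tuning's recycling step can be pictured as passing each instruction-response pair through an oracle LLM that critiques and rewrites it. A minimal sketch with a stand-in oracle; the template and output parsing are illustrative, not the paper's prompts:

```python
CRITIQUE_TEMPLATE = (
    "Below is an instruction and a response. Point out weaknesses in the "
    "instruction's clarity and the response's quality, then rewrite both.\n\n"
    "Instruction: {instruction}\nResponse: {response}\n\n"
    "Improved instruction:"
)

def recycle_dataset(pairs, oracle):
    """Pass each (instruction, response) pair through an oracle LLM for rewriting.
    `oracle` is any callable prompt -> text; parsing here is deliberately simplistic."""
    recycled = []
    for instruction, response in pairs:
        out = oracle(CRITIQUE_TEMPLATE.format(instruction=instruction, response=response))
        new_instruction, _, new_response = out.partition("Improved response:")
        recycled.append((new_instruction.strip(), new_response.strip() or response))
    return recycled

# Toy oracle; a real run would call a strong LLM API here.
demo = recycle_dataset(
    [("Explain photosynthesis", "Plants make food.")],
    oracle=lambda p: ("Explain photosynthesis step by step."
                      "Improved response: Plants convert light into chemical energy..."),
)
print(demo)
```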
MISAR: A Multimodal Instructional System with Augmented Reality
results: The study shows that integrating large language models (LLMs) enables enhanced state estimation of task performance, a step towards more adaptive AR systems.Abstract
Augmented reality (AR) requires the seamless integration of visual, auditory, and linguistic channels for optimized human-computer interaction. While auditory and visual inputs facilitate real-time and contextual user guidance, the potential of large language models (LLMs) in this landscape remains largely untapped. Our study introduces an innovative method harnessing LLMs to assimilate information from visual, auditory, and contextual modalities. Focusing on the unique challenge of task performance quantification in AR, we utilize egocentric video, speech, and context analysis. The integration of LLMs facilitates enhanced state estimation, marking a step towards more adaptive AR systems. Code, dataset, and demo will be available at https://github.com/nguyennm1024/misar.
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
paper_authors: Jiefeng Chen, Jinsung Yoon, Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, Somesh Jha
for: Improve the reliability of large language models in high-stakes decision-making scenarios.
methods: A parameter-efficient tuning approach based on self-evaluation that adapts the LLM to the specific task at hand.
results: Evaluated on several question-answering (QA) datasets, the method outperforms state-of-the-art selective prediction approaches; for example, on the CoQA benchmark it improves AUACC from 91.23% to 92.63% and AUROC from 74.61% to 80.25%.Abstract
Large language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. Selective prediction is a technique that can be used to improve the reliability of the LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and improves the AUROC from 74.61% to 80.25%.
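Selective prediction trades coverage for accuracy: the model abstains whenever its self-evaluation score falls below a threshold. A small sketch of this trade-off on synthetic scores (metrics such as AUACC/AUROC would then be computed by sweeping the threshold):

```python
import numpy as np

def selective_metrics(scores, correct, threshold):
    """Answer only when the self-evaluation score clears the threshold."""
    answered = scores >= threshold
    coverage = answered.mean()
    selective_acc = correct[answered].mean() if answered.any() else float("nan")
    return coverage, selective_acc

rng = np.random.default_rng(0)
n = 2000
correct = rng.integers(0, 2, size=n).astype(float)
# Synthetic self-evaluation scores, mildly informative about correctness.
scores = np.clip(0.5 * correct + rng.normal(0.3, 0.25, size=n), 0, 1)

for t in (0.2, 0.5, 0.8):
    cov, acc = selective_metrics(scores, correct, t)
    print(f"threshold={t}: coverage={cov:.2f}, selective accuracy={acc:.2f}")
```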
Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention
results: The study finds that softmax attention achieves higher performance in most scenarios, while linear attention offers lower computational complexity.Abstract
Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons for why softmax attention outperforms linear attention in most scenarios.
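The computational contrast the paper studies is easy to see in code: softmax attention materializes an n x n score matrix, while kernelized linear attention reorders the computation to avoid it, at the cost of only approximating softmax. A numpy sketch (the feature map phi below is one common choice, not the only one):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the (n, n) score matrix makes cost quadratic in length."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized variant: phi(Q) (phi(K)^T V) never forms the n x n matrix,
    giving cost linear in sequence length but only approximating softmax."""
    Qp, Kp = phi(Q), phi(K)
    context = Kp.T @ V                # (d, d_v), independent of sequence length
    normalizer = Qp @ Kp.sum(axis=0)  # (n,)
    return (Qp @ context) / normalizer[:, None]

n, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
# Nonzero gap: the approximation is where the performance difference originates.
print(np.round(softmax_attention(Q, K, V) - linear_attention(Q, K, V), 2))
```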
Open-ended Commonsense Reasoning with Unrestricted Answer Scope
methods: Use pre-trained language models to iteratively retrieve reasoning paths over an external knowledge base, without task-specific supervision.
results: Experiments on two commonsense benchmark datasets show that the proposed method outperforms other approaches, both quantitatively and qualitatively.Abstract
Open-ended Commonsense Reasoning is defined as solving a commonsense question without providing 1) a short list of answer candidates and 2) a pre-defined answer scope. Conventional ways of formulating the commonsense question into a question-answering form or utilizing external knowledge to learn retrieval-based methods are less applicable in the open-ended setting due to an inherent challenge. Without pre-defining an answer scope or a few candidates, open-ended commonsense reasoning entails predicting answers by searching over an extremely large searching space. Moreover, most questions require implicit multi-hop reasoning, which presents even more challenges to our problem. In this work, we leverage pre-trained language models to iteratively retrieve reasoning paths on the external knowledge base, which does not require task-specific supervision. The reasoning paths can help to identify the most precise answer to the commonsense question. We conduct experiments on two commonsense benchmark datasets. Compared to other approaches, our proposed method achieves better performance both quantitatively and qualitatively.
MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction
results: Experiments show that a data augmentation strategy with high Affinity and appropriate Diversity better improves GEC model performance. The proposed MixEdit approach strategically and dynamically augments realistic data without requiring extra monolingual corpora.Abstract
Data Augmentation through generating pseudo data has been proven effective in mitigating the challenge of data scarcity in the field of Grammatical Error Correction (GEC). Various augmentation strategies have been widely explored, most of which are motivated by two heuristics, i.e., increasing the distribution similarity and diversity of pseudo data. However, the underlying mechanism responsible for the effectiveness of these strategies remains poorly understood. In this paper, we aim to clarify how data augmentation improves GEC models. To this end, we introduce two interpretable and computationally efficient measures: Affinity and Diversity. Our findings indicate that an excellent GEC data augmentation strategy characterized by high Affinity and appropriate Diversity can better improve the performance of GEC models. Based on this observation, we propose MixEdit, a data augmentation approach that strategically and dynamically augments realistic data, without requiring extra monolingual corpora. To verify the correctness of our findings and the effectiveness of the proposed MixEdit, we conduct experiments on mainstream English and Chinese GEC datasets. The results show that MixEdit substantially improves GEC models and is complementary to traditional data augmentation methods.
Field-testing items using artificial intelligence: Natural language processing with transformers
results: Five thousand variations of the RoBERTa model completed a 29-item multiple-choice English literacy exam; the response data were used to compute the items' psychometric properties, which showed some agreement with those obtained from human examinees.Abstract
Five thousand variations of the RoBERTa model, an artificially intelligent "transformer" that can understand written language, completed an English literacy exam with 29 multiple-choice questions. Data were used to calculate the psychometric properties of the items, which showed some degree of agreement to those obtained from human examinee data.
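Classical item statistics of the kind reported here, difficulty as proportion correct and discrimination as an item-rest correlation, are straightforward to compute from a scored response matrix. A sketch on synthetic responses (the real data come from the 5,000 RoBERTa variants):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 0/1 response matrix: 5000 examinees x 29 items, with per-item
# base rates standing in for the scored RoBERTa answers.
responses = (rng.random((5000, 29)) < rng.uniform(0.3, 0.9, size=29)).astype(float)

difficulty = responses.mean(axis=0)  # proportion correct per item
total = responses.sum(axis=1)

def point_biserial(item, total_scores):
    """Item discrimination: correlation of the item with the rest-score."""
    rest = total_scores - item
    return np.corrcoef(item, rest)[0, 1]

discrimination = np.array([point_biserial(responses[:, j], total) for j in range(29)])
print(np.round(difficulty[:5], 2), np.round(discrimination[:5], 2))
```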
Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model
results: Experiments show that FFLM performs competitively with, or even outperforms, ChatGPT across tasks while using 24x fewer parameters, and also surpasses other strong baselines.Abstract
Despite tremendous improvements in natural language generation, summarization models still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using models trained on other tasks or in-domain synthetic data, or prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric FFLM, which is a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.
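FFLM's intuition, that a prefix consistent with the output raises the output's probability, can be checked directly with token-level log-probabilities from a small causal LM. A hedged sketch with GPT-2 (FFLM itself combines several such probability changes; this shows only the basic measurement):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def summary_logprob(prefix: str, summary: str) -> float:
    """Total log-probability the LM assigns to the summary tokens given the prefix.
    Assumes the prefix/summary boundary falls on a token boundary (true here for
    GPT-2 because the summary starts with a space)."""
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + summary, return_tensors="pt").input_ids
    logits = lm(full_ids).logits[0, :-1]  # position t predicts token t+1
    targets = full_ids[0, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    token_lp = logp[torch.arange(len(targets)), targets]
    return token_lp[n_prefix - 1:].sum().item()  # keep only summary-token terms

source = "The city council approved the new park budget on Monday."
summary = " The council approved the park budget."
# Consistent conditioning should raise the summary's probability
# relative to an unrelated prefix.
print(summary_logprob(source, summary) > summary_logprob("Tigers are large cats.", summary))
```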
Systematic Assessment of Factual Knowledge in Large Language Models
results: Experiments show that ChatGPT is consistently the top performer across domains, and that LLM performance depends on instruction finetuning, domain and question complexity, and is prone to adversarial context.Abstract
Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate the state-of-the-art LLMs with KGs in generic and specific domains. The experiment shows that ChatGPT is consistently the top performer across all domains. We also find that LLMs performance depends on the instruction finetuning, domain and question complexity and is prone to adversarial context.
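The framework's question generation from KG facts can be approximated with simple relation templates. A toy sketch (relations and templates are invented for illustration; the paper's generation is automatic and broader):

```python
# Hypothetical relation templates; the paper generates QA pairs from KG facts automatically.
TEMPLATES = {
    "has_capital": "What is the capital of {subject}?",
    "written_by": "Who wrote {subject}?",
}

def questions_from_kg(triples):
    """Turn (subject, relation, object) facts into (question, expected_answer) pairs."""
    return [
        (TEMPLATES[rel].format(subject=subj), obj)
        for subj, rel, obj in triples
        if rel in TEMPLATES
    ]

kg = [("France", "has_capital", "Paris"), ("Hamlet", "written_by", "William Shakespeare")]
for question, answer in questions_from_kg(kg):
    print(question, "->", answer)
# The LLM's free-form answer is then scored against the expected answer.
```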
MAGNIFICo: Evaluating the In-Context Learning Ability of Large Language Models to Generalize to Novel Interpretations
results: Experimental results show that LLMs comprehend novel interpretations well, both from natural language descriptions and from discussions within long conversations. However, performance still falls short when interpreting unfamiliar words or when composing multiple novel interpretations simultaneously in the same example. The analysis additionally uncovers semantic predispositions in LLMs and the impact of recency bias for information presented in long contexts.Abstract
Humans possess a remarkable ability to assign novel interpretations to linguistic expressions, enabling them to learn new words and understand community-specific connotations. However, Large Language Models (LLMs) have a knowledge cutoff and are costly to finetune repeatedly. Therefore, it is crucial for LLMs to learn novel interpretations in-context. In this paper, we systematically analyse the ability of LLMs to acquire novel interpretations using in-context learning. To facilitate our study, we introduce MAGNIFICo, an evaluation suite implemented within a text-to-SQL semantic parsing framework that incorporates diverse tokens and prompt settings to simulate real-world complexity. Experimental results on MAGNIFICo demonstrate that LLMs exhibit a surprisingly robust capacity for comprehending novel interpretations from natural language descriptions as well as from discussions within long conversations. Nevertheless, our findings also highlight the need for further improvements, particularly when interpreting unfamiliar words or when composing multiple novel interpretations simultaneously in the same example. Additionally, our analysis uncovers the semantic predispositions in LLMs and reveals the impact of recency bias for information presented in long contexts.