results: The method supports fast, one-click inspection of article topic contents, with time-series and word-cloud charts and statistical tests for topic emergence over arbitrarily selected time windows.
Abstract
The COVID-19 pandemic has changed the research agendas of most scientific communities, resulting in an overwhelming production of research articles in a variety of domains, including medicine, virology, epidemiology, economy, psychology, and so on. Several open-access corpora and literature hubs were established; among them, the COVID-19 Open Research Dataset (CORD-19) has systematically gathered scientific contributions for 2.5 years, by collecting and indexing over one million articles. Here, we present the CORD-19 Topic Visualizer (CORToViz), a method and associated visualization tool for inspecting the CORD-19 textual corpus of scientific abstracts. Our method is based upon a careful selection of up-to-date technologies (including large language models), resulting in an architecture for clustering articles along orthogonal dimensions and extraction techniques for temporal topic mining. Topic inspection is supported by an interactive dashboard, providing fast, one-click visualization of topic contents as word clouds and topic trends as time series, equipped with easy-to-drive statistical testing for analyzing the significance of topic emergence along arbitrarily selected time windows. The processes of data preparation and results visualization are completely general and virtually applicable to any corpus of textual documents - thus suited for effective adaptation to other contexts.
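The abstract mentions "easy-to-drive statistical testing for analyzing the significance of topic emergence along arbitrarily selected time windows" without naming the test; the sketch below illustrates one plausible way to do this with a one-sided Fisher's exact test on topic counts inside vs. outside a window. The function and field names are assumptions, not the tool's actual implementation.

```python
from scipy.stats import fisher_exact

def topic_emergence_test(doc_dates, doc_topics, topic, window_start, window_end):
    """Test whether `topic` is over-represented inside a user-selected time window.

    doc_dates:  list of publication dates (e.g. datetime.date), one per document
    doc_topics: list of topic labels, aligned with doc_dates
    Returns the odds ratio and p-value of a one-sided Fisher's exact test on the
    2x2 table (this topic vs. other topics) x (inside vs. outside the window).
    """
    in_topic = in_other = out_topic = out_other = 0
    for date, t in zip(doc_dates, doc_topics):
        inside = window_start <= date <= window_end
        if t == topic:
            if inside: in_topic += 1
            else:      out_topic += 1
        else:
            if inside: in_other += 1
            else:      out_other += 1
    table = [[in_topic, in_other], [out_topic, out_other]]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value
```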
Evaluating Multi-Agent Coordination Abilities in Large Language Models
paper_authors: Saaket Agashe, Yue Fan, Xin Eric Wang
for: The goal of this work is to develop multi-agent agents that can collaborate effectively with humans and other systems.
methods: The study uses Large Language Models (LLMs), which can understand, generate, and interpret language in a human-like manner, to build multi-agent coordination agents.
results: Experiments show that LLM-based agents coordinate effectively in multi-agent scenarios, covering inference of a partner's intentions, reasoning about actions, sustained coordination, and robustness to unfamiliar partners. The study also finds that LLMs can offer useful proactive assistance in the Overcooked-AI benchmark and adapt quickly to new coordination scenarios.
Abstract
A pivotal aim in contemporary AI research is to develop agents proficient in multi-agent coordination, enabling effective collaboration with both humans and other systems. Large Language Models (LLMs), with their notable ability to understand, generate, and interpret language in a human-like manner, stand out as promising candidates for the development of such agents. In this study, we build and assess the effectiveness of agents crafted using LLMs in various coordination scenarios. We introduce the LLM-Coordination (LLM-Co) Framework, specifically designed to enable LLMs to play coordination games. With the LLM-Co framework, we conduct our evaluation with three game environments and organize the evaluation into five aspects: Theory of Mind, Situated Reasoning, Sustained Coordination, Robustness to Partners, and Explicit Assistance. First, the evaluation of the Theory of Mind and Situated Reasoning reveals the capabilities of LLM to infer the partner's intention and reason actions accordingly. Then, the evaluation around Sustained Coordination and Robustness to Partners further showcases the ability of LLMs to coordinate with an unknown partner in complex long-horizon tasks, outperforming Reinforcement Learning baselines. Lastly, to test Explicit Assistance, which refers to the ability of an agent to offer help proactively, we introduce two novel layouts into the Overcooked-AI benchmark, examining if agents can prioritize helping their partners, sacrificing time that could have been spent on their tasks. This research underscores the promising capabilities of LLMs in sophisticated coordination environments and reveals the potential of LLMs in building strong real-world agents for multi-agent coordination.
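The abstract does not spell out how the LLM-Co framework interfaces an LLM with a game, so the following is only a rough sketch of the general pattern: build a prompt asking the model to infer the partner's intention before committing to an action, then parse the chosen action. The prompt wording, function names, and the "ACTION:" convention are all assumptions.

```python
import re

def build_coordination_prompt(observation: str, legal_actions: list[str]) -> str:
    # Hypothetical prompt: ask the model to reason about the partner first, then act.
    return (
        "You are a cooperative agent playing with a partner.\n"
        f"Current observation:\n{observation}\n"
        f"Legal actions: {', '.join(legal_actions)}\n"
        "First, briefly state what your partner most likely intends to do.\n"
        "Then answer with exactly one line of the form: ACTION: <one legal action>"
    )

def parse_action(llm_output: str, legal_actions: list[str]) -> str:
    match = re.search(r"ACTION:\s*(.+)", llm_output)
    choice = match.group(1).strip() if match else ""
    # Fall back to a default action if the reply is not a legal action.
    return choice if choice in legal_actions else legal_actions[0]
```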
Trustworthy Formal Natural Language Specifications
results: Experiments show that the approach correctly translates a variety of English descriptions of formal specifications from a popular textbook into Lean formalizations, requiring only a modest lexicon with minor modifications related to lexicon size.
Abstract
Interactive proof assistants are computer programs carefully constructed to check a human-designed proof of a mathematical claim with high confidence in the implementation. However, this only validates truth of a formal claim, which may have been mistranslated from a claim made in natural language. This is especially problematic when using proof assistants to formally verify the correctness of software with respect to a natural language specification. The translation from informal to formal remains a challenging, time-consuming process that is difficult to audit for correctness. This paper shows that it is possible to build support for specifications written in expressive subsets of natural language, within existing proof assistants, consistent with the principles used to establish trust and auditability in proof assistants themselves. We implement a means to provide specifications in a modularly extensible formal subset of English, and have them automatically translated into formal claims, entirely within the Lean proof assistant. Our approach is extensible (placing no permanent restrictions on grammatical structure), modular (allowing information about new words to be distributed alongside libraries), and produces proof certificates explaining how each word was interpreted and how the sentence's structure was used to compute the meaning. We apply our prototype to the translation of various English descriptions of formal specifications from a popular textbook into Lean formalizations; all can be translated correctly with a modest lexicon with only minor modifications related to lexicon size.
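For readers unfamiliar with the target of such translations, here is a small illustrative Lean 4 claim of the kind a controlled-English sentence might elaborate to; it is not the paper's actual grammar or output, just an example of a formal specification stated and proved in Lean.

```lean
-- An English specification such as "the length of appending two lists equals the
-- sum of their lengths" corresponds (illustratively) to the following claim:
theorem append_length (xs ys : List Nat) : (xs ++ ys).length = xs.length + ys.length := by
  simp
```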
Automatic and Human-AI Interactive Text Generation
results: The tutorial surveys data, models, human-AI collaboration, and evaluation for text-to-text generation, and highlights recent advances such as non-retrogressive approaches, the shift from fine-tuning to prompting with large language models, new learnable metrics and fine-grained human evaluation frameworks, work on non-English languages, and interdisciplinary research toward real-world writing assistants that improve readability and linguistic style.
Abstract
In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks, that takes a piece of text as input and then generates a revision that is improved according to some specific criteria (e.g., readability or linguistic styles), while largely retaining the original meaning and the length of the text. This includes many useful applications, such as text simplification, paraphrase generation, style transfer, etc. In contrast to text summarization and open-ended text completion (e.g., story), the text-to-text generation tasks we discuss in this tutorial are more constrained in terms of semantic consistency and targeted language styles. This level of control makes these tasks ideal testbeds for studying the ability of models to generate text that is both semantically adequate and stylistically appropriate. Moreover, these tasks are interesting from a technical standpoint, as they require complex combinations of lexical and syntactical transformations, stylistic control, and adherence to factual knowledge, -- all at once. With a special focus on text simplification and revision, this tutorial aims to provide an overview of the state-of-the-art natural language generation research from four major aspects -- Data, Models, Human-AI Collaboration, and Evaluation -- and to discuss and showcase a few significant and recent advances: (1) the use of non-retrogressive approaches; (2) the shift from fine-tuning to prompting with large language models; (3) the development of new learnable metric and fine-grained human evaluation framework; (4) a growing body of studies and datasets on non-English languages; (5) the rise of HCI+NLP+Accessibility interdisciplinary research to create real-world writing assistant systems.
Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report
paper_authors: Jason Holmes, Lian Zhang, Yuzhen Ding, Hongying Feng, Zhengliang Liu, Tianming Liu, William W. Wong, Sujay A. Vora, Jonathan B. Ashman, Wei Liu
for: This study aims to use large language models (LLMs) to standardize structure names in radiation oncology and to establish a benchmark for future studies to reference.
methods: The Generative Pre-trained Transformer (GPT)-4 API was implemented as a DICOM storage server; upon receiving a structure set DICOM file, GPT-4 is prompted to re-label the structure names according to the American Association of Physicists in Medicine (AAPM) Task Group (TG)-263 standard. Three disease sites were selected: prostate, head and neck, and thorax. For each disease site, 150 patients were randomly selected for manually tuning the instruction prompt (in batches of 50) and 50 patients were randomly selected for evaluation.
results: The overall re-labeling accuracy for prostate, head and neck, and thorax cases was 96.0%, 98.5%, and 96.9%, respectively. Re-labeling of target volumes was less accurate on average, except for prostate: 100%, 93.1%, and 91.1%, respectively.
Abstract
Purpose: To introduce the concept of using large language models (LLMs) to re-label structure names in accordance with the American Association of Physicists in Medicine (AAPM) Task Group (TG)-263 standard, and to establish a benchmark for future studies to reference. Methods and Materials: The Generative Pre-trained Transformer (GPT)-4 application programming interface (API) was implemented as a Digital Imaging and Communications in Medicine (DICOM) storage server, which upon receiving a structure set DICOM file, prompts GPT-4 to re-label the structure names of both target volumes and normal tissues according to the AAPM TG-263. Three disease sites, prostate, head and neck, and thorax were selected for evaluation. For each disease site category, 150 patients were randomly selected for manually tuning the instructions prompt (in batches of 50) and 50 patients were randomly selected for evaluation. Structure names that were considered were those that were most likely to be relevant for studies utilizing structure contours for many patients. Results: The overall re-labeling accuracy of both target volumes and normal tissues for prostate, head and neck, and thorax cases was 96.0%, 98.5%, and 96.9% respectively. Re-labeling of target volumes was less accurate on average except for prostate - 100%, 93.1%, and 91.1% respectively. Conclusions: Given the accuracy of GPT-4 in re-labeling structure names of both target volumes and normal tissues as presented in this work, LLMs are poised to be the preferred method for standardizing structure names in radiation oncology, especially considering the rapid advancements in LLM capabilities that are likely to continue.
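As a minimal sketch of the prompting pattern described above, the snippet below asks GPT-4 to map structure names to TG-263-compliant names via the OpenAI chat completions API. The prompt wording, output format, and example structure names are assumptions; the study tunes its instruction prompt on batches of real structure sets inside a DICOM server.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def relabel_structures(structure_names: list[str]) -> dict[str, str]:
    """Ask GPT-4 to map radiotherapy structure names to AAPM TG-263 names."""
    prompt = (
        "Re-label each radiotherapy structure name below according to the AAPM "
        "TG-263 nomenclature. Reply with one line per structure in the form "
        "'original -> TG-263 name'.\n" + "\n".join(structure_names)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    mapping = {}
    for line in response.choices[0].message.content.splitlines():
        if "->" in line:
            original, new_name = (part.strip() for part in line.split("->", 1))
            mapping[original] = new_name
    return mapping

# Example call with hypothetical input names:
# relabel_structures(["rt parotid", "spinal cord", "PTV70"])
```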
Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
results: Significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
Abstract
Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation. In this work, we show that this type of approach can be further improved with multilingual training. We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
A Long Way to Go: Investigating Length Correlations in RLHF
results: RLHF generally improves model performance, but the study finds that much of the reported improvement is driven by increased output length; reward scores can be raised in large part simply by shifting the distribution over output lengths.
Abstract
Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
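The length-reward relationship the paper studies can be quantified with standard correlation statistics; the sketch below is an illustrative way to compute it over a batch of model outputs and reward-model scores (the tokenization by whitespace and the variable names are assumptions).

```python
from scipy.stats import pearsonr, spearmanr

def reward_length_correlation(outputs: list[str], rewards: list[float]):
    """Correlate reward-model scores with output length (whitespace tokens)."""
    lengths = [len(text.split()) for text in outputs]
    pearson_r, pearson_p = pearsonr(lengths, rewards)
    spearman_r, spearman_p = spearmanr(lengths, rewards)
    return {"pearson": (pearson_r, pearson_p), "spearman": (spearman_r, spearman_p)}
```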
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
results: Applying DecoderLens to question answering, logical reasoning, speech recognition, and machine translation models reveals that several specific subtasks are solved at low or intermediate layers, shedding new light on the information flow inside the encoder component.
Abstract
In recent years, many interpretability methods have been proposed to help interpret the internal states of Transformer-models, at different levels of precision and complexity. Here, to analyze encoder-decoder Transformers, we propose a simple, new method: DecoderLens. Inspired by the LogitLens (for decoder-only Transformers), this method involves allowing the decoder to cross-attend representations of intermediate encoder layers instead of using the final encoder output, as is normally done in encoder-decoder models. The method thus maps previously uninterpretable vector representations to human-interpretable sequences of words or symbols. We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation. The DecoderLens reveals several specific subtasks that are solved at low or intermediate layers, shedding new light on the information flow inside the encoder component of this important class of models.
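A minimal sketch of the core idea, assuming an off-the-shelf T5 model from Hugging Face Transformers: run the encoder with hidden states exposed and let the decoder generate from an intermediate encoder layer instead of the final one. The paper's actual implementation and the models it studies may differ.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def decoder_lens(text: str, encoder_layer: int) -> str:
    """Decode from an intermediate encoder layer instead of the final output."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        enc = model.encoder(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[i] is encoder layer i.
    intermediate = enc.hidden_states[encoder_layer]
    generated = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=intermediate),
        attention_mask=inputs["attention_mask"],
        max_new_tokens=32,
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# e.g. decoder_lens("translate English to German: The weather is nice.", encoder_layer=3)
```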
paper_authors: Oscar Sainz, Iker García-Ferrero, Rodrigo Agerri, Oier Lopez de Lacalle, German Rigau, Eneko Agirre
for: This paper aims to improve the generalization of large language models (LLMs) to unseen tasks, specifically in Information Extraction (IE).
methods: The paper proposes GoLLIE (Guideline-following Large Language Model for IE), which improves zero-shot results on unseen IE tasks by being fine-tuned to comply with annotation guidelines.
results: Experiments show that GoLLIE successfully follows unseen guidelines and outperforms previous attempts at zero-shot information extraction; an ablation study shows that detailed guidelines are key to good results.
Abstract
Large Language Models (LLMs) combined with instruction tuning have made significant progress when generalizing to unseen tasks. However, they have been less successful in Information Extraction (IE), lagging behind task-specific models. Typically, IE tasks are characterized by complex annotation guidelines which describe the task and give examples to humans. Previous attempts to leverage such information have failed, even with the largest models, as they are not able to follow the guidelines out-of-the-box. In this paper we propose GoLLIE (Guideline-following Large Language Model for IE), a model able to improve zero-shot results on unseen IE tasks by virtue of being fine-tuned to comply with annotation guidelines. Comprehensive evaluation empirically demonstrates that GoLLIE is able to generalize to and follow unseen guidelines, outperforming previous attempts at zero-shot information extraction. The ablation study shows that detailed guidelines is key for good results.
TRAM: Bridging Trust Regions and Sharpness Aware Minimization
paper_authors: Tom Sherborne, Naomi Saphra, Pradeep Dasigi, Hao Peng
for: Improving robustness to domain transfer and the generality of learned representations.
methods: Uses trust region bounds to inform SAM-style regularizers in both parameter and representation space, and proposes Trust Region Aware Minimization (TRAM), a fine-tuning algorithm that optimizes for flat minima and smooth, informative representations without forgetting pre-trained structure.
results: TRAM outperforms both sharpness-aware and trust region-based optimization methods on cross-domain language modeling and cross-lingual transfer, establishing a new standard for training generalizable models with minimal additional computation.
Abstract
By reducing the curvature of the loss surface in the parameter space, Sharpness-aware minimization (SAM) yields widespread robustness improvement under domain transfer. Instead of focusing on parameters, however, this work considers the transferability of representations as the optimization target for out-of-domain generalization in a fine-tuning setup. To encourage the retention of transferable representations, we consider trust region-based fine-tuning methods, which exploit task-specific skills without forgetting task-agnostic representations from pre-training. We unify parameter- and representation-space smoothing approaches by using trust region bounds to inform SAM-style regularizers on both of these optimization surfaces. We propose Trust Region Aware Minimization (TRAM), a fine-tuning algorithm that optimizes for flat minima and smooth, informative representations without forgetting pre-trained structure. We find that TRAM outperforms both sharpness-aware and trust region-based optimization methods on cross-domain language modeling and cross-lingual transfer, where robustness to domain transfer and representation generality are critical for success. TRAM establishes a new standard in training generalizable models with minimal additional computation.
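For context on the optimization being unified here, the sketch below shows one generic sharpness-aware minimization (SAM) update step in PyTorch: perturb the weights toward a local worst case, take the gradient there, then step from the original weights. It is a reference sketch of plain SAM only; TRAM's trust-region bounds on parameters and representations are not implemented, and `compute_loss` is a hypothetical closure over a batch.

```python
import torch

def sam_step(model, compute_loss, base_optimizer, rho=0.05):
    """One generic SAM update (not the paper's TRAM algorithm)."""
    # 1) Gradient at the current weights.
    compute_loss(model).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)
    scale = rho / (grad_norm + 1e-12)
    perturbations = []
    with torch.no_grad():
        for p in params:
            e = p.grad * scale
            p.add_(e)                  # climb to the approximate worst-case point
            perturbations.append((p, e))
    model.zero_grad()
    # 2) Gradient at the perturbed point (the sharpness-aware gradient).
    compute_loss(model).backward()
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                  # return to the original weights
    base_optimizer.step()              # update using the perturbed-point gradient
    model.zero_grad()
```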
Evaluating Self-Supervised Speech Representations for Indigenous American Languages
paper_authors: Chih-Chen Chen, William Chen, Rodolfo Zevallos, John E. Ortega
for: This paper studies self-supervised learning (SSL) techniques for automatic speech recognition (ASR).
methods: The paper benchmarks existing large-scale SSL models on low-resource ASR for Quechua and six other indigenous languages, and presents an ASR corpus for Quechua.
results: Results show surprisingly strong performance of state-of-the-art SSL models on low-resource ASR for Quechua and the other indigenous languages.
Abstract
The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In our submission to the New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR corpus for Quechua, an indigenous South American Language. We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR. Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.
Redefining Digital Health Interfaces with Large Language Models
results: By having LLMs use external tools, the paper addresses issues with applying LLMs in clinical settings, such as hallucinations, and illustrates the advantages of the new interface with examples from cardiovascular disease and diabetes risk prediction.
Abstract
Digital health tools have the potential to significantly improve the delivery of healthcare services. However, their use remains comparatively limited due, in part, to challenges surrounding usability and trust. Recently, Large Language Models (LLMs) have emerged as general-purpose models with the ability to process complex information and produce human-quality text, presenting a wealth of potential applications in healthcare. Directly applying LLMs in clinical settings is not straightforward, with LLMs susceptible to providing inconsistent or nonsensical answers. We demonstrate how LLMs can utilize external tools to provide a novel interface between clinicians and digital technologies. This enhances the utility and practical impact of digital healthcare tools and AI models while addressing current issues with using LLM in clinical settings such as hallucinations. We illustrate our approach with examples from cardiovascular disease and diabetes risk prediction, highlighting the benefit compared to traditional interfaces for digital tools.
PrIeD-KIE: Towards Privacy Preserved Document Key Information Extraction
results: By analyzing how various training and model parameters affect performance, the study proposes simple yet effective guidelines for achieving an optimal privacy-utility trade-off for KIE under global DP. It also introduces FeAm-DP, a novel DP-FL algorithm that efficiently upscales global DP to a multi-client federated environment; extensive evaluation across client and privacy settings shows that FeAm-DP achieves performance and privacy guarantees comparable to standalone DP even as the number of participating clients grows.
Abstract
In this paper, we introduce strategies for developing private Key Information Extraction (KIE) systems by leveraging large pretrained document foundation models in conjunction with differential privacy (DP), federated learning (FL), and Differentially Private Federated Learning (DP-FL). Through extensive experimentation on six benchmark datasets (FUNSD, CORD, SROIE, WildReceipts, XFUND, and DOCILE), we demonstrate that large document foundation models can be effectively fine-tuned for the KIE task under private settings to achieve adequate performance while maintaining strong privacy guarantees. Moreover, by thoroughly analyzing the impact of various training and model parameters on model performance, we propose simple yet effective guidelines for achieving an optimal privacy-utility trade-off for the KIE task under global DP. Finally, we introduce FeAm-DP, a novel DP-FL algorithm that enables efficiently upscaling global DP from a standalone context to a multi-client federated environment. We conduct a comprehensive evaluation of the algorithm across various client and privacy settings, and demonstrate its capability to achieve comparable performance and privacy guarantees to standalone DP, even when accommodating an increasing number of participating clients. Overall, our study offers valuable insights into the development of private KIE systems, and highlights the potential of document foundation models for privacy-preserved Document AI applications. To the best of authors' knowledge, this is the first work that explores privacy preserved document KIE using document foundation models.
Controllable Multi-document Summarization: Coverage & Coherence Intuitive Policy with Large Language Model Based Rewards
results: In evaluation with ROUGE metrics and human judgments, the approach outperforms baselines and shows particularly strong coherence.
Abstract
Memory-efficient large language models are good at refining text input for better readability. However, controllability is a matter of concern when it comes to text generation tasks with long inputs, such as multi-document summarization. In this work, we investigate for a generic controllable approach for multi-document summarization that leverages the capabilities of LLMs to refine the text. In particular, we train a controllable content extraction scheme to extract the text that will be refined by an LLM. The scheme is designed with a novel coverage and coherence intuitive policy, which is duly rewarded by a passively trained LLM. Our approach yields competitive results in the evaluation using ROUGE metrics and outperforms potential baselines in coherence, as per human evaluation.
The North System for Formosa Speech Recognition Challenge 2023
results: A demonstration of the system has been made public at https://asrvm.iis.sinica.edu.tw/hakka_sixian.
Abstract
This report provides a concise overview of the proposed North system, which aims to achieve automatic word/syllable recognition for Taiwanese Hakka (Sixian). The report outlines three key components of the system: the acquisition, composition, and utilization of the training data; the architecture of the model; and the hardware specifications and operational statistics. The demonstration of the system has been made public at https://asrvm.iis.sinica.edu.tw/hakka_sixian.
Neural Language Model Pruning for Automatic Speech Recognition
results: Key findings include: a) data-driven pruning outperforms magnitude-driven pruning in several scenarios; b) incremental pruning achieves higher accuracy than one-shot pruning, especially when targeting smaller model sizes; and c) low-rank approximation offers the best trade-off between size reduction and inference speed-up at moderate compression.
Abstract
We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning frame work, namely criterion, method and scheduler, analyzing their contribution in terms of accuracy and inference speed. To the best of our knowledge, such in-depth analyses on large-scale recognition systems has not been reported in the literature. In addition, we propose a variant of low-rank approximation suitable for incrementally compressing models, and delivering multiple models with varied target sizes. Among other results, we show that a) data-driven pruning outperforms magnitude-driven in several scenarios; b) incremental pruning achieves higher accuracy compared to one-shot pruning, especially when targeting smaller sizes; and c) low-rank approximation presents the best trade-off between size reduction and inference speed-up for moderate compression.
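Two of the compression building blocks compared above can be illustrated on a single weight matrix: magnitude pruning zeroes the smallest entries, and a truncated SVD gives a low-rank factorization. This is a generic sketch only; the paper's data-driven criteria, schedulers, and incremental low-rank variant are not reproduced here.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight matrix."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def low_rank_factors(weight: torch.Tensor, rank: int):
    """Approximate W (out x in) as A @ B with A: (out x rank), B: (rank x in)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb singular values into A
    B = Vh[:rank, :]
    return A, B

W = torch.randn(512, 512)
W_sparse = magnitude_prune(W, sparsity=0.5)
A, B = low_rank_factors(W, rank=64)
print((W - A @ B).norm() / W.norm())    # relative approximation error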
results: Evaluation shows that the method accurately captures the core information of the main event across a group of related news documents while keeping the summary accurate and complete.
Abstract
Multi-document summarization is a challenging task due to its inherent subjective bias, highlighted by the low inter-annotator ROUGE-1 score of 0.4 among DUC-2004 reference summaries. In this work, we aim to enhance the objectivity of news summarization by focusing on the main event of a group of related news documents and presenting it coherently with sufficient context. Our primary objective is to succinctly report the main event, ensuring that the summary remains objective and informative. To achieve this, we employ an extract-rewrite approach that incorporates a main-event biased monotone-submodular function for content selection. This enables us to extract the most crucial information related to the main event from the document cluster. To ensure coherence, we utilize a fine-tuned Language Model (LLM) for rewriting the extracted content into a coherent text. The evaluation using objective metrics and human evaluators confirms the effectiveness of our approach, as it surpasses potential baselines, demonstrating excellence in both content coverage, coherence, and informativeness.
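The content-selection step described above relies on greedily maximizing a monotone-submodular objective; the sketch below shows the generic greedy loop with a user-supplied stand-in scoring function. The paper's main-event-biased function, its reward from a passively trained LLM, and the LLM rewriting stage are not implemented here.

```python
def greedy_submodular_select(sentences, score_fn, budget):
    """Greedily pick sentences maximizing a monotone submodular objective.

    score_fn(selected_list) -> float should be monotone submodular (e.g. a
    coverage function biased toward main-event terms); here it is a hypothetical
    stand-in. `budget` is the maximum number of sentences to select.
    """
    selected = []
    remaining = list(sentences)
    current = score_fn(selected)
    for _ in range(budget):
        best_gain, best_sentence = 0.0, None
        for s in remaining:
            gain = score_fn(selected + [s]) - current
            if gain > best_gain:
                best_gain, best_sentence = gain, s
        if best_sentence is None:      # no positive marginal gain left
            break
        selected.append(best_sentence)
        remaining.remove(best_sentence)
        current += best_gain
    return selected
```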
Evaluating Hallucinations in Chinese Large Language Models
results: Extensive experiments on 24 large language models show that 18 of them achieve non-hallucination rates below 50%, indicating that HalluQA is highly challenging. The paper analyzes the primary types of hallucinations in different kinds of models and their causes, and discusses which hallucination types should be prioritized for each kind of model.
Abstract
In this paper, we establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models. HalluQA contains 450 meticulously designed adversarial questions, spanning multiple domains, and takes into account Chinese historical culture, customs, and social phenomena. During the construction of HalluQA, we consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT. For evaluation, we design an automated evaluation method using GPT-4 to judge whether a model output is hallucinated. We conduct extensive experiments on 24 large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk and etc. Out of the 24 models, 18 achieved non-hallucination rates lower than 50%. This indicates that HalluQA is highly challenging. We analyze the primary types of hallucinations in different types of models and their causes. Additionally, we discuss which types of hallucinations should be prioritized for different types of models.
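The automated evaluation uses GPT-4 as a judge, but the abstract does not give the judging prompt; the snippet below is a hypothetical sketch that only builds a plausible judging prompt and parses a binary verdict, leaving the actual API call to the caller.

```python
def build_hallucination_judge_prompt(question: str, reference: str, answer: str) -> str:
    # Hypothetical wording; the benchmark's actual GPT-4 judging prompt may differ.
    return (
        "You are grading a Chinese question-answering model for hallucination.\n"
        f"Question: {question}\n"
        f"Reference facts: {reference}\n"
        f"Model answer: {answer}\n"
        "Does the model answer contain hallucinated content, i.e. imitative "
        "falsehoods or factual errors contradicting the reference? Reply with "
        "exactly 'HALLUCINATED' or 'NOT HALLUCINATED'."
    )

def parse_verdict(judge_reply: str) -> bool:
    """Return True if the judge labels the answer as hallucinated."""
    reply = judge_reply.strip().upper()
    return "HALLUCINATED" in reply and "NOT HALLUCINATED" not in reply
```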
Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise
results: Compared with direct generation by GPT-4, the method improves accuracy by 33.3% on four Chinese legal tasks in the zero-shot setting, and outperforms two stronger retrieval-based baselines by 15.4% and 23.9%.
Abstract
While large language models (LLMs) like GPT-4 have recently demonstrated astonishing zero-shot capabilities in general domain tasks, they often generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas. This is typically due to the absence of training data that encompasses such a specific domain, preventing GPT-4 from acquiring in-domain knowledge. A pressing challenge is that it's not plausible to continue training LLMs of such scale on in-domain data. This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an adapt-retrieve-revise process. The initial step is to adapt an affordable 7B LLM to the target domain by continuing learning on in-domain data. When solving a task, we leverage the adapted LLM to generate a draft answer given a task query. Then, the draft answer will be used to retrieve supporting evidence candidates from an external in-domain knowledge base. Finally, the draft answer and retrieved evidence are concatenated into a whole prompt to let GPT-4 assess the evidence and revise the draft answer to generate the final answer. Our proposal combines the advantages of the efficiency of adapting a smaller 7B model with the evidence-assessing capability of GPT-4 and effectively prevents GPT-4 from generating hallucinatory content. In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3% compared to the direct generation by GPT-4. When compared to two stronger retrieval-based baselines, our method outperforms them by 15.4% and 23.9%. Our code will be released.
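A sketch of the three stages as described in the abstract, with the model and retrieval backends abstracted away: `adapted_llm_generate`, `retrieve_evidence`, and `gpt4_generate` are hypothetical callables standing in for the domain-adapted 7B model, the in-domain knowledge base, and GPT-4; the revision prompt wording is an assumption.

```python
def adapt_retrieve_revise(query, adapted_llm_generate, retrieve_evidence, gpt4_generate, top_k=5):
    """Sketch of the adapt-retrieve-revise loop.

    adapted_llm_generate(query) -> draft answer from the domain-adapted 7B model
    retrieve_evidence(text, k)  -> list of supporting passages from the knowledge base
    gpt4_generate(prompt)       -> GPT-4 completion
    """
    # Step 1 (adapt): the continually pre-trained small model drafts an answer.
    draft = adapted_llm_generate(query)
    # Step 2 (retrieve): the draft, rather than the query alone, fetches evidence.
    evidence = retrieve_evidence(draft, top_k)
    # Step 3 (revise): GPT-4 assesses the evidence and revises the draft.
    prompt = (
        f"Question: {query}\n"
        f"Draft answer: {draft}\n"
        "Evidence:\n" + "\n".join(f"- {passage}" for passage in evidence) + "\n"
        "Assess the evidence and revise the draft into a final, well-supported answer."
    )
    return gpt4_generate(prompt)
```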
results: Experiments show that PERSE outperforms GPT-4 by 15.8% on Kendall correlation of story ratings and by 13.7% on pairwise preference prediction accuracy. Both datasets and code will be released.
Abstract
While large language models (LLMs) have shown impressive results for more objective tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open-ended text generation for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectiveness stemming from reviewers' personal preferences. To address such issues, we propose to model personalization in an uncontaminated open-ended generation assessment. We create two new datasets Per-MPST and Per-DOC for personalized story evaluation, by re-purposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model PERSE to infer reviewer preferences and provide a personalized evaluation. Specifically, given a few exemplary reviews from a particular reviewer, PERSE predicts either a detailed review or fine-grained comparison in several aspects (such as interestingness and surprise) for that reviewer on a new text input. Experimental results show that PERSE outperforms GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on pairwise preference prediction accuracy. Both datasets and code will be released.
A New Dialogue Response Generation Agent for Large Language Models by Asking Questions to Detect User’s Intentions
results: On two task-oriented dialogue tasks (Wizard of Wikipedia and Holl-E), EDIT outperforms other LLMs, improving the quality and accuracy of dialogue responses.
Abstract
Large Language Models (LLMs), such as ChatGPT, have recently been applied to various NLP tasks due to their open-domain generation capabilities. However, there are two issues with applying LLMs to dialogue tasks. 1. During the dialogue process, users may have implicit intentions that might be overlooked by LLMs. Consequently, generated responses couldn't align with the user's intentions. 2. It is unlikely for LLMs to encompass all fields comprehensively. In certain specific domains, their knowledge may be incomplete, and LLMs cannot update the latest knowledge in real-time. To tackle these issues, we propose a framework using LLM to Enhance dialogue response generation by asking questions to Detect user's Implicit inTentions (EDIT). Firstly, EDIT generates open questions related to the dialogue context as the potential user's intention; then, EDIT answers those questions by interacting with LLMs and searching in domain-specific knowledge bases respectively, and uses LLMs to choose the proper answers to questions as extra knowledge; finally, EDIT enhances response generation by explicitly integrating that extra knowledge. Besides, previous question generation works only focus on asking questions with answers in context. In order to ask open questions, we construct a Context-Open-Question (COQ) dataset. On two task-oriented dialogue tasks (Wizard of Wikipedia and Holl-E), EDIT outperformed other LLMs.
A Formalism and Approach for Improving Robustness of Large Language Models Using Risk-Adjusted Confidence Scores
for: This paper aims to provide a systematic understanding of the risks posed by large language models (LLMs) in natural language inference (NLI) tasks, and to propose a risk-centric evaluation framework and a risk-adjusted calibration method to mitigate these risks.
methods: The paper defines and formalizes two types of risk in LLMs, decision risk and composite risk, and proposes four novel metrics for assessing these risks in both in-domain and out-of-domain settings. The proposed risk-adjusted calibration method, called DwD, helps LLMs minimize these risks in an overall NLI architecture.
results: The paper presents detailed experiments using four NLI benchmarks, three baselines, and two LLMs, including ChatGPT, to demonstrate the practical utility of the evaluation framework and the efficacy of DwD in reducing decision and composite risk. For instance, the paper shows that DwD can help an underlying LLM address an extra 20.1% of low-risk inference tasks and skip a further 19.8% of high-risk tasks, which would have been answered incorrectly without risk adjustment.
Abstract
Large Language Models (LLMs), such as ChatGPT, have achieved impressive milestones in natural language processing (NLP). Despite their impressive performance, the models are known to pose important risks. As these models are deployed in real-world applications, a systematic understanding of different risks posed by these models on tasks such as natural language inference (NLI), is much needed. In this paper, we define and formalize two distinct types of risk: decision risk and composite risk. We also propose a risk-centric evaluation framework, and four novel metrics, for assessing LLMs on these risks in both in-domain and out-of-domain settings. Finally, we propose a risk-adjusted calibration method called DwD for helping LLMs minimize these risks in an overall NLI architecture. Detailed experiments, using four NLI benchmarks, three baselines and two LLMs, including ChatGPT, show both the practical utility of the evaluation framework, and the efficacy of DwD in reducing decision and composite risk. For instance, when using DwD, an underlying LLM is able to address an extra 20.1% of low-risk inference tasks (but which the LLM erroneously deems high-risk without risk adjustment) and skip a further 19.8% of high-risk tasks, which would have been answered incorrectly.
Investigating Alternative Feature Extraction Pipelines For Clinical Note Phenotyping
results: The alternative pipeline reduces computation time and can also detect medical conditions that are not already present in a clinical note, whereas the ClinicalBERT and LSTM approach requires extensive convergence time; the alternative methods moderately underperform the replicated LSTM approach in accuracy.
Abstract
A common practice in the medical industry is the use of clinical notes, which consist of detailed patient observations. However, electronic health record systems frequently do not contain these observations in a structured format, rendering patient information challenging to assess and evaluate automatically. Using computational systems for the extraction of medical attributes offers many applications, including longitudinal analysis of patients, risk assessment, and hospital evaluation. Recent work has constructed successful methods for phenotyping: extracting medical attributes from clinical notes. BERT-based models can be used to transform clinical notes into a series of representations, which are then condensed into a single document representation based on their CLS embeddings and passed into an LSTM (Mulyar et al., 2020). Though this pipeline yields a considerable performance improvement over previous results, it requires extensive convergence time. This method also does not allow for predicting attributes not yet identified in clinical notes. Considering the wide variety of medical attributes that may be present in a clinical note, we propose an alternative pipeline utilizing ScispaCy (Neumann et al., 2019) for the extraction of common diseases. We then train various supervised learning models to associate the presence of these conditions with patient attributes. Finally, we replicate a ClinicalBERT (Alsentzer et al., 2019) and LSTM-based approach for purposes of comparison. We find that alternative methods moderately underperform the replicated LSTM approach. Yet, considering a complex tradeoff between accuracy and runtime, in addition to the fact that the alternative approach also allows for the detection of medical conditions that are not already present in a clinical note, its usage may be considered as a supplement to established methods.
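A minimal sketch of the scispaCy extraction step in the alternative pipeline, assuming the `en_core_sci_sm` model package has been installed separately; the example note, the mapping of mentions to conditions, and the downstream supervised classifiers are not shown and are assumptions.

```python
import spacy

# Requires `pip install scispacy` plus a scispaCy model package such as
# en_core_sci_sm, installed separately from the scispaCy release page.
nlp = spacy.load("en_core_sci_sm")

note = "Patient with a history of type 2 diabetes mellitus and hypertension, denies chest pain."
doc = nlp(note)

# en_core_sci_sm tags biomedical mentions with a generic ENTITY label; these
# mentions would then be matched against a disease vocabulary and used as
# features for the supervised phenotyping classifiers.
mentions = [ent.text for ent in doc.ents]
print(mentions)
```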
InstructProtein: Aligning Human and Protein Language via Knowledge Instruction
results: Experiments show that InstructProtein outperforms existing LLMs by large margins on bidirectional protein-text generation tasks. It also enables text-based protein function prediction and sequence design, helping to bridge the gap between protein and human language understanding.
Abstract
Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins. To address this challenge, we propose InstructProtein, an innovative LLM that possesses bidirectional generation capabilities in both human and protein languages: (i) taking a protein sequence as input to predict its textual function description and (ii) using natural language to prompt protein sequence generation. To achieve this, we first pre-train an LLM on both protein and natural language corpora, enabling it to comprehend individual languages. Then supervised instruction tuning is employed to facilitate the alignment of these two distinct languages. Herein, we introduce a knowledge graph-based instruction generation framework to construct a high-quality instruction dataset, addressing annotation imbalance and instruction deficits in existing protein-text corpus. In particular, the instructions inherit the structural relations between proteins and function annotations in knowledge graphs, which empowers our model to engage in the causal modeling of protein functions, akin to the chain-of-thought processes in natural languages. Extensive experiments on bidirectional protein-text generation tasks show that InstructProtein outperforms state-of-the-art LLMs by large margins. Moreover, InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design, effectively bridging the gap between protein and human language understanding.
Unlock Predictable Scaling from Emergent Abilities
results: Small models show critical and consistent task performance improvements that conventional evaluation strategies fail to capture. Using the PassUntil evaluation strategy, the paper identifies a strict task scaling law that makes task performance predictable, and proposes a mathematical definition of emergent abilities that refutes a prevalent multi-step reasoning hypothesis.
Abstract
The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on the scaling properties only yields an incomplete answer: optimization loss decreases predictably as the model size increases, in line with established scaling law; yet no scaling law for task has been established and the task performances are far from predictable during scaling. Task performances typically show minor gains on small models until they improve dramatically once models exceed a size threshold, exemplifying the ``emergent abilities''. In this study, we discover that small models, although they exhibit minor performance, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce PassUntil, an evaluation strategy through massive sampling in the decoding phase. We conduct quantitative investigations into the scaling law of task performance. Firstly, a strict task scaling law is identified, enhancing the predictability of task performances. Remarkably, we are able to predict the performance of the 2.4B model on code generation with merely 0.05\% deviation before training starts. Secondly, underpinned by PassUntil, we observe concrete evidence of emergent abilities and ascertain that they are not in conflict with the continuity of performance improvement. Their semblance to break-through is that their scaling curve cannot be fitted by standard scaling law function. We then introduce a mathematical definition for the emergent abilities. Through the definition, we refute a prevalent ``multi-step reasoning hypothesis'' regarding the genesis of emergent abilities and propose a new hypothesis with a satisfying fit to the observed scaling curve.
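The precise PassUntil estimator is not specified in the abstract; the sketch below only illustrates the basic idea of measuring a very small per-sample pass rate through massive sampling in the decoding phase, with `generate` and `is_correct` as hypothetical callables and the sample-until-first-pass variant as an assumption.

```python
def estimate_pass_rate(generate, is_correct, prompt, num_samples=10000,
                       stop_after_first_pass=False):
    """Estimate the probability that a single decoded sample solves the task.

    generate(prompt) -> one sampled completion; is_correct(completion) -> bool.
    With stop_after_first_pass=True, sampling stops at the first success and
    1/draws is returned as a rough estimate, mirroring a sample-until-pass scheme.
    """
    successes = 0
    for draw in range(1, num_samples + 1):
        if is_correct(generate(prompt)):
            successes += 1
            if stop_after_first_pass:
                return 1.0 / draw
    return successes / num_samples
```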
Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning
results: Using different few-shot prompting methodologies and fine-tuning, the study finds that few-shot GPT-4 shows promise in spatial reasoning when prompted to reason and act interleavedly, but it still fails at long-term temporal reasoning. In contrast, fine-tuned LLMs perform well on in-distribution reasoning tasks but struggle to generalize to larger environments or environments with more obstacles.
Abstract
Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, in this work, we propose a new benchmark, termed Path Planning from Natural Language (PPNL). Our benchmark evaluates LLMs' spatial-temporal reasoning by formulating "path planning" tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, we systematically investigate LLMs including GPT-4 via different few-shot prompting methodologies and BART and T5 of various sizes via fine-tuning. Our experimental results show the promise of few-shot GPT-4 in spatial reasoning, when it is prompted to reason and act interleavedly, although it still fails to make long-term temporal reasoning. In contrast, while fine-tuned LLMs achieved impressive results on in-distribution reasoning tasks, they struggled to generalize to larger environments or environments with more obstacles.
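Evaluating an LLM's proposed plan on such a benchmark amounts to checking the path against the grid; the sketch below is an illustrative validator under assumed conventions (row/column coordinates, four-connected moves, an explicit obstacle list), not the benchmark's actual scoring code.

```python
def is_valid_path(path, grid_size, obstacles, start, goal):
    """Validate a proposed path of (row, col) cells on a grid-world instance."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = grid_size
    obstacle_set = set(obstacles)
    for (r, c) in path:
        if not (0 <= r < rows and 0 <= c < cols) or (r, c) in obstacle_set:
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:   # must move to an adjacent cell
            return False
    return True

# Hypothetical instance: 4x4 grid, obstacle at (1, 1), start (0, 0), goal (3, 3).
print(is_valid_path([(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2), (3, 3)],
                    (4, 4), [(1, 1)], (0, 0), (3, 3)))
```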
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
results: Through human evaluations involving more than 50K judgments, the authors find that all models struggle on questions involving fast-changing knowledge and false premises, while the FreshPrompt method substantially boosts LLM performance and outperforms both competing search engine-augmented prompting methods and commercial systems such as Perplexity.AI.
Abstract
Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
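Retrieval itself (a search engine API) is outside this sketch; the function below only formats already-retrieved evidences into a FreshPrompt-style query, placing the most relevant and recent results closest to the question and asking for a concise answer, in the spirit of the abstract's findings about evidence order and answer brevity. The field names and prompt wording are assumptions.

```python
def build_fresh_prompt(question, evidences):
    """Format retrieved evidences into a FreshPrompt-style query.

    evidences: list of dicts with keys 'source', 'date', 'snippet', already
    retrieved from a search engine and ordered so the most relevant/recent
    entries come last (closest to the question).
    """
    lines = ["Answer the question using the search results below. "
             "Pay attention to dates and give a short, direct answer."]
    for ev in evidences:
        lines.append(f"[{ev['date']}] {ev['source']}: {ev['snippet']}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```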