paper_authors: Erfan Al-Hossami, Razvan Bunescu, Justin Smith, Ryan Teehan
for: This paper aims to develop an automated Socratic teaching bot that helps novice programmers debug buggy solutions.
methods: The study introduces a manually created dataset of multi-turn Socratic advice and evaluates a range of language models on it, including Flan-T5 and GPT-4.
results: The study finds that automated Socratic teaching bots show promise for improving learning outcomes, but more data and better evaluation methods are needed for further improvement.
Abstract
When employing the Socratic method of teaching, instructors guide students toward solving a problem on their own rather than providing the solution directly. While this strategy can substantially improve learning outcomes, it is usually time-consuming and cognitively demanding. Automated Socratic conversational agents can augment human instruction and provide the necessary scale; however, their development is hampered by the lack of suitable data for training and evaluation. In this paper, we introduce a manually created dataset of multi-turn Socratic advice that is aimed at helping a novice programmer fix buggy solutions to simple computational problems. The dataset is then used for benchmarking the Socratic debugging abilities of a number of language models, ranging from fine-tuning the instruction-based text-to-text transformer Flan-T5 to zero-shot and chain-of-thought prompting of the much larger GPT-4. The code and datasets are made freely available for research at the link below. https://github.com/taisazero/socratic-debugging-benchmark
The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing Practices
results: The study finds that in physics, math, and computer science, data and method link-sharing practices have spread over time, with a growing number of papers including such links. Reuse of these links across papers is also increasing, especially in computer science. In addition, articles that share data and method links receive more citations, with a stronger effect when the links are active.
Abstract
In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at scale has proven difficult. In this work, we leverage a large-scale dataset of 1.1M papers from arXiv that are representative of the fields of physics, math, and computer science to analyze the adoption of data and method link-sharing practices over time and their impact on article reception. To identify links to data and methods, we train a neural text classification model to automatically classify URL types based on contextual mentions in papers. We find evidence that the practice of link-sharing to methods and data is spreading as more papers include such URLs over time. Reproducibility efforts may also be spreading because the same links are being increasingly reused across papers (especially in computer science); and these links are increasingly concentrated within fewer web domains (e.g. Github) over time. Lastly, articles that share data and method links receive increased recognition in terms of citation count, with a stronger effect when the shared links are active (rather than defunct). Together, these findings demonstrate the increased spread and perceived value of data and method sharing practices in open science.
Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference
results: Based on a multi-condition evaluation, the authors find that the RAG system can improve response quality for middle-school math QA, but system designers must balance how closely responses are grounded in educational resources against student preferences.
Abstract
For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) has led to a surge of interest in automating portions of the tutoring process - including interactive QA to support conceptual discussion of mathematical concepts. However, LLM responses to math questions can be incorrect or mismatched to the educational context - such as being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we designed prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.
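As an illustration of the retrieval-augmented prompting pattern this abstract describes, a minimal sketch follows; the embedding function, textbook chunks, and prompt wording are assumptions for illustration, not the authors' actual components.

```python
# Hypothetical RAG sketch: rank textbook chunks by similarity to the student
# question, then condition the tutor prompt on the top passages.
import numpy as np

def retrieve(question_vec, chunk_vecs, chunks, k=3):
    # Cosine similarity between the question embedding and each chunk embedding.
    sims = chunk_vecs @ question_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

def build_prompt(question, passages):
    context = "\n\n".join(passages)
    return ("You are a middle-school math tutor. Use the textbook excerpts "
            "below when relevant, but answer in your own words.\n\n"
            f"Textbook excerpts:\n{context}\n\n"
            f"Student question: {question}\nAnswer:")
```

The trade-off the paper highlights would then show up in how strongly the prompt instructs the model to stay within the retrieved excerpts.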
Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models
results: The paper systematically evaluates the method on eight medical image classification datasets, showing that it mitigates spurious confounding factors and outperforms standard visual encoders and other baselines. Case studies on real medical data further illustrate the interpretability the method brings to medical image classification.
Abstract
Medical image classification is a critical problem for healthcare, with the potential to alleviate the workload of doctors and facilitate diagnoses of patients. However, two challenges arise when deploying deep learning models to real-world healthcare applications. First, neural models tend to learn spurious correlations instead of desired features, which could fall short when generalizing to new domains (e.g., patients with different ages). Second, these black-box models lack interpretability. When making diagnostic predictions, it is important to understand why a model makes a decision for trustworthy and safety considerations. In this paper, to address these two limitations, we propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts. Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model. We systematically evaluate our method on eight medical image classification datasets to verify its effectiveness. On challenging datasets with strong confounding factors, our method can mitigate spurious correlations thus substantially outperform standard visual encoders and other baselines. Finally, we show how classification with a small number of concepts brings a level of interpretability for understanding model decisions through case studies in real medical data.
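As a sketch of the concept-bottleneck recipe this abstract outlines (query a list of clinical concepts, then score images against them with a vision-language model), the snippet below uses open_clip similarity scores as the explicit concept features; the concept list, checkpoint, and linear head are illustrative assumptions, not the paper's actual components.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Example concepts; the paper obtains clinical concepts by querying GPT-4.
concepts = ["irregular border", "asymmetric shape", "uniform color"]

@torch.no_grad()
def concept_scores(image):  # image: preprocessed tensor of shape [3, H, W]
    img = model.encode_image(image.unsqueeze(0))
    txt = model.encode_text(tokenizer(concepts))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return img @ txt.T  # [1, n_concepts]: explicit, human-readable features

# A small linear head over concept scores yields an interpretable classifier:
# each weight relates a named clinical concept to a diagnosis.
head = torch.nn.Linear(len(concepts), 2)
```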
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis
paper_authors: Zishun Yu, Yunzhe Tao, Liyu Chen, Tao Sun, Hongxia Yang
for: This paper focuses on program synthesis, which aims to generate accurate and executable code from natural language descriptions. The authors explore the use of reinforcement learning (RL) and large language models (LLMs) to enhance code generation capabilities.
methods: The authors propose a value-based approach to program synthesis, which differs from the predominant policy-based methods. They develop a novel RL agent called $\mathcal{B}$-Coder, which leverages pre-trained LLMs and a conservative Bellman operator to reduce training complexities.
results: The authors demonstrate the effectiveness of their approach through empirical evaluations, achieving state-of-the-art performance compared to policy-based methods. Notably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL.
Abstract
Program synthesis aims to create accurate, executable code from natural language descriptions. This field has leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. This integration focuses on directly optimizing functional correctness, transcending conventional supervised losses. While current literature predominantly favors policy-based algorithms, attributes of program synthesis suggest a natural compatibility with value-based methods. This stems from rich collection of off-policy programs developed by human programmers, and the straightforward verification of generated programs through automated unit testing (i.e. easily obtainable rewards in RL language). Diverging from the predominant use of policy-based algorithms, our work explores the applicability of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we propose an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance compared with policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
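For reference, a minimal statement of the standard Bellman optimality operator from which $\mathcal{B}$-Coder takes its name; the paper's conservative variant modifies this operator to stabilize training, and its exact form follows the paper rather than this sketch.

```latex
(\mathcal{B}Q)(s, a) = r(s, a)
  + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
      \Big[\max_{a'} Q(s', a')\Big]
```

Here $r$ is the reward (in this setting, easily obtainable from automated unit tests), $\gamma$ the discount factor, and $Q$ the value function over partial-program states and token actions.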
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
paper_authors: Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, Lichao Sun
for: This paper aims to evaluate the tool usage awareness and selection ability of large language models (LLMs) in various scenarios, with the goal of determining whether LLMs can effectively serve as intelligent agents.
methods: The authors create a benchmark called MetaTool, which includes a dataset called ToolE that contains various user queries in the form of prompts that trigger LLMs to use tools. They define four subtasks for tool selection and conduct experiments involving nine popular LLMs.
results: The majority of the LLMs struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through error analysis, the authors found significant room for improvement. The paper provides insights for tool developers to enhance the tool selection performance of LLMs.
Abstract
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs. They primarily investigated how LLMs effectively collaborate with given specific tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios. Subsequently, we set the tasks for both tool usage awareness and tool selection. We define four subtasks from different perspectives in tool selection, including tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving nine popular LLMs and find that the majority of them still struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through the error analysis, we found there is still significant room for improvement. Finally, we conclude with insights for tool developers that follow ChatGPT to provide detailed descriptions that can enhance the tool selection performance of LLMs.
Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages
methods: We use language modeling on discrete units as a baseline system to assess, in a zero-resource manner, the code-switching capabilities of self-supervised speech encoders.
results: Our experiments cover a variety of well-known speech encoders, including Wav2vec 2.0 and HuBERT. We find that encoders with multilingual pre-training (e.g., XLSR) perform better in code-switching scenarios, but there is still substantial room for improvement in their code-switching linguistic abilities.
Abstract
We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed in a zero-resource manner. Our experiments encompass a variety of well-known speech encoders, including Wav2vec 2.0, HuBERT, XLSR, etc. We examine the impact of pre-training languages and model size on benchmark performance. Notably, though our results demonstrate that speech encoders with multilingual pre-training, exemplified by XLSR, outperform monolingual variants (Wav2vec 2.0, HuBERT) in code-switching scenarios, there is still substantial room for improvement in their code-switching linguistic abilities.
Multimodal Question Answering for Unified Information Extraction
results: Extensive experiments on six datasets show that our MQA framework effectively improves the performance of large multimodal models (LMMs) across different tasks and settings, and outperforms previous baselines by a large margin in the zero-shot setting.
Abstract
Multimodal information extraction (MIE) aims to extract structured information from unstructured multimedia content. Due to the diversity of tasks and settings, most current MIE models are task-specific and data-intensive, which limits their generalization to real-world scenarios with diverse task requirements and limited labeled data. To address these issues, we propose a novel multimodal question answering (MQA) framework to unify three MIE tasks by reformulating them into a unified span extraction and multi-choice QA pipeline. Extensive experiments on six datasets show that: 1) Our MQA framework consistently and significantly improves the performances of various off-the-shelf large multimodal models (LMM) on MIE tasks, compared to vanilla prompting. 2) In the zero-shot setting, MQA outperforms previous state-of-the-art baselines by a large margin. In addition, the effectiveness of our framework can successfully transfer to the few-shot setting, enhancing LMMs on a scale of 10B parameters to be competitive or outperform much larger language models such as ChatGPT and GPT-4. Our MQA framework can serve as a general principle of utilizing LMMs to better solve MIE and potentially other downstream multimodal tasks.
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions
results: The study finds that Transformers can nearly match the optimal learning algorithm on simpler tasks, while their performance deteriorates on more complex tasks; moreover, certain attention-free models perform almost identically to Transformers. When provided a teaching sequence, Transformers learn more sample-efficiently. Finally, the study finds that existing LLMs can compete with nearest-neighbor baselines.
Abstract
In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set.
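A small sketch of the stylized setup this abstract describes: sample a Boolean function from a class (here, conjunctions over a few variables) and format (input, label) pairs as an in-context prompt for the model. The formatting and the choice of class are illustrative, not the paper's exact protocol.

```python
import random

def sample_conjunction(n_vars, k):
    # A conjunction over k randomly chosen variables: f(x) = AND of x[i].
    lits = random.sample(range(n_vars), k)
    return (lambda x: all(x[i] for i in lits)), lits

def make_prompt(f, n_vars, n_examples, query):
    lines = []
    for _ in range(n_examples):
        x = [random.randint(0, 1) for _ in range(n_vars)]
        lines.append(f"{''.join(map(str, x))} -> {int(f(x))}")
    lines.append(f"{''.join(map(str, query))} ->")  # model completes the label
    return "\n".join(lines)

f, _ = sample_conjunction(n_vars=8, k=3)
print(make_prompt(f, n_vars=8, n_examples=16, query=[1] * 8))
```

A teaching sequence, in this framing, would be a curated set of examples whose labels jointly identify the target conjunction uniquely within the class.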
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference
paper_authors: Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally
results: The results show that inference performance and energy costs of LLaMA models vary substantially across GPUs and datasets. Using model sharding, inference can be run across up to 32 GPUs, improving performance and reducing energy costs.
Abstract
Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art. These technologies are increasingly being leveraged in various domains such as law, finance, and medicine. However, these models carry significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs already receive less attention than the energy costs of training LLMs -- despite how often these large models are called on to conduct inference in reality (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost-savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and inference energy costs of different sizes of LLaMA -- a recent state-of-the-art LLM -- developed by Meta AI on two generations of popular GPUs (NVIDIA V100 & A100) and two datasets (Alpaca and GSM8K) to reflect the diverse set of tasks/benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.
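As a sketch of how per-inference GPU energy can be measured in the spirit of this benchmark, the snippet below uses NVML's cumulative energy counter (supported on Volta and newer GPUs such as the V100 and A100); `run_inference` is a placeholder for the model call, not the authors' harness.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def measure_energy_mj(fn, *args):
    # nvmlDeviceGetTotalEnergyConsumption returns cumulative energy in
    # millijoules since the driver was loaded; the difference around the
    # call approximates the energy consumed by GPU 0 during inference.
    start = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    out = fn(*args)
    end = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    return out, end - start

# outputs, energy_mj = measure_energy_mj(run_inference, batch)
```

For multi-GPU sharded inference, one counter per device would be read and summed across all participating GPUs.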
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
results: The paper demonstrates zero-shot multi-entity subject-driven generation without any modification to the image decoder, which allows a seamless substitution of CLIP and effortless integration with a variety of U-Net techniques, from fine-grained controls to personalized image decoder variants.
Abstract
Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation."
Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
results: The paper finds that, with pretraining, vanilla Transformers can match the performance of State Space Models (SSMs) on Long Range Arena, and it improves the best reported SSM result on the PathX-256 task by 20 absolute points. It also finds that, given data-driven initialization from pretraining, previously proposed structured parameterizations for SSMs become largely redundant.
Abstract
Modeling long-range dependencies across sequences is a longstanding goal in machine learning and has led to architectures, such as state space models, that dramatically outperform Transformers on long sequences. However, these impressive empirical gains have been by and large demonstrated on benchmarks (e.g. Long Range Arena), where models are randomly initialized and trained to predict a target label from an input sequence. In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using $\textit{only the downstream task data}$, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained, and we improve the best reported results of SSMs on the PathX-256 task by 20 absolute points. Subsequently, we analyze the utility of previously-proposed structured parameterizations for SSMs and show they become mostly redundant in the presence of data-driven initialization obtained through pretraining. Our work shows that, when evaluating different architectures on supervised tasks, incorporation of data-driven priors via pretraining is essential for reliable performance estimation, and can be done efficiently.
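As a concrete illustration of pretraining with a standard denoising objective on only the downstream task data, the helper below produces (corrupted input, reconstruction target) pairs via simple token masking; the masking ratio and scheme are illustrative, not the paper's exact recipe.

```python
import torch

def mask_tokens(x, mask_id, ratio=0.15):
    # x: [B, T] token ids from the downstream task's own input sequences.
    mask = torch.rand(x.shape) < ratio
    # Targets are the original tokens at masked positions; -100 is the
    # conventional "ignore" index for cross-entropy loss.
    targets = torch.where(mask, x, torch.full_like(x, -100))
    inputs = torch.where(mask, torch.full_like(x, mask_id), x)
    return inputs, targets
```

Pretraining on these pairs before the supervised phase supplies the data-driven prior that, per the paper, closes most of the gap between Transformers and SSMs.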
T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation
results: The paper introduces T$^3$Bench, a comprehensive text-to-3D benchmark, and proposes two automatic metrics for evaluating text-to-3D models: a multi-view image quality score and a text-3D alignment score. Both metrics correlate highly with human judgments, enabling efficient evaluation of text-to-3D models.
Abstract
Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, thereby presenting a challenge in quantitatively addressing the question: How has current progress in Text-to-3D gone so far? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and Large Language Model (LLM) evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgments, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among six prevalent text-to-3D methods. Our analysis further highlights the common struggles for current methods on generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.
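To illustrate the multi-view text-image scoring component of the quality metric, a minimal CLIP-based sketch follows; `views` is assumed to be a list of preprocessed renderings of the generated 3D content, and the paper's full metric additionally uses regional convolution and an LLM-based alignment score, both omitted here.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def multiview_score(prompt, views):
    txt = model.encode_text(tokenizer([prompt]))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    imgs = model.encode_image(torch.stack(views))
    imgs = imgs / imgs.norm(dim=-1, keepdim=True)
    sims = (imgs @ txt.T).squeeze(-1)  # one CLIP score per rendered view
    # Mean reflects overall prompt fidelity; spread across views is a crude
    # proxy for view inconsistency.
    return sims.mean().item(), sims.std().item()
```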
UniverSLU: Universal Spoken Language Understanding for Diverse Classification and Sequence Generation Tasks with a Single Network
results: The study shows that the single multi-task learning (MTL) model "UniverSLU" achieves competitive performance on 12 speech classification and sequence generation tasks, even surpassing task-specific models.
Abstract
Recent studies have demonstrated promising outcomes by employing large language models with multi-tasking capabilities. They utilize prompts to guide the model's behavior and surpass performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly perform various spoken language understanding (SLU) tasks? To address this, we utilize pre-trained automatic speech recognition (ASR) models and employ various task and dataset specifiers as discrete prompts. We demonstrate efficacy of our single multi-task learning (MTL) model "UniverSLU" for 12 different speech classification and sequence generation tasks across 17 datasets and 9 languages. Results show that UniverSLU achieves competitive performance and even surpasses task-specific models. We also conduct preliminary investigations into enabling human-interpretable natural phrases instead of task specifiers as discrete prompts and test the model's generalization capabilities to new paraphrases.
Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model
results: Experimental results show that prompting achieves a 53% relative improvement in word error rate for ASR and a 27% improvement in F1 score for slot filling on sequence generation tasks. Prompting also competes with fine-tuning in the low-resource scenario. In addition, the paper demonstrates the transferability of prompting and adapter tuning to cross-lingual ASR across languages.
Abstract
Prompting and adapter tuning have emerged as efficient alternatives to fine-tuning (FT) methods. However, existing studies on speech prompting focused on classification tasks and failed on more complex sequence generation tasks. Besides, adapter tuning is primarily applied with a focus on encoder-only self-supervised models. Our experiments show that prompting on Wav2Seq, a self-supervised encoder-decoder model, surpasses previous works in sequence generation tasks. It achieves a remarkable 53% relative improvement in word error rate for ASR and a 27% in F1 score for slot filling. Additionally, prompting competes with the FT method in the low-resource scenario. Moreover, we show the transferability of prompting and adapter tuning on Wav2Seq in cross-lingual ASR. When limited trainable parameters are involved, prompting and adapter tuning consistently outperform conventional FT across 7 languages. Notably, in the low-resource scenario, prompting consistently outperforms adapter tuning.
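A minimal sketch of the prompting idea applied to a frozen speech model: only a handful of prompt vectors, prepended to the input features, are trained. The shapes and the way the encoder consumes features are assumptions for illustration, not the paper's exact Wav2Seq setup.

```python
import torch

class PromptedEncoder(torch.nn.Module):
    def __init__(self, encoder, n_prompts=10, dim=768):
        super().__init__()
        self.encoder = encoder  # frozen pre-trained model
        for p in self.encoder.parameters():
            p.requires_grad = False
        # The only trainable parameters: a small bank of prompt vectors.
        self.prompts = torch.nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, feats):  # feats: [B, T, dim] input features
        b = feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.encoder(torch.cat([prompts, feats], dim=1))
```

Adapter tuning follows the same freeze-the-backbone philosophy but instead inserts small trainable bottleneck layers inside each block.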
DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning
for: This paper focuses on improving the automatic selection of exemplars for in-context learning in natural language processing, specifically using Large Language Models (LLMs).
methods: The proposed method, called Dual Queries and Low-rank approximation Re-ranking (DQ-LoRe), utilizes two stages of querying: first, LLM-generated knowledge is obtained through Dual Queries, and then, the retriever is queried to obtain final exemplars that align with the input question’s knowledge. Additionally, LoRe employs dimensionality reduction techniques to refine exemplar selection.
results: The proposed DQ-LoRe method significantly outperforms prior state-of-the-art methods in selecting exemplars for GPT-4, with a performance increase from 92.5% to 94.2%. The method also consistently outperforms retrieval-based approaches in terms of both performance and adaptability, especially in scenarios with distribution shifts.
Abstract
Recent advances in natural language processing, primarily propelled by Large Language Models (LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising avenue for guiding LLMs in intricate reasoning tasks involves the utilization of intermediate reasoning steps within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the effective selection of exemplars for facilitating in-context learning. In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking (DQ-LoRe) to automatically select exemplars for in-context learning. Dual Queries first query LLM to obtain LLM-generated knowledge such as CoT, then query the retriever to obtain the final exemplars via both question and the knowledge. Moreover, for the second query, LoRe employs dimensionality reduction techniques to refine exemplar selection, ensuring close alignment with the input question's knowledge. Through extensive experiments, we demonstrate that DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%. Our comprehensive analysis further reveals that DQ-LoRe consistently outperforms retrieval-based approaches in terms of both performance and adaptability, especially in scenarios characterized by distribution shifts. DQ-LoRe pushes the boundaries of in-context learning and opens up new avenues for addressing complex reasoning challenges. We will release the code soon.
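As a rough sketch of the LoRe re-ranking step (dimensionality reduction over exemplar embeddings followed by similarity re-ranking), the snippet below uses PCA as the reduction technique; the embeddings, the choice of PCA, and the rank are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

def lore_rerank(query_emb, cand_embs, k=8, rank=64):
    # query_emb: embedding of the question plus LLM-generated CoT knowledge
    # cand_embs: [N, D] embeddings of candidate exemplars from the first query
    n_comp = min(rank, cand_embs.shape[0], cand_embs.shape[1])
    pca = PCA(n_components=n_comp)
    z = pca.fit_transform(np.vstack([query_emb[None, :], cand_embs]))
    q, c = z[0], z[1:]
    sims = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]  # indices of the re-ranked top-k exemplars
```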
JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning
paper_authors: Chang Gao, Wenxuan Zhang, Guizhen Chen, Wai Lam
for: This paper aims to improve the performance of large language models (LLMs) in various tasks by providing explicit task instructions through a novel structure-to-structure approach called JsonTuning.
methods: The JsonTuning approach leverages the versatility and structured nature of JSON to represent tasks, enhancing generalization, improving robustness, and increasing controllability over the output.
results: The experimental results show that JsonTuning outperforms TextTuning in various applications, demonstrating improved performance, adaptability, robustness, and controllability.
Abstract
Instruction tuning has emerged as a crucial process for harnessing the capabilities of large language models (LLMs) by providing explicit task instructions, leading to improved performance in various tasks. However, prevalent text-to-text instruction tuning (TextTuning) methods suffer from limitations in generalization, robustness, and controllability due to the ambiguity and lack of explicit structure in tasks. In this paper, we propose JsonTuning, a novel structure-to-structure approach for instruction tuning. By leveraging the versatility and structured nature of JSON to represent tasks, JsonTuning enhances generalization by helping the model understand essential task elements and their relations, improves robustness by minimizing ambiguity, and increases controllability by providing explicit control over the output. We conduct a comprehensive comparative study with diverse language models and evaluation benchmarks. Experimental results show that JsonTuning outperforms TextTuning in various applications, showcasing improved performance, adaptability, robustness, and controllability. By overcoming the limitations of TextTuning, JsonTuning demonstrates significant potential for more effective and reliable LLMs capable of handling diverse scenarios.
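To make the structure-to-structure idea concrete, here is a hypothetical example of what a JSON-formatted task instance could look like; the schema (field names, label space) is an assumption for illustration, not the paper's specification.

```python
import json

# Both the task input and the expected output are explicit JSON structures,
# so the essential task elements and their relations are unambiguous.
example = {
    "input": {
        "task": "named entity recognition",
        "text": "Barack Obama was born in Hawaii.",
        "label_space": ["person", "location", "organization"],
    },
    "output": {
        "entities": [
            {"span": "Barack Obama", "type": "person"},
            {"span": "Hawaii", "type": "location"},
        ]
    },
}
print(json.dumps(example, indent=2))
```

Compared with free-form text instructions, the explicit `label_space` and output schema give the model direct control signals at both training and inference time.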
A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4
for: This paper is written to provide a comprehensive survey of recent research progress in the field of GPT-3 family large language models (GLLMs), including their performances in various downstream tasks, domains, and languages.
methods: The paper uses a brief overview of transformers, transfer learning, self-supervised learning, pretrained language models, and large language models as foundation concepts, and discusses the data labelling and data augmentation abilities, robustness, effectiveness, and future research directions of GLLMs.
results: The paper presents a comprehensive overview of the recent research progress in GLLMs, including their performances in various downstream tasks, domains, and languages, and provides insightful future research directions for the field.
Abstract
Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. LLMs, because of their large size and pretraining on large volumes of text data, exhibit special abilities which allow them to achieve remarkable performances without any task-specific training in many of the natural language processing tasks. The era of LLMs started with OpenAI GPT-3 model, and the popularity of LLMs is increasing exponentially after the introduction of models like ChatGPT and GPT4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey which summarizes the recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey paper with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss the performances of GLLMs in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. To summarize, this comprehensive survey paper will serve as a good resource for both academic and industry people to stay updated with the latest research related to GPT-3 family large language models.
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
paper_authors: Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
for: This paper aims to evaluate the punctuation and capitalization prediction capabilities of end-to-end automatic speech recognition (ASR) models.
methods: The paper introduces the LibriSpeech-PC dataset and proposes a new evaluation metric called Punctuation Error Rate (PER) to assess the accuracy of punctuation prediction.
results: The paper provides initial baseline models and demonstrates the usefulness of the proposed metric through tests on the LibriSpeech-PC dataset.
Abstract
Traditional automatic speech recognition (ASR) models output lower-cased words without punctuation marks, which reduces readability and necessitates a subsequent text processing model to convert ASR transcripts into a proper format. Simultaneously, the development of end-to-end ASR models capable of predicting punctuation and capitalization presents several challenges, primarily due to limited data availability and shortcomings in the existing evaluation methods, such as inadequate assessment of punctuation prediction. In this paper, we introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models. The benchmark includes a LibriSpeech-PC dataset with restored punctuation and capitalization, a novel evaluation metric called Punctuation Error Rate (PER) that focuses on punctuation marks, and initial baseline models. All code, data, and models are publicly available.
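A minimal sketch of a punctuation error rate in the spirit of the PER metric described above: restrict both reference and hypothesis to their punctuation marks and compute an edit-distance-based error rate over those tokens. The exact definition used by LibriSpeech-PC may differ in its details.

```python
PUNCT = set(".,?!")

def punct_tokens(text):
    return [c for c in text if c in PUNCT]

def per(ref, hyp):
    # Levenshtein distance over punctuation tokens only, normalized by the
    # number of reference punctuation marks (analogous to WER over words).
    r, h = punct_tokens(ref), punct_tokens(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(per("Hello, world. How are you?", "Hello world. How are you!"))
```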
Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation
paper_authors: Aman Khullar, Daniel Nkemelu, Cuong V. Nguyen, Michael L. Best
for: Improving the performance of hate speech detection in limited data contexts
methods: Using data generation techniques to synthesize new examples of hate speech data in the target language, while retaining the hate sentiment in the original examples
results: Training a hate speech classification model using the synthesized data shows comparable or even better performance than training only on the limited data available in the target domain, which can help bootstrap hate speech detection models from scratch in limited data contexts.
Abstract
A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to only a select number of highly-resourced languages causing detection systems to either under-perform or not exist in limited data contexts. This is majorly caused by a lack of training data which is expensive to collect and curate in these settings. In this work, we propose a data augmentation approach that addresses the problem of lack of data for online hate speech detection in limited data contexts using synthetic data generation techniques. Given a handful of hate speech examples in a high-resource language such as English, we present three methods to synthesize new examples of hate speech data in a target language that retains the hate sentiment in the original examples but transfers the hate targets. We apply our approach to generate training data for hate speech classification tasks in Hindi and Vietnamese. Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain. This method can be adopted to bootstrap hate speech detection models from scratch in limited data contexts. As the growth of social media within these contexts continues to outstrip response efforts, this work furthers our capacities for detection, understanding, and response to hate speech.
Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness
methods: BLOOD is based on between-layer transformation smoothness, exploiting the property that representations of in-distribution (ID) data transform more smoothly than those of OOD data, which the paper also demonstrates empirically for Transformer networks.
results: On text classification tasks, BLOOD outperforms methods with comparable resource requirements. The analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.
Abstract
Effective OOD detection is crucial for reliable machine learning models, yet most current methods are limited in practical use due to requirements like access to training data or intervention in training. We present a novel method for detecting OOD data in deep neural networks based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD utilizes the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically for Transformer networks. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.
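The following is a simplified proxy for the between-layer smoothness intuition behind BLOOD: it measures how much mean-pooled representations change between consecutive Transformer layers. The actual BLOOD score is computed from the smoothness of the between-layer transformations themselves, so this snippet illustrates the idea rather than reproducing the paper's estimator.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)

@torch.no_grad()
def layer_changes(text):
    hs = model(**tok(text, return_tensors="pt")).hidden_states  # tuple of [1,T,D]
    reps = [h.mean(dim=1) for h in hs]  # mean-pool over tokens per layer
    # Norm of the representation change at each layer transition; under the
    # BLOOD hypothesis these changes behave differently for ID vs. OOD inputs.
    return [(reps[i + 1] - reps[i]).norm().item() for i in range(len(reps) - 1)]

print(layer_changes("An in-distribution example sentence."))
```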
Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation
results: Experimental results show that the proposed model outperforms previous methods, achieving state-of-the-art performance on two benchmark datasets for ERC.
Abstract
Emotion Recognition in Conversation (ERC) plays an important role in driving the development of human-machine interaction. Emotions can exist in multiple modalities, and multimodal ERC mainly faces two problems: (1) the noise problem in the cross-modal information fusion process, and (2) the prediction problem of less sample emotion labels that are semantically similar but different categories. To address these issues and fully utilize the features of each modality, we adopted the following strategies: first, deep emotion cues extraction was performed on modalities with strong representation ability, and feature filters were designed as multimodal prompt information for modalities with weak representation ability. Then, we designed a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, allowing prompt information to participate in encoding textual features and being fused with multi-level textual information to obtain better multimodal fusion features. Finally, we used the Hybrid Contrastive Learning (HCL) strategy to optimize the model's ability to handle labels with few samples. This strategy uses unsupervised contrastive learning to improve the representation ability of multimodal fusion and supervised contrastive learning to mine the information of labels with few samples. Experimental results show that our proposed model outperforms state-of-the-art models in ERC on two benchmark datasets.
DOMINO: A Dual-System for Multi-step Visual Language Reasoning
paper_authors: Peifang Wang, Olga Golovneva, Armen Aghajanyan, Xiang Ren, Muhao Chen, Asli Celikyilmaz, Maryam Fazel-Zarandi
for: This paper proposes a multi-step multimodal reasoning method for extracting information from information-dense images such as charts and plots and performing logical or arithmetic reasoning over it.
methods: The paper uses a dual-system approach consisting of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning. Given an input, System-2 breaks the question down into atomic sub-steps, each guiding System-1 to extract from the image the information required for reasoning.
results: Experiments show that the method, with a pre-trained System-2 module, performs competitively on chart and plot datasets compared with prior work. By fine-tuning the System-2 module (LLaMA-2 70B) on only a small amount of multi-step reasoning data, the method's accuracy improves further, surpassing the best fully-supervised end-to-end approach by 5.7% and a pipeline approach by 7.5% on a challenging dataset.
Abstract
Visual language reasoning requires a system to extract text or numbers from information-dense images like charts or plots and perform logical or arithmetic reasoning to arrive at an answer. To tackle this task, existing work relies on either (1) an end-to-end vision-language model trained on a large amount of data, or (2) a two-stage pipeline where a captioning model converts the image into text that is further read by another large language model to deduce the answer. However, the former approach forces the model to answer a complex question with one single step, and the latter approach is prone to inaccurate or distracting information in the converted text that can confuse the language model. In this work, we propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning. Given an input, System-2 breaks down the question into atomic sub-steps, each guiding System-1 to extract the information required for reasoning from the image. Experiments on chart and plot datasets show that our method with a pre-trained System-2 module performs competitively compared to prior work on in- and out-of-distribution data. By fine-tuning the System-2 module (LLaMA-2 70B) on only a small amount of data on multi-step reasoning, the accuracy of our method is further improved and surpasses the best fully-supervised end-to-end approach by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5% on a challenging dataset with human-authored questions.
Low Resource Summarization using Pre-trained Language Models
results: The paper presents a baseline methodology for automatic summarization in low-resource languages under limited-resource constraints, achieving evaluation scores on par with state-of-the-art models for the high-resource language English (PEGASUS: 47.21, BART: 45.14 on the XSUM dataset), and provides a reproducible methodology applicable to other low-resource languages.
Abstract
With the advent of Deep Learning based Artificial Neural Networks models, Natural Language Processing (NLP) has witnessed significant improvements in textual data processing in terms of its efficiency and accuracy. However, the research is mostly restricted to high-resource languages such as English and low-resource languages still suffer from a lack of available resources in terms of training datasets as well as models with even baseline evaluation results. Considering the limited availability of resources for low-resource languages, we propose a methodology for adapting self-attentive transformer-based architecture models (mBERT, mT5) for low-resource summarization, supplemented by the construction of a new baseline dataset (76.5k article, summary pairs) in a low-resource language Urdu. Choosing news (a publicly available source) as the application domain has the potential to make the proposed methodology useful for reproducing in other languages with limited resources. Our adapted summarization model \textit{urT5} with up to 44.78\% reduction in size as compared to \textit{mT5} can capture contextual information of low resource language effectively with evaluation score (up to 46.35 ROUGE-1, 77 BERTScore) at par with state-of-the-art models in high resource language English \textit{(PEGASUS: 47.21, BART: 45.14 on XSUM Dataset)}. The proposed method provided a baseline approach towards extractive as well as abstractive summarization with competitive evaluation results in a limited resource setup.
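A sketch of abstractive summarization with an mT5 checkpoint, in the spirit of the adapted urT5 model above. The checkpoint name and generation settings are placeholders: the raw pretrained model must first be fine-tuned on (article, summary) pairs such as the paper's Urdu dataset before generation like this produces useful summaries.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def summarize(article, max_len=64):
    inputs = tok(article, return_tensors="pt",
                 truncation=True, max_length=512)
    ids = model.generate(**inputs, max_length=max_len, num_beams=4)
    return tok.decode(ids[0], skip_special_tokens=True)
```

The paper's size reduction (up to 44.78% versus mT5) would come from adapting the multilingual vocabulary and model to the single target language.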
The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
results: The authors propose a new compositionality metric that does not rely on linguistic priors.
Abstract
Compositionality is a common property in many modalities including natural languages and images, but the compositional generalization of multi-modal models is not well-understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image. We also propose a new metric for compositionality without such linguistic priors.
Comparative Study and Framework for Automated Summariser Evaluation: LangChain and Hybrid Algorithms
methods: This study uses Large Language Models for analysis, leveraging the Langchain tool to summarize PDF documents and extract their essential information in order to gauge how well a user understands the summarized content.
results: This study can help learners gauge their understanding of a given topic and can help education professionals further enhance learning ability.
Abstract
Automated Essay Score (AES) is proven to be one of the cutting-edge technologies. Scoring techniques are used for various purposes. Reliable scores are calculated based on influential variables. Such variables can be computed by different methods based on the domain. The research is concentrated on the user's understanding of a given topic. The analysis is based on a scoring index by using Large Language Models. The user can then compare and contrast the understanding of a topic that they recently learned. The results are then contributed towards learning analytics and progression is made for enhancing the learning ability. In this research, the focus is on summarizing a PDF document and gauging a user's understanding of its content. The process involves utilizing a Langchain tool to summarize the PDF and extract the essential information. By employing this technique, the research aims to determine how well the user comprehends the summarized content.
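A minimal sketch of the Langchain-based PDF summarization step described above, using the classic `langchain` package layout (newer releases relocate these classes into companion packages such as `langchain-community` and `langchain-openai`); the file path and model choice are placeholders.

```python
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader

# Split the PDF into page-level documents, then summarize map-reduce style:
# each chunk is summarized independently and the partial summaries are merged.
docs = PyPDFLoader("lecture_notes.pdf").load_and_split()
llm = ChatOpenAI(temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(docs)
print(summary)
```

The resulting summary would then feed the scoring step that gauges the user's comprehension of the document.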
LC-Score: Reference-less estimation of Text Comprehension Difficulty
paper_authors: Paul Tardy, Charlotte Roze, Paul Poupet
for: This paper aims to improve text comprehension for readers with comprehension issues, particularly in the French language.
methods: The paper proposes a simple approach called \textsc{LC-Score} to train text comprehension metrics for any French text without reference. The approach uses linguistically motivated indicators to train statistical models, as well as neural learning directly from text leveraging pre-trained language models.
results: The paper finds that both approaches (indicator-based and neural) outperform commonly used readability and comprehension metrics such as FKGL, based on two human annotation experiments.
Abstract
Being able to read and understand written text is critical in a digital era. However, studies show that a large fraction of the population experiences comprehension issues. In this context, further accessibility initiatives are required to improve the audience's text comprehension. However, writers are hardly assisted or encouraged to produce easy-to-understand content. Moreover, Automatic Text Simplification (ATS) model development suffers from the lack of a metric to accurately estimate comprehension difficulty. We present \textsc{LC-Score}, a simple approach for training a text comprehension metric for any French text without reference, \ie predicting how easy to understand a given text is on a $[0, 100]$ scale. Our objective with this scale is to quantitatively capture the extent to which a text conforms to the \textit{Langage Clair} (LC, \textit{Clear Language}) guidelines, a French initiative closely related to English Plain Language. We explore two approaches: (i) using linguistically motivated indicators to train statistical models, and (ii) neural learning directly from text, leveraging pre-trained language models. We introduce a simple proxy task, framing comprehension-difficulty training as a classification task. To evaluate our models, we run two distinct human annotation experiments, and find that both approaches (indicator-based and neural) outperform commonly used readability and comprehension metrics such as FKGL.
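A hedged sketch of the first (indicator-based) approach: hand-crafted linguistic features feed a statistical regressor that outputs a $[0, 100]$ clarity score. The specific features and model below are illustrative assumptions, not the authors' indicator set:

```python
# Hypothetical sketch of an indicator-based comprehension-difficulty model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def indicators(text: str) -> list:
    """A few illustrative, linguistically motivated surface features."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(len(sentences), 1)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    long_word_ratio = sum(len(w) > 8 for w in words) / max(len(words), 1)
    return [avg_sent_len, avg_word_len, long_word_ratio]

# train_texts: annotated French texts; train_scores: 0-100 clarity labels
X = np.array([indicators(t) for t in train_texts])
model = GradientBoostingRegressor().fit(X, train_scores)
lc_score = float(np.clip(model.predict([indicators(new_text)])[0], 0, 100))
```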
COVID-19 South African Vaccine Hesitancy Models Show Boost in Performance Upon Fine-Tuning on M-pox Tweets
results: After fine-tuning, F1-scores increased by more than 8%, reaching a high of 69.6% and outperforming state-of-the-art models and well-known classification algorithms.
Abstract
Very large numbers of M-pox cases have, since the start of May 2022, been reported in non-endemic countries, leading many to fear that the M-pox outbreak would rapidly transition into another pandemic while the COVID-19 pandemic ravages on. Given the similarities of M-pox with COVID-19, we chose to test the performance of COVID-19 models trained on South African Twitter data on a hand-labelled M-pox dataset, before and after fine-tuning. More than 20k M-pox-related tweets from South Africa were hand-labelled as positive, negative, or neutral. After fine-tuning these COVID-19 models on the M-pox dataset, the F1-scores increased by more than 8%, falling just short of 70% while still outperforming state-of-the-art models and well-known classification algorithms. An LDA-based topic modelling procedure was used to compare the misclassified M-pox tweets of the original COVID-19 RoBERTa model with those of its fine-tuned version, and from this analysis we were able to draw conclusions on how to build more sophisticated models.
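The fine-tuning step itself is standard; below is a minimal Hugging Face sketch, assuming a RoBERTa-style checkpoint and a three-class tweet dataset (the checkpoint name and the `train_ds`/`eval_ds` variables are placeholders):

```python
# Hypothetical sketch: fine-tune a COVID-19 sentiment model on M-pox tweets.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "covid19-za-roberta"  # placeholder for the COVID-19 model checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

def tokenize(batch):  # positive / negative / neutral labels in batch["label"]
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(output_dir="mpox-ft", num_train_epochs=3,
                         per_device_train_batch_size=32)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,  # pads batches
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=eval_ds.map(tokenize, batched=True))
trainer.train()  # F1 on held-out M-pox tweets is compared before vs. after
```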
AGIR: Automating Cyber Threat Intelligence Reporting with Natural Language Generation
results: AGIR accurately conveys information expressed in formal language, improves report fluency and utility, and substantially reduces the time needed to write CTI reports, increasing the efficiency of CTI production.
Abstract
Cyber Threat Intelligence (CTI) reporting is pivotal in contemporary risk management strategies. As the volume of CTI reports continues to surge, the demand for automated tools to streamline report generation becomes increasingly apparent. While Natural Language Processing techniques have shown potential in handling text data, they often struggle to address the complexity of diverse data sources and their intricate interrelationships. Moreover, established paradigms like STIX have emerged as de facto standards within the CTI community, emphasizing the formal categorization of entities and relations to facilitate consistent data sharing. In this paper, we introduce AGIR (Automatic Generation of Intelligence Reports), a transformative Natural Language Generation tool specifically designed to address the pressing challenges in the realm of CTI reporting. AGIR's primary objective is to empower security analysts by automating the labor-intensive task of generating comprehensive intelligence reports from formal representations of entity graphs. AGIR utilizes a two-stage pipeline, combining the advantages of template-based approaches with the capabilities of Large Language Models such as ChatGPT. We evaluate AGIR's report generation capabilities both quantitatively and qualitatively. The generated reports accurately convey information expressed through formal language, achieving a high recall value (0.99) without introducing hallucinations. Furthermore, we compare the fluency and utility of the reports with state-of-the-art approaches, showing how AGIR achieves higher scores both in terms of Syntactic Log-Odds Ratio (SLOR) and in questionnaire responses. By using our tool, we estimate that report writing time is reduced by more than 40%, thereby streamlining CTI production in any organization and contributing to the automation of several CTI tasks.
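The two-stage design (template realization of the entity graph, then LLM smoothing) can be sketched as follows; the graph schema, templates, and prompt wording are illustrative assumptions, not AGIR's actual internals:

```python
# Hypothetical sketch of a template-then-LLM reporting pipeline.
TEMPLATES = {
    "uses": "{source} uses {target}.",
    "targets": "{source} targets {target}.",
}

def realize(edges):
    """Stage 1: render (source, relation, target) triples from a STIX-like
    entity graph into faithful but stilted sentences."""
    return " ".join(TEMPLATES[rel].format(source=s, target=t)
                    for s, rel, t in edges)

def smooth(facts, llm):
    """Stage 2: ask an LLM to rewrite the facts fluently, instructing it
    not to add information (protecting recall, guarding against hallucination)."""
    prompt = ("Rewrite the following threat-intelligence facts as a fluent "
              "report. Do not add, remove, or alter any fact.\n\n" + facts)
    return llm(prompt)

edges = [("APT29", "uses", "WellMess"), ("APT29", "targets", "healthcare")]
draft = realize(edges)          # faithful draft from the formal representation
# report = smooth(draft, llm)   # fluent prose, facts preserved
```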
I$^2$KD-SLU: An Intra-Inter Knowledge Distillation Framework for Zero-Shot Cross-Lingual Spoken Language Understanding
results: On the MultiATIS++ dataset, the proposed framework significantly improves overall accuracy compared with strong baselines and sets a new state of the art for cross-lingual SLU.
Abstract
Spoken language understanding (SLU) typically includes two subtasks: intent detection and slot filling. It has achieved great success in high-resource languages, but remains challenging in low-resource languages due to the scarcity of labeled training data; hence, there is growing interest in zero-shot cross-lingual SLU. Despite the success of existing zero-shot cross-lingual SLU models, most of them neglect the mutual guidance between intents and slots. To address this issue, we propose an Intra-Inter Knowledge Distillation framework for zero-shot cross-lingual Spoken Language Understanding (I$^2$KD-SLU) to model this mutual guidance. Specifically, we not only apply intra-knowledge distillation between the intent predictions or slot predictions of the same utterance in different languages, but also apply inter-knowledge distillation between the intent predictions and slot predictions of the same utterance. Our experimental results demonstrate that our proposed framework significantly improves performance compared with strong baselines and achieves new state-of-the-art performance on the MultiATIS++ dataset, obtaining a significant improvement over the previous best model in overall accuracy.
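A hedged PyTorch sketch of the two distillation terms: intra-KD aligns the same prediction type across languages, while inter-KD aligns the intent and slot predictions of the same utterance. The token-level slot pooling and the shared projection heads below are our assumptions for illustration; the paper's exact formulation may differ.

```python
# Hypothetical sketch of intra-/inter-knowledge-distillation losses.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-softened KL divergence, the usual distillation loss."""
    log_p = F.log_softmax(student_logits / T, dim=-1)
    q = F.softmax(teacher_logits.detach() / T, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * T * T

def i2kd_terms(intent_src, intent_tgt, slot_src, slot_tgt, proj_i, proj_s):
    # intent_*: (batch, n_intents); slot_*: (batch, seq_len, n_slots)
    slot_src_u = slot_src.mean(dim=1)  # pool token-level slot logits
    slot_tgt_u = slot_tgt.mean(dim=1)
    # Intra-KD: same task, same utterance, different languages.
    intra = kd_loss(intent_tgt, intent_src) + kd_loss(slot_tgt_u, slot_src_u)
    # Inter-KD: intent vs. slot predictions of the same utterance, after
    # projecting both into a shared space (proj_i, proj_s: nn.Linear heads).
    inter = kd_loss(proj_i(intent_tgt), proj_s(slot_tgt_u))
    return intra, inter
```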
NOLA: Networks as Linear Combination of Low Rank Random Basis
for: This paper aims to reduce the parameter and storage costs of adapting Large Language Models (LLMs), enabling efficient adaptation and strong performance across diverse downstream tasks and domains.
methods: The paper builds on LoRA, which faces two main limitations: 1) the parameter reduction is lower-bounded by a rank-one decomposition, and 2) the extent of reduction is heavily influenced by the model architecture and the chosen rank. The paper therefore introduces NOLA, which re-parameterizes the low-rank matrices as linear combinations of randomly generated basis matrices and optimizes only the linear mixture coefficients, decoupling the number of trainable parameters from both the chosen rank and the network architecture.
results: Adaptation results with GPT-2 and ViT on natural language and computer vision tasks show that NOLA performs as well as or better than models with equivalent parameter counts, and in larger models it can halve the parameter count relative to rank-one LoRA without sacrificing performance.
Abstract
Large Language Models (LLMs) have recently gained popularity due to their impressive few-shot performance across various downstream tasks. However, fine-tuning all parameters and storing a unique model for each downstream task or domain becomes impractical because of the massive size of checkpoints (e.g., 350GB in GPT-3). Current literature, such as LoRA, showcases the potential of low-rank modifications to the original weights of an LLM, enabling efficient adaptation and storage for task-specific models. These methods can reduce the number of parameters needed to fine-tune an LLM by several orders of magnitude. Yet, these methods face two primary limitations: 1) the parameter reduction is lower-bounded by the rank one decomposition, and 2) the extent of reduction is heavily influenced by both the model architecture and the chosen rank. For instance, in larger models, even a rank one decomposition might exceed the number of parameters truly needed for adaptation. In this paper, we introduce NOLA, which overcomes the rank one lower bound present in LoRA. It achieves this by re-parameterizing the low-rank matrices in LoRA using linear combinations of randomly generated matrices (basis) and optimizing the linear mixture coefficients only. This approach allows us to decouple the number of trainable parameters from both the choice of rank and the network architecture. We present adaptation results using GPT-2 and ViT in natural language and computer vision tasks. NOLA performs as well as, or better than models with equivalent parameter counts. Furthermore, we demonstrate that we can halve the parameters in larger models compared to LoRA with rank one, without sacrificing performance.
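The core re-parameterization is compact enough to sketch: freeze $k$ random low-rank basis pairs and train only the $2k$ mixture coefficients, so the trainable-parameter count is independent of the rank and the layer width. This is a sketch of the idea, not the authors' implementation:

```python
# Hypothetical sketch of a NOLA-style adapter: the low-rank update is a
# linear combination of frozen random bases; only alpha/beta are trained.
import torch
import torch.nn as nn

class NOLALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, k: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # frozen pretrained weight
        out_f, in_f = base.weight.shape
        g = torch.Generator().manual_seed(0)     # bases regenerable from a seed
        self.register_buffer("A", torch.randn(k, out_f, rank, generator=g) / rank)
        self.register_buffer("B", torch.randn(k, rank, in_f, generator=g) / in_f)
        # beta = 0 makes the update start at zero (as in LoRA), while a
        # nonzero alpha keeps gradients flowing to both coefficient sets.
        self.alpha = nn.Parameter(torch.randn(k) / k)
        self.beta = nn.Parameter(torch.zeros(k))

    def forward(self, x):
        A = torch.einsum("k,kor->or", self.alpha, self.A)  # mix the bases
        B = torch.einsum("k,kri->ri", self.beta, self.B)
        return self.base(x) + x @ (A @ B).T
```

Since the bases are regenerable from the RNG seed, only the $2k$ coefficients (plus the seed) would need to be stored per adapted matrix.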