results: Our study shows that the PTNN approach improves accuracy in BERT and ViT models by up to 5%, without any post-training adjustments. These results make a new contribution to the field of tensor decomposition.
Abstract
The transformer architecture has revolutionized Natural Language Processing (NLP) and other machine-learning tasks, due to its unprecedented accuracy. However, their extensive memory and parameter requirements often hinder their practical applications. In this work, we study the effect of tensor-train decomposition to improve the accuracy and compress transformer vision-language neural networks, namely BERT and ViT. We focus both on embedding-layer compression and partial tensorization of neural networks (PTNN) through an algorithmic approach. Our novel PTNN approach significantly improves the accuracy of existing models by up to 5%, all without the need for post-training adjustments, breaking new ground in the field of tensor decomposition.
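As a hedged illustration of the embedding-layer compression described above, the sketch below applies plain TT-SVD to an embedding table reshaped into a higher-order tensor. The factor shapes, rank cap, and hypothetical table size are illustrative assumptions; the paper's exact PTNN procedure, including which transformer layers are tensorized, may differ.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Factor an n-way tensor into tensor-train cores via sequential truncated SVDs."""
    dims = tensor.shape
    cores, rank = [], 1
    mat = tensor.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(rank, dims[k], r))          # 3-way TT core
        mat = (np.diag(s[:r]) @ vt[:r]).reshape(r * dims[k + 1], -1)
        rank = r
    cores.append(mat.reshape(rank, dims[-1], 1))                  # final core
    return cores

# Hypothetical embedding table: a 30720-word vocabulary (BERT's 30522, padded)
# times 256 dimensions, reshaped into the 4 modes 32 x 32 x 30 x 256.
emb = np.random.randn(32, 32, 30, 256).astype(np.float32)
cores = tt_svd(emb, max_rank=64)
full, compressed = emb.size, sum(c.size for c in cores)
print(f"parameters: {full} -> {compressed} ({full / compressed:.1f}x smaller)")
```

The cores replace the dense table; reconstructing (or directly contracting) them recovers an approximation whose quality is controlled by the rank cap.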
Automatic Evaluation of Generative Models with Instruction Tuning
results: The study finds that instruction tuning on the HEAP dataset (which spans a variety of NLG tasks and evaluation criteria) yields good performance, although some evaluation criteria are harder to learn than others. Furthermore, jointly training on multiple tasks provides additional performance improvements, which may benefit future tasks with limited human-annotated data.
Abstract
Automatic evaluation of natural language generation has long been an elusive goal in NLP. A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion. Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning. To test our approach, we collected HEAP, a dataset of human judgements across various NLG tasks and evaluation criteria. Our findings demonstrate that instruction tuning language models on HEAP yields good performance on many evaluation tasks, though some criteria are less trivial to learn than others. Further, jointly training on multiple tasks can yield additional performance improvements, which can be beneficial for future tasks with little to no human annotated data.
Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection
results: Experimental results show that AdaICL improves accuracy by 4.4 points over SOTA (a 7.7% relative improvement) and makes better use of a limited annotation budget, selecting more informative examples and thereby improving effectiveness.
Abstract
Large Language Models (LLMs) can adapt to new tasks via in-context learning (ICL). ICL is efficient as it does not require any parameter updates to the trained LLM, but only few annotated examples as input for the LLM. In this work, we investigate an active learning approach for ICL, where there is a limited budget for annotating examples. We propose a model-adaptive optimization-free algorithm, termed AdaICL, which identifies examples that the model is uncertain about, and performs semantic diversity-based example selection. Diversity-based sampling improves overall effectiveness, while uncertainty sampling improves budget efficiency and helps the LLM learn new information. Moreover, AdaICL poses its sampling strategy as a Maximum Coverage problem, that dynamically adapts based on the model's feedback and can be approximately solved via greedy algorithms. Extensive experiments on nine datasets and seven LLMs show that AdaICL improves performance by 4.4% accuracy points over SOTA (7.7% relative improvement), is up to 3x more budget-efficient than performing annotations uniformly at random, while it outperforms SOTA with 2x fewer ICL examples.
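As a rough illustration of the Maximum Coverage view of example selection described above, the sketch below greedily picks the annotation candidates that cover the most still-uncovered "uncertain" examples. The data structure (a map from candidate to the set of uncertain points it covers) and the budget are hypothetical; AdaICL's actual uncertainty scoring and model-feedback loop are more involved.

```python
def greedy_max_coverage(coverage, budget):
    """coverage: {candidate_id: set of uncertain examples it covers (e.g. semantic neighbors)}.
    Greedily select up to `budget` candidates maximizing the number of covered examples;
    this greedy rule gives the classic (1 - 1/e) approximation guarantee."""
    covered, selected = set(), []
    remaining = dict(coverage)
    for _ in range(min(budget, len(remaining))):
        # Pick the candidate that adds the most new coverage.
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        if not remaining[best] - covered:          # nothing new left to cover
            break
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Toy usage: candidates cover overlapping sets of uncertain examples.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
print(greedy_max_coverage(coverage, budget=2))     # (['a', 'b'], {1, 2, 3, 4})
```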
Early Detection of Depression and Eating Disorders in Spanish: UNSL at MentalRiskES 2023
results: For Tasks 1 and 2, our approach achieved the second-best performance in the rankings based on both classification and latency, demonstrating the effectiveness and consistency of our approaches for early detection problems in Spanish.
Abstract
MentalRiskES is a novel challenge that proposes to solve problems related to early risk detection for the Spanish language. The objective is to detect, as soon as possible, Telegram users who show signs of mental disorders considering different tasks. Task 1 involved the users' detection of eating disorders, Task 2 focused on depression detection, and Task 3 aimed at detecting an unknown disorder. These tasks were divided into subtasks, each one defining a resolution approach. Our research group participated in subtask A for Tasks 1 and 2: a binary classification problem that evaluated whether the users were positive or negative. To solve these tasks, we proposed models based on Transformers followed by a decision policy according to criteria defined by an early detection framework. One of the models presented an extended vocabulary with important words for each task to be solved. In addition, we applied a decision policy based on the history of predictions that the model performs during user evaluation. For Tasks 1 and 2, we obtained the second-best performance according to rankings based on classification and latency, demonstrating the effectiveness and consistency of our approaches for solving early detection problems in the Spanish language.
Generative retrieval-augmented ontologic graph and multi-agent strategies for interpretive large language model-based materials design
For: The paper explores the use of large language models (LLMs) as a tool for engineering analysis of materials, specifically for retrieving key information, developing research hypotheses, discovering mechanistic relationships, and writing and executing simulation codes.
Methods: The paper uses a fine-tuned model called MechGPT, which is developed based on training data in the mechanics of materials domain. The authors also employ retrieval-augmented Ontological Knowledge Graph strategies to address the issue of LLMs recalling correct information outside the context of learned matter.
Results: The paper shows that LLMs can provide powerful problem solution strategies for applications in analysis and design problems, and that retrieval-augmented Ontological Knowledge Graph strategies can provide an interpretable graph structure with rich information at the node, edge, and subgraph level. The authors also discuss nonlinear sampling strategies and agent-based modeling applied to complex question answering, code generation, and execution in the context of automated force field development from actively learned DFT modeling, and data analysis.
Abstract
Transformer neural networks show promising capabilities, in particular for uses in materials analysis, design and manufacturing, including their capacity to work effectively with both human language, symbols, code, and numerical data. Here we explore the use of large language models (LLMs) as a tool that can support engineering analysis of materials, applied to retrieving key information about subject areas, developing research hypotheses, discovery of mechanistic relationships across disparate areas of knowledge, and writing and executing simulation codes for active knowledge generation based on physical ground truths. When used as sets of AI agents with specific features, capabilities, and instructions, LLMs can provide powerful problem solution strategies for applications in analysis and design problems. Our experiments focus on using a fine-tuned model, MechGPT, developed based on training data in the mechanics of materials domain. We first affirm how finetuning endows LLMs with reasonable understanding of domain knowledge. However, when queried outside the context of learned matter, LLMs can have difficulty to recall correct information. We show how this can be addressed using retrieval-augmented Ontological Knowledge Graph strategies that discern how the model understands what concepts are important and how they are related. Illustrated for a use case of relating distinct areas of knowledge - here, music and proteins - such strategies can also provide an interpretable graph structure with rich information at the node, edge and subgraph level. We discuss nonlinear sampling strategies and agent-based modeling applied to complex question answering, code generation and execution in the context of automated force field development from actively learned Density Functional Theory (DFT) modeling, and data analysis.
Strategies to Harness the Transformers’ Potential: UNSL at eRisk 2023
results: In this work, we obtained good performance on both tasks in terms of decision-based metrics, ranking-based metrics, and runtime.
Abstract
The CLEF eRisk Laboratory explores solutions to different tasks related to risk detection on the Internet. In the 2023 edition, Task 1 consisted of searching for symptoms of depression, the objective of which was to extract user writings according to their relevance to the BDI Questionnaire symptoms. Task 2 was related to the problem of early detection of pathological gambling risks, where the participants had to detect users at risk as quickly as possible. Finally, Task 3 consisted of estimating the severity levels of signs of eating disorders. Our research group participated in the first two tasks, proposing solutions based on Transformers. For Task 1, we applied different approaches that can be interesting in information retrieval tasks. Two proposals were based on the similarity of contextualized embedding vectors, and the other one was based on prompting, an attractive current technique of machine learning. For Task 2, we proposed three fine-tuned models followed by decision policy according to criteria defined by an early detection framework. One model presented extended vocabulary with important words to the addressed domain. In the last task, we obtained good performances considering the decision-based metrics, ranking-based metrics, and runtime. In this work, we explore different ways to deploy the predictive potential of Transformers in eRisk tasks.
The Impact of Depth and Width on Transformer Language Model Generalization
results: The study finds that, after fine-tuning, deeper models generalize better out-of-distribution, but the relative benefit of additional layers diminishes rapidly as depth grows. Within each family, deeper models also achieve better language modeling performance, again with diminishing returns. Finally, the benefits of depth for compositional generalization cannot be attributed solely to better language modeling performance or to performance on in-distribution data.
Abstract
To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by recent theoretical and empirical work, that transformers generalize more compositionally when they are deeper (have more layers). Because simply adding layers increases the total number of parameters, confounding depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize better out-of-distribution than shallower models do, but the relative benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling or on in-distribution data.
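To make the depth-for-width trade-off concrete, the snippet below uses the standard rough estimate of about 12·d_model² non-embedding parameters per transformer layer (attention plus feed-forward) to find the width that keeps the parameter count roughly constant as depth changes. The per-layer formula and the target budget are simplifying assumptions, not the paper's exact model configurations.

```python
import math

def width_for_budget(n_layers, param_budget, per_layer_factor=12):
    """Approximate d_model so that n_layers * factor * d_model^2 ~= param_budget."""
    return int(math.sqrt(param_budget / (n_layers * per_layer_factor)))

budget = 134_000_000   # one of the paper's reported sizes, treated here as a rough target
for depth in (6, 12, 24, 48):
    width = width_for_budget(depth, budget)
    approx = depth * 12 * width ** 2
    print(f"depth={depth:2d}  width~{width:4d}  ~{approx/1e6:.0f}M non-embedding params")
```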
Split-NER: Named Entity Recognition via Two Question-Answering-based Classifications
results: Experimental results show that this two-step approach is more effective than the baselines, outperforming them on OntoNotes5.0, WNUT17, and a cybersecurity dataset, delivering comparable performance on BioNLP13CG, and significantly reducing training time.
Abstract
In this work, we address the NER problem by splitting it into two logical sub-tasks: (1) Span Detection which simply extracts entity mention spans irrespective of entity type; (2) Span Classification which classifies the spans into their entity types. Further, we formulate both sub-tasks as question-answering (QA) problems and produce two leaner models which can be optimized separately for each sub-task. Experiments with four cross-domain datasets demonstrate that this two-step approach is both effective and time efficient. Our system, SplitNER outperforms baselines on OntoNotes5.0, WNUT17 and a cybersecurity dataset and gives on-par performance on BioNLP13CG. In all cases, it achieves a significant reduction in training time compared to its QA baseline counterpart. The effectiveness of our system stems from fine-tuning the BERT model twice, separately for span detection and classification. The source code can be found at https://github.com/c3sr/split-ner.
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
paper_authors: Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger
for: The goal of this work is to explore the use of prompting and score extraction in NLP, particularly for machine translation and summarization evaluation.
methods: The study uses several established large language models and applies them to prompting and score extraction.
results: The study finds that, even with the large language models restricted to an allowed list, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models.
Abstract
With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.
What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
results: The researchers find that popular vision-language pretraining corpora such as LAION-2B contain little reliable data for learning spatial relationships. They also find that basic modeling interventions, such as up-weighting preposition-containing instances or fine-tuning on these corpora, are not sufficient to address the challenges posed by these test sets.
Abstract
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also, the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https://github.com/amitakamath/whatsup_vlms.
Chain-of-Thought Embeddings for Stance Detection on Social Media
results: The study achieves SOTA stance detection performance on multiple datasets collected from social media.
Abstract
Stance detection on social media is challenging for Large Language Models (LLMs), as emerging slang and colloquial language in online conversations often contain deeply implicit stance labels. Chain-of-Thought (COT) prompting has recently been shown to improve performance on stance detection tasks -- alleviating some of these issues. However, COT prompting still struggles with implicit stance identification. This challenge arises because many samples are initially challenging to comprehend before a model becomes familiar with the slang and evolving knowledge related to different topics, all of which need to be acquired through the training data. In this study, we address this problem by introducing COT Embeddings which improve COT performance on stance detection tasks by embedding COT reasonings and integrating them into a traditional RoBERTa-based stance detection pipeline. Our analysis demonstrates that 1) text encoders can leverage COT reasonings with minor errors or hallucinations that would otherwise distort the COT output label. 2) Text encoders can overlook misleading COT reasoning when a sample's prediction heavily depends on domain-specific patterns. Our model achieves SOTA performance on multiple stance detection datasets collected from social media.
Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation
results: The study finds that, by leveraging LLMs, CoEval can effectively evaluate lengthy texts, saving significant time and improving the reliability of evaluation, while human scrutiny still plays a role in ensuring the reliability of the final results.
Abstract
Humans are widely involved in the evaluation of open-ended natural language generation tasks (NLG) that demand creativity, as automatic metrics often exhibit weak correlations with human judgments. Large language models (LLMs) recently have emerged as a scalable and cost-effective alternative to human evaluations. However, both humans and LLMs have limitations, i.e., inherent subjectivity and unreliable judgments, particularly for open-ended tasks that require adaptable metrics tailored to diverse task requirements. To explore the synergy between humans and LLM-based evaluators and address the challenges of existing inconsistent evaluation criteria in open-ended NLG tasks, we propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts, in which LLM generates initial ideation, and then humans engage in scrutiny. We conducted a series of experiments to investigate the mutual effects between LLMs and humans in CoEval. Results show that, by utilizing LLMs, CoEval effectively evaluates lengthy texts, saving significant time and reducing human evaluation outliers. Human scrutiny still plays a role, revising around 20% of LLM evaluation scores for ultimate reliability.
Combining Language Models For Specialized Domains: A Colorful Approach
results: Experimental results show that this approach is highly effective at integrating domain-specific terms and jargon into language tasks, and that it substantially lowers the error rate on domain-specific words without compromising performance in the general domain.
Abstract
General purpose language models (LMs) encounter difficulties when processing domain-specific jargon and terminology, which are frequently utilized in specialized fields such as medicine or industrial settings. Moreover, they often find it challenging to interpret mixed speech that blends general language with specialized jargon. This poses a challenge for automatic speech recognition systems operating within these specific domains. In this work, we introduce a novel approach that integrates domain-specific or secondary LM into general-purpose LM. This strategy involves labeling, or "coloring", each word to indicate its association with either the general or the domain-specific LM. We develop an optimized algorithm that enhances the beam search algorithm to effectively handle inferences involving colored words. Our evaluations indicate that this approach is highly effective in integrating jargon into language tasks. Notably, our method substantially lowers the error rate for domain-specific words without compromising performance in the general domain.
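One plausible reading of the "coloring" idea, sketched below, is that during beam-search scoring each next-word candidate is scored by the LM it is colored for (the domain LM for jargon, the general LM otherwise), with an optional interpolation of log-probabilities. The function names, the word-level coloring rule, and the mixing weight are illustrative assumptions; the paper's optimized handling of colored words inside beam search is not reproduced here.

```python
def colored_score(word, history, general_lm, domain_lm, domain_vocab, lam=0.7):
    """Score one next-word candidate for beam search.

    `general_lm` and `domain_lm` are assumed to expose log_prob(word, history);
    `domain_vocab` holds the words 'colored' as domain-specific (hypothetical interface)."""
    if word in domain_vocab:
        # Lean on the domain LM for jargon, keeping some general-LM signal.
        return lam * domain_lm.log_prob(word, history) + \
               (1 - lam) * general_lm.log_prob(word, history)
    return general_lm.log_prob(word, history)

def extend_beam(beam, candidates, **lms):
    """beam: list of (history, cumulative_log_prob). Returns the re-ranked, pruned beam."""
    extended = [(hist + [w], score + colored_score(w, hist, **lms))
                for hist, score in beam
                for w in candidates]
    return sorted(extended, key=lambda x: x[1], reverse=True)[:len(beam)]
```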
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations
methods: The analysis covers both the continuous embedding space and the discrete token space, and shows that these methods are strictly less expressive than full fine-tuning even with the same number of learnable parameters.
results: The study finds that, while context-based fine-tuning methods can effectively elicit skills already present in the pretrained model, they cannot learn novel tasks because they cannot change the model's relative attention patterns over the content.
Abstract
Context-based fine-tuning methods, including prompting, in-context learning, soft prompting (also known as prompt tuning), and prefix-tuning, have gained popularity due to their ability to often match the performance of full fine-tuning with a fraction of the parameters. Despite their empirical successes, there is little theoretical understanding of how these techniques influence the internal computation of the model and their expressiveness limitations. We show that despite the continuous embedding space being more expressive than the discrete token space, soft-prompting and prefix-tuning are strictly less expressive than full fine-tuning, even with the same number of learnable parameters. Concretely, context-based fine-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction. This suggests that while techniques like prompting, in-context learning, soft prompting, and prefix-tuning can effectively elicit skills present in the pretrained model, they cannot learn novel tasks that require new attention patterns.
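The core structural claim, that prepending trainable prefixes can only mix a fixed "prefix direction" into the attention output without changing the relative attention pattern over the content tokens, can be checked numerically. The sketch below verifies the decomposition for a single attention head with random vectors; the single-query, unscaled-dot-product setting is a simplification of the full analysis.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n, p = 8, 5, 3                                   # head dim, content tokens, prefix tokens
q = rng.normal(size=d)                              # a single query vector
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))     # content keys/values
Kp, Vp = rng.normal(size=(p, d)), rng.normal(size=(p, d))   # trained prefix keys/values

# Attention over [prefix; content]
w = softmax(np.concatenate([Kp, K]) @ q)
full = w @ np.concatenate([Vp, V])

# Identical to a convex combination of content-only and prefix-only attention:
# the prefix only rescales the content attention and adds a bias toward span(Vp).
alpha = w[:p].sum()                                 # total mass on prefix positions
content_only = softmax(K @ q) @ V
prefix_only = softmax(Kp @ q) @ Vp
assert np.allclose(full, (1 - alpha) * content_only + alpha * prefix_only)
print(f"prefix mass alpha = {alpha:.3f}")
```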
Sentiment Analysis in Digital Spaces: An Overview of Reviews
for: This study is a systematic review that summarizes 38 systematic reviews and 2,275 primary studies.
methods: The study uses a bespoke quality assessment framework to evaluate the rigor and quality of systematic review methodologies and reporting standards.
results: The study finds diverse applications and methods, limited reporting quality, and challenges over time.
Abstract
Sentiment analysis (SA) is commonly applied to digital textual data, revealing insight into opinions and feelings. Many systematic reviews have summarized existing work, but often overlook discussions of validity and scientific practices. Here, we present an overview of reviews, synthesizing 38 systematic reviews, containing 2,275 primary studies. We devise a bespoke quality assessment framework designed to assess the rigor and quality of systematic review methodologies and reporting standards. Our findings show diverse applications and methods, limited reporting rigor, and challenges over time. We discuss how future research and practitioners can address these issues and highlight their importance across numerous applications.
MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
paper_authors: Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, Tobias Gerstenberg
For: The paper aims to investigate how well large language models (LLMs) align with human intuitions in making causal and moral judgments about text-based scenarios.
Methods: The paper uses a dataset of stories from 24 cognitive science papers and develops a system to annotate each story with the factors investigated. The authors then test the alignment of LLMs with human participants' judgments using statistical analyses.
Results: The results show that while LLMs have improved in aligning with human participants' judgments in recent years, they still weigh the different factors quite differently. The study demonstrates the importance of curated, challenge datasets combined with insights from cognitive science to evaluate LLMs' performance and understand their implicit tendencies.
Abstract
Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable. We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. On the aggregate level, alignment has improved with more recent LLMs. However, using statistical analyses, we find that LLMs weigh the different factors quite differently from human participants. These results show how curated, challenge datasets combined with insights from cognitive science can help us go beyond comparisons based merely on aggregate metrics: we uncover LLMs implicit tendencies and show to what extent these align with human intuitions.
Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck
results: On 12 diverse datasets, using GPT-4 for both concept generation and measurement, TBMs can rival established black-box baselines such as few-shot GPT-4 and finetuned DeBERTa, while falling short against finetuned GPT-3.5. Overall, our findings suggest that TBMs are a promising new framework that enhances interpretability with minimal performance tradeoffs, particularly for general-domain text classification.
Abstract
Deep neural networks excel in text classification tasks, yet their application in high-stakes domains is hindered by their lack of interpretability. To address this, we propose Text Bottleneck Models (TBMs), an intrinsically interpretable text classification framework that offers both global and local explanations. Rather than directly predicting the output label, TBMs predict categorical values for a sparse set of salient concepts and use a linear layer over those concept values to produce the final prediction. These concepts can be automatically discovered and measured by a Large Language Model (LLM), without the need for human curation. On 12 diverse datasets, using GPT-4 for both concept generation and measurement, we show that TBMs can rival the performance of established black-box baselines such as GPT-4 fewshot and finetuned DeBERTa, while falling short against finetuned GPT-3.5. Overall, our findings suggest that TBMs are a promising new framework that enhances interpretability, with minimal performance tradeoffs, particularly for general-domain text.
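A minimal version of the bottleneck, sketched below, represents each document as a vector of concept scores and fits an interpretable linear classifier on top. The `score_concepts` helper stands in for the LLM call that TBMs use to rate each concept; the keyword heuristic, concept list, and toy data are purely hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical concepts; in TBMs these would be discovered and scored by an LLM.
CONCEPTS = {
    "mentions price":        ["price", "cost", "expensive"],
    "expresses frustration": ["frustrating", "annoyed", "useless"],
    "requests a refund":     ["refund", "money back"],
}

def score_concepts(text):
    """Stand-in for an LLM call that rates each concept for `text` (here: keyword match)."""
    t = text.lower()
    return np.array([float(any(k in t for k in kws)) for kws in CONCEPTS.values()])

train_texts = ["the price is too high and I want a refund",
               "great product, fair price",
               "this is so frustrating, nothing works"]
train_labels = [1, 0, 1]                                  # toy labels: 1 = negative review

X = np.stack([score_concepts(t) for t in train_texts])
clf = LogisticRegression().fit(X, train_labels)

# Global explanation: one linear weight per named concept.
for name, w in zip(CONCEPTS, clf.coef_[0]):
    print(f"{name:25s} weight = {w:+.2f}")
# Local explanation for a new document: its concept scores combined with those weights.
print("prediction:", clf.predict([score_concepts("totally useless, give me my money back")]))
```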
Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace
results: The study finds that, although data volume and parameter scale directly affect overall model performance, some abilities are more responsive to increases in data and parameters while others are highly resistant to these changes. In addition, human-curated data can consistently improve model performance as its volume grows, an effect that synthetic data cannot achieve.
Abstract
Instruction tuning is a burgeoning method to elicit the general intelligence of Large Language Models (LLMs). However, the creation of instruction data is still largely heuristic, leading to significant variation in quality and distribution across existing datasets. Experimental conclusions drawn from these datasets are also inconsistent, with some studies emphasizing the importance of scaling instruction numbers, while others argue that a limited number of samples suffice. To better understand data construction guidelines, we deepen our focus from the overall model performance to the growth of each underlying ability, such as creative writing, code generation, and logical reasoning. We systematically investigate the effects of data volume, parameter size, and data construction methods on the development of various abilities, using hundreds of model checkpoints (7b to 33b) fully instruction-tuned on a new collection of over 40k human-curated instruction data. This proposed dataset is stringently quality-controlled and categorized into ten distinct LLM abilities. Our study reveals three primary findings: (i) Despite data volume and parameter scale directly impacting models' overall performance, some abilities are more responsive to their increases and can be effectively trained using limited data, while some are highly resistant to these changes. (ii) Human-curated data strongly outperforms synthetic data from GPT-4 in efficiency and can constantly enhance model performance with volume increases, but is unachievable with synthetic data. (iii) Instruction data brings powerful cross-ability generalization, with evaluation results on out-of-domain data mirroring the first two observations. Furthermore, we demonstrate how these findings can guide more efficient data constructions, leading to practical performance improvements on public benchmarks.
KeyGen2Vec: Learning Document Embedding via Multi-label Keyword Generation in Question-Answering
results: Our experimental results show that KeyGen2Vec is in general superior to a multi-label keyword classifier, by up to 14.7% in terms of Purity, Normalized Mutual Information (NMI), and F1-Score. Interestingly, although the absolute advantage of learning embeddings with label supervision is generally large across evaluation datasets, KeyGen2Vec is competitive with a classifier that exploits topic label supervision on Yahoo! cQA, which has a larger number of latent topic labels.
Abstract
Representing documents into high dimensional embedding space while preserving the structural similarity between document sources has been an ultimate goal for many works on text representation learning. Current embedding models, however, mainly rely on the availability of label supervision to increase the expressiveness of the resulting embeddings. In contrast, unsupervised embeddings are cheap, but they often cannot capture implicit structure in target corpus, particularly for samples that come from different distribution with the pretraining source. Our study aims to loosen up the dependency on label supervision by learning document embeddings via Sequence-to-Sequence (Seq2Seq) text generator. Specifically, we reformulate keyphrase generation task into multi-label keyword generation in community-based Question Answering (cQA). Our empirical results show that KeyGen2Vec in general is superior than multi-label keyword classifier by up to 14.7% based on Purity, Normalized Mutual Information (NMI), and F1-Score metrics. Interestingly, although in general the absolute advantage of learning embeddings through label supervision is highly positive across evaluation datasets, KeyGen2Vec is shown to be competitive with classifier that exploits topic label supervision in Yahoo! cQA with larger number of latent topic labels.
results: For the speech denoising task, the DPATD model is more efficient than existing methods and handles long audio sequences better.
Abstract
Recent high-performance transformer-based speech enhancement models demonstrate that time domain methods could achieve similar performance as time-frequency domain methods. However, time-domain speech enhancement systems typically receive input audio sequences consisting of a large number of time steps, making it challenging to model extremely long sequences and train models to perform adequately. In this paper, we utilize smaller audio chunks as input to achieve efficient utilization of audio information to address the above challenges. We propose a dual-phase audio transformer for denoising (DPATD), a novel model to organize transformer layers in a deep structure to learn clean audio sequences for denoising. DPATD splits the audio input into smaller chunks, where the input length can be proportional to the square root of the original sequence length. Our memory-compressed explainable attention is efficient and converges faster compared to the frequently used self-attention module. Extensive experiments demonstrate that our model outperforms state-of-the-art methods.
Improving Input-label Mapping with Demonstration Replay for In-context Learning
methods: We propose a novel ICL method, Repeated Demonstration with Sliding Causal Attention (RdSca). We duplicate later demonstrations and concatenate them to the front, allowing the model to `observe' later information even under the causal restriction. In addition, we introduce sliding causal attention, which customizes causal attention to avoid information leakage.
results: Our method significantly improves the input-label mapping in ICL demonstrations. We also conduct an in-depth analysis of how to customize causal attention without training, an area unexplored in previous research.
Abstract
In-context learning (ICL) is an emerging capability of large autoregressive language models where a few input-label demonstrations are appended to the input to enhance the model's understanding of downstream NLP tasks, without directly adjusting the model parameters. The effectiveness of ICL can be attributed to the strong language modeling capabilities of large language models (LLMs), which enable them to learn the mapping between input and labels based on in-context demonstrations. Despite achieving promising results, the causal nature of language modeling in ICL restricts the attention to be backward only, i.e., a token only attends to its previous tokens, failing to capture the full input-label information and limiting the model's performance. In this paper, we propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention, (RdSca). Specifically, we duplicate later demonstrations and concatenate them to the front, allowing the model to `observe' the later information even under the causal restriction. Besides, we introduce sliding causal attention, which customizes causal attention to avoid information leakage. Experimental results show that our method significantly improves the input-label mapping in ICL demonstrations. We also conduct an in-depth analysis of how to customize the causal attention without training, which has been an unexplored area in previous research.
A Novel Representation to Improve Team Problem Solving in Real-Time
results: A case study shows that the representation helps in understanding and improving a team's behavior.
Abstract
This paper proposes a novel representation to support computing metrics that help understanding and improving in real-time a team's behavior during problem solving in real-life. Even though teams are important in modern activities, there is little computing aid to improve their activity. The representation captures the different mental images developed, enhanced, and utilized during solving. A case study illustrates the representation.
InfoEntropy Loss to Mitigate Bias of Learning Difficulties for Generative Language Models
results: On the Pile dataset, experiments with models trained at different scales show that adding the InfoEntropy Loss to generative language model training yields consistent improvements on downstream tasks.
Abstract
Generative language models are usually pretrained on large text corpus via predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpus during training, i.e., the imbalance between frequent tokens and infrequent ones. It can lead a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. To alleviate that, we propose an Information Entropy Loss (InfoEntropy Loss) function. During training, it can dynamically assess the learning difficulty of a to-be-learned token, according to the information entropy of the corresponding predicted probability distribution over the vocabulary. Then it scales the training loss adaptively, trying to lead the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at different scales of 436M, 1.1B, and 6.7B parameters. Experiments reveal that models incorporating the proposed InfoEntropy Loss can gain consistent performance improvement on downstream benchmarks.
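A minimal PyTorch sketch of the idea: weight each token's cross-entropy by the entropy of the model's predicted distribution over the vocabulary, so that confidently predicted (easy) tokens are down-weighted relative to uncertain ones. The normalization by log|V|, detaching the weight, and the exact weighting scheme are assumptions; the paper's formulation of the InfoEntropy Loss may differ in detail.

```python
import math
import torch
import torch.nn.functional as F

def info_entropy_loss(logits, targets, ignore_index=-100):
    """logits: (batch, seq, vocab); targets: (batch, seq) next-token ids."""
    vocab = logits.size(-1)
    flat_logits = logits.view(-1, vocab)
    flat_targets = targets.view(-1)

    ce = F.cross_entropy(flat_logits, flat_targets,
                         reduction="none", ignore_index=ignore_index)

    # Entropy of the predicted distribution per position, normalized to [0, 1].
    log_p = F.log_softmax(flat_logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1) / math.log(vocab)

    mask = (flat_targets != ignore_index).float()
    weights = entropy.detach() * mask              # do not backprop through the weights
    return (weights * ce).sum() / mask.sum().clamp(min=1.0)

# Toy usage
logits = torch.randn(2, 4, 100, requires_grad=True)
targets = torch.randint(0, 100, (2, 4))
info_entropy_loss(logits, targets).backward()
```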
results: We conduct experiments with a diverse range of LLMs, including ChatGPT, GPT-4, OPT, LLaMA, and Alpaca, comparing their performance against state-of-the-art constituency parsers. The experiments cover zero-shot, few-shot, and full-training learning settings, and the models are evaluated on one in-domain and five out-of-domain test sets. The results reveal the LLMs' performance, their generalization abilities, and the challenges they face in constituency parsing.
Abstract
Constituency parsing is a fundamental yet unsolved natural language processing task. In this paper, we explore the potential of recent large language models (LLMs) that have exhibited remarkable performance across various domains and tasks to tackle this task. We employ three linearization strategies to transform output trees into symbol sequences, such that LLMs can solve constituency parsing by generating linearized trees. We conduct experiments using a diverse range of LLMs, including ChatGPT, GPT-4, OPT, LLaMA, and Alpaca, comparing their performance against the state-of-the-art constituency parsers. Our experiments encompass zero-shot, few-shot, and full-training learning settings, and we evaluate the models on one in-domain and five out-of-domain test datasets. Our findings reveal insights into LLMs' performance, generalization abilities, and challenges in constituency parsing.
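As an example of the linearization step described above, the helper below turns a nested constituency tree into a bracketed symbol sequence, which is the kind of string an LLM can be asked to generate. The bracket format is one common choice; the paper's three linearization strategies are not reproduced exactly.

```python
def linearize(tree):
    """tree: either a (label, children...) tuple or a terminal token string."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(linearize(c) for c in children) + ")"

tree = ("S",
        ("NP", ("DT", "The"), ("NN", "cat")),
        ("VP", ("VBD", "sat"),
               ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "mat")))))

print(linearize(tree))
# (S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))
```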
Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings
results: Our experiments show that latent bootstrapping can effectively acquire linguistic knowledge from limited resources. We experiment on the BabyLM shared task, which involves pretraining on two small curated corpora and evaluating on four linguistic benchmarks.
Abstract
This paper explores the use of latent bootstrapping, an alternative self-supervision technique, for pretraining language models. Unlike the typical practice of using self-supervision on discrete subwords, latent bootstrapping leverages contextualized embeddings for a richer supervision signal. We conduct experiments to assess how effective this approach is for acquiring linguistic knowledge from limited resources. Specifically, our experiments are based on the BabyLM shared task, which includes pretraining on two small curated corpora and an evaluation on four linguistic benchmarks.
A Lightweight Method to Generate Unanswerable Questions in English
results: Compared to the prior state of the art, models trained on data generated with this method perform better (+1.6 F1 points on SQuAD 2.0 with BERT-large), and the generated questions have higher human-judged relatedness and readability.
Abstract
If a question cannot be answered with the available information, robust systems for question answering (QA) should know _not_ to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.
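A sketch of the augmentation idea, under simplifying assumptions: swap a word for an antonym, or an entity for another entity of the same type, so the question no longer matches the passage and therefore becomes unanswerable. The tiny antonym map and entity lists are illustrative stand-ins for WordNet antonyms and an NER tagger.

```python
import random

ANTONYMS = {"largest": "smallest", "first": "last", "highest": "lowest"}   # stand-in for WordNet
ENTITIES = {"PERSON": ["Marie Curie", "Alan Turing"],
            "GPE": ["France", "Japan"]}                                    # stand-in for NER output

def make_unanswerable(question, question_entities):
    """question_entities: list of (surface_form, type) found in the question."""
    words = question.split()
    # Prefer an antonym swap; fall back to swapping an entity of the same type.
    for i, w in enumerate(words):
        if w.lower() in ANTONYMS:
            words[i] = ANTONYMS[w.lower()]
            return " ".join(words)
    for surface, etype in question_entities:
        candidates = [e for e in ENTITIES.get(etype, []) if e != surface]
        if candidates:
            return question.replace(surface, random.choice(candidates))
    return None   # no safe swap found

q = "What is the largest city in France ?"
print(make_unanswerable(q, [("France", "GPE")]))
# -> "What is the smallest city in France ?"
```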
results: The study reports extensive evaluation results, including the detailed training setups and the evaluation results of the sentence embedding models.
Abstract
We report the development of Japanese SimCSE, Japanese sentence embedding models fine-tuned with SimCSE. Since there is a lack of sentence embedding models for Japanese that can be used as a baseline in sentence embedding research, we conducted extensive experiments on Japanese sentence embeddings involving 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets. In this report, we provide the detailed training setup for Japanese SimCSE and their evaluation results.
Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES
results: The results show that systems achieve comparable performance in translating naturalistic feminine and masculine gender forms, but generating gender-inclusive translations remains a challenge, indicating room for future improvement and research.
Abstract
As part of the WMT-2023 "Test suites" shared task, in this paper we summarize the results of two test suites evaluations: MuST-SHE-WMT23 and INES. By focusing on the en-de and de-en language pairs, we rely on these newly created test suites to investigate systems' ability to translate feminine and masculine gender and produce gender-inclusive translations. Furthermore we discuss metrics associated with our test suites and validate them by means of human evaluations. Our results indicate that systems achieve reasonable and comparable performance in correctly translating both feminine and masculine gender forms for naturalistic gender phenomena. Instead, the generation of inclusive language forms in translation emerges as a challenging task for all the evaluated MT models, indicating room for future improvements and research on the topic.
Fusing Temporal Graphs into Transformers for Time-Sensitive Question Answering
results: The results show that our proposed approach substantially enhances the temporal reasoning capabilities of Transformer models, with or without fine-tuning. Moreover, our method outperforms various graph convolution-based approaches and establishes new state-of-the-art performance on SituatedQA and three splits of TimeQA.
Abstract
Answering time-sensitive questions from long documents requires temporal reasoning over the times in questions and documents. An important open question is whether large language models can perform such reasoning solely using a provided text document, or whether they can benefit from additional temporal information extracted using other systems. We address this research question by applying existing temporal information extraction systems to construct temporal graphs of events, times, and temporal relations in questions and documents. We then investigate different approaches for fusing these graphs into Transformer models. Experimental results show that our proposed approach for fusing temporal graphs into input text substantially enhances the temporal reasoning capabilities of Transformer models with or without fine-tuning. Additionally, our proposed method outperforms various graph convolution-based approaches and establishes a new state-of-the-art performance on SituatedQA and three splits of TimeQA.
Learning to love diligent trolls: Accounting for rater effects in the dialogue safety task
results: Experiments show that, when trolls are consistent, the AES-like approach can infer training labels with high accuracy, even when trolls are the majority.
Abstract
Chatbots have the risk of generating offensive utterances, which must be avoided. Post-deployment, one way for a chatbot to continuously improve is to source utterance/label pairs from feedback by live users. However, among users are trolls, who provide training examples with incorrect labels. To de-troll training data, previous work removed training examples that have high user-aggregated cross-validation (CV) error. However, CV is expensive; and in a coordinated attack, CV may be overwhelmed by trolls in number and in consistency among themselves. In the present work, I address both limitations by proposing a solution inspired by methodology in automated essay scoring (AES): have multiple users rate each utterance, then perform latent class analysis (LCA) to infer correct labels. As it does not require GPU computations, LCA is inexpensive. In experiments, I found that the AES-like solution can infer training labels with high accuracy when trolls are consistent, even when trolls are the majority.
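A compact sketch of the latent-class idea for binary safety labels: treat the true label of each utterance as a hidden variable, model each rater with a sensitivity and specificity, and alternate between estimating rater reliability and re-estimating label posteriors (a Dawid-Skene-style EM, one standard way to realize the latent class analysis described above). The initialization, smoothing, and binary restriction are simplifications of the full LCA setup.

```python
import numpy as np

def infer_labels(ratings, n_iter=30, eps=1e-3):
    """ratings: {(item, rater): 0 or 1}.  Returns {item: P(true label = 1)}."""
    items = sorted({i for i, _ in ratings})
    raters = sorted({r for _, r in ratings})
    by_item = {i: [(r, y) for (i2, r), y in ratings.items() if i2 == i] for i in items}
    post = {i: np.mean([y for _, y in by_item[i]]) for i in items}   # majority-vote init

    for _ in range(n_iter):
        prior = np.clip(np.mean(list(post.values())), eps, 1 - eps)
        sens, spec = {}, {}
        for r in raters:                                    # M-step: per-rater reliability
            obs = [(post[i], y) for (i, r2), y in ratings.items() if r2 == r]
            sens[r] = np.clip(sum(p * y for p, y in obs) /
                              max(sum(p for p, _ in obs), eps), eps, 1 - eps)
            spec[r] = np.clip(sum((1 - p) * (1 - y) for p, y in obs) /
                              max(sum(1 - p for p, _ in obs), eps), eps, 1 - eps)
        for i in items:                                     # E-step: per-item posteriors
            log1, log0 = np.log(prior), np.log(1 - prior)
            for r, y in by_item[i]:
                log1 += np.log(sens[r] if y else 1 - sens[r])
                log0 += np.log(1 - spec[r] if y else spec[r])
            post[i] = 1.0 / (1.0 + np.exp(log0 - log1))
    return post

# Two reliable raters and one consistent troll who flips every label.
ratings = {("u1", "a"): 1, ("u1", "b"): 1, ("u1", "troll"): 0,
           ("u2", "a"): 0, ("u2", "b"): 0, ("u2", "troll"): 1}
print(infer_labels(ratings))
```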
Moral Judgments in Narratives on Reddit: Investigating Moral Sparks via Social Commonsense and Linguistic Signals
results: The study finds that event-related negative personal traits (e.g., immature and rude) attract attention and stimulate blame, implying a dependent relationship between moral sparks and blameworthiness. In addition, language that shapes how commenters perceive events and characters increases the probability of an excerpt becoming a moral spark, while factual and concrete descriptions tend to inhibit this effect.
Abstract
Given the increasing realism of social interactions online, social media offers an unprecedented avenue to evaluate real-life moral scenarios. We examine posts from Reddit, where authors and commenters share their moral judgments on who is blameworthy. We employ computational techniques to investigate factors influencing moral judgments, including (1) events activating social commonsense and (2) linguistic signals. To this end, we focus on excerpt-which we term moral sparks-from original posts that commenters include to indicate what motivates their moral judgments. By examining over 24,672 posts and 175,988 comments, we find that event-related negative personal traits (e.g., immature and rude) attract attention and stimulate blame, implying a dependent relationship between moral sparks and blameworthiness. Moreover, language that impacts commenters' cognitive processes to depict events and characters enhances the probability of an excerpt become a moral spark, while factual and concrete descriptions tend to inhibit this effect.
Overview of the CLAIMSCAN-2023: Uncovering Truth in Social Media through Claim Detection and Identification of Claim Spans
methods: The work presents CLAIMSCAN, comprising two tasks: Task A, determining whether a social media post constitutes a claim, and Task B, identifying the words or phrases within the post that form the claim.
results: CLAIMSCAN was presented at the 2023 Forum for Information Retrieval Evaluation (FIRE'2023); Task A received 40 registrations and Task B drew 28 participating teams, underscoring the importance and applicability of the problem in today's digital era.
Abstract
The rapid development of online social media platforms has enabled a significant increase in content creation and information exchange, which has been highly beneficial. However, these platforms have also become a haven for those who disseminate false information, propaganda, and fake news. Claims are essential in forming our perceptions of the world, but sadly, they are frequently used by spreaders of false information to trick people. To address this problem, social media giants employ content moderators to filter fake news out of genuine content. However, the sheer volume of information makes it difficult to identify fake news effectively. Therefore, it has become crucial to automatically identify social media posts that make such claims, check their veracity, and differentiate between credible and false claims. In response, we presented CLAIMSCAN in the 2023 Forum for Information Retrieval Evaluation (FIRE'2023). The primary objectives centered on two crucial tasks: Task A, determining whether a social media post constitutes a claim, and Task B, precisely identifying the words or phrases within the post that form the claim. Task A received 40 registrations, demonstrating strong interest and engagement in this timely challenge. Meanwhile, Task B attracted participation from 28 teams, highlighting its significance in the digital era of misinformation.
Summary
The rapid growth of online social media platforms has enabled a significant increase in content creation and information exchange, which has been highly beneficial to society. However, these platforms have also become channels for false information, propaganda, and fake news. Claims are central to how we form our perception of the world, yet they are frequently used by spreaders of false information to mislead people. To counter this, social media companies employ content moderators to filter out fake news, but the sheer volume of information makes effective verification difficult. It has therefore become essential to automatically identify social media posts that make claims, verify their veracity, and separate credible claims from false ones. To this end, we presented CLAIMSCAN at the 2023 Forum for Information Retrieval Evaluation (FIRE'2023). It targets two key tasks: Task A, deciding whether a social media post constitutes a claim, and Task B, precisely identifying the words or phrases within the post that form the claim. Task A received 40 registrations, reflecting strong interest and engagement in this timely challenge, while Task B attracted 28 teams, highlighting its significance in the digital era of misinformation.
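Task B is naturally framed as token-level span tagging; the sketch below shows one common way to set it up with a BIO scheme using Hugging Face transformers. The checkpoint and label set are illustrative choices, not the shared task's official baseline, and the untrained classifier head would need fine-tuning on the CLAIMSCAN data before its predictions mean anything.

```python
# Minimal sketch of claim-span identification (Task B) as BIO token
# classification. The checkpoint and label set are assumptions for
# illustration only.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-CLAIM", "I-CLAIM"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

post = "Breaking: drinking hot water cures the flu, doctors confirm."
enc = tokenizer(post, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits            # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)[0].tolist()

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
claim_tokens = [t for t, p in zip(tokens, pred) if labels[p] != "O"]
print(claim_tokens)  # meaningless until the classification head is fine-tuned
```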
M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
methods: The paper uses an automatic approach (requiring only negligible human annotation) to convert short-sequence tasks into long-sequence scenarios, so that LLMs can be evaluated across multiple abilities.
results: Current LLMs struggle to understand long contexts, particularly when a task requires attention over multiple spans; semantic retrieval tasks are more difficult even for capable LLMs; and models fine-tuned on longer text with position interpolation perform comparably to models using NTK-aware scaling without fine-tuning.
Abstract
Managing long sequences has become an important and necessary feature for large language models (LLMs). However, it is still an open question of how to comprehensively and systematically evaluate the long-sequence capability of LLMs. One of the reasons is that conventional and widely-used benchmarks mainly consist of short sequences. In this paper, we propose M4LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. M4LE is based on a diverse NLP task pool comprising 36 NLP datasets, 11 task types and 12 domains. To alleviate the scarcity of tasks with naturally long sequences and incorporate multiple-ability assessment, we propose an automatic approach (but with negligible human annotations) to convert short-sequence tasks into a unified long-sequence scenario where LLMs have to identify single or multiple relevant spans in long contexts based on explicit or semantic hints. Specifically, the scenario includes five different types of abilities: (1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span; (4) semantic multiple-span; and (5) global context understanding. The resulting samples in M4LE are evenly distributed from 1k to 8k input length. We conducted a systematic evaluation on 11 well-established LLMs, especially those optimized for long-sequence inputs. Our results reveal that: 1) Current LLMs struggle to understand long context, particularly when tasks require multiple-span attention. 2) Semantic retrieval task is more difficult for competent LLMs. 3) Models fine-tuned on longer text with position interpolation have comparable performance to those using Neural Tangent Kernel (NTK) aware scaling methods without fine-tuning. We make our benchmark publicly available to encourage future research in this challenging area.
Summary
Handling long sequences has become an important and necessary capability for large language models (LLMs), yet how to evaluate long-sequence ability comprehensively and systematically remains an open question, partly because widely used benchmarks consist mainly of short sequences. This paper proposes M4LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for long-context evaluation. M4LE builds on a diverse NLP task pool covering 36 datasets, 11 task types, and 12 domains. To compensate for the scarcity of naturally long-sequence tasks and to assess multiple abilities, it uses an automatic procedure (with negligible human annotation) that converts short-sequence tasks into a unified long-sequence scenario in which LLMs must identify one or several relevant spans in a long context based on explicit or semantic cues. The scenario covers five ability types: (1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span; (4) semantic multiple-span; and (5) global context understanding. Samples in M4LE are evenly distributed over input lengths from 1k to 8k. A systematic evaluation of 11 well-established LLMs, especially those optimized for long inputs, shows that: 1) current LLMs struggle with long contexts, particularly when multiple-span attention is required; 2) semantic retrieval tasks are harder even for capable LLMs; and 3) models fine-tuned on longer text with position interpolation perform comparably to models using NTK-aware scaling without fine-tuning. The benchmark is released publicly to encourage future research in this challenging area.
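The short-to-long conversion described above can be pictured with a minimal sketch: a labeled passage from a short-sequence task is mixed with distractor passages, and the model must name the passage relevant to the query (the explicit single-span setting). The prompt format and sampling scheme are illustrative assumptions, not the benchmark's exact construction.

```python
import random

# Minimal sketch of turning a short-sequence example into a long-context
# single-span retrieval sample, in the spirit of M4LE. Prompt wording and
# distractor sampling are illustrative assumptions.

def build_long_context_sample(target, distractors, query, n_distractors=20):
    passages = random.sample(distractors, n_distractors) + [target]
    random.shuffle(passages)
    context = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    gold_id = passages.index(target) + 1
    prompt = (
        f"{context}\n\n"
        f"Question: {query}\n"
        f"Answer with the number of the single relevant passage."
    )
    return prompt, gold_id

target = "The Amazon is the largest rainforest on Earth."
distractors = [f"Filler sentence number {i} about an unrelated topic." for i in range(100)]
prompt, gold = build_long_context_sample(target, distractors, "Which passage mentions the Amazon?")
print(gold)
```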
Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective
results: Most closed-source LLMs perform better overall, but much smaller open-source models such as LLaMA-2 (7B and 13B) can reach performance comparable to the large closed-source models even in zero-shot settings. Considering the privacy concerns of closed-source models, which are accessible only via API, and the high cost of using their fine-tuned versions, open-source models that achieve competitive performance are more advantageous for practical use. Balancing performance, cost, and privacy, the LLaMA-2-7B model looks the most promising for real-world applications.
Abstract
This paper studies how to effectively build meeting summarization systems for real-world usage using large language models (LLMs). For this purpose, we conduct an extensive evaluation and comparison of various closed-source and open-source LLMs, namely GPT-4, GPT-3.5, PaLM-2, and LLaMA-2. Our findings reveal that most closed-source LLMs are generally better in terms of performance. However, much smaller open-source models like LLaMA-2 (7B and 13B) can still achieve performance comparable to the large closed-source models even in zero-shot scenarios. Considering the privacy concerns of closed-source models, which are accessible only via API, alongside the high cost associated with using their fine-tuned versions, the open-source models that can achieve competitive performance are more advantageous for industrial use. Balancing performance with associated costs and privacy concerns, the LLaMA-2-7B model looks more promising for industrial usage. In sum, this paper offers practical insights on using LLMs for real-world business meeting summarization, shedding light on the trade-offs between performance and cost.
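A zero-shot setup of the kind compared above can be sketched as a simple prompt to an open-source instruction-tuned model; the checkpoint name (a gated Hugging Face repository requiring access approval) and the prompt wording are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal zero-shot meeting-summarization sketch with an open-source LLM.
# Checkpoint and prompt are illustrative assumptions only.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed checkpoint; gated on the Hub
    device_map="auto",
)

transcript = "Alice: Let's ship the beta Friday. Bob: QA needs two more days. ..."
prompt = (
    "Summarize the following meeting transcript in 3-5 bullet points, "
    "covering decisions and action items.\n\n"
    f"{transcript}\n\nSummary:"
)
out = generator(prompt, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][len(prompt):])
```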
results: Experiments show that using tropical characteristics to select adapter parameters for pruning is more effective than the magnitude-based baseline, while a combined approach works best across tasks.
Abstract
Adapters are widely popular parameter-efficient transfer learning approaches in natural language processing that insert trainable modules in between layers of a pre-trained language model. Apart from several heuristics, however, there has been a lack of studies analyzing the optimal number of adapter parameters needed for downstream applications. In this paper, we propose an adapter pruning approach by studying the tropical characteristics of trainable modules. We cast it as an optimization problem that aims to prune parameters from the adapter layers without changing the orientation of underlying tropical hypersurfaces. Our experiments on five NLP datasets show that tropical geometry tends to identify more relevant parameters to prune when compared with the magnitude-based baseline, while a combined approach works best across the tasks.
Summary
Adapters are a widely used parameter-efficient transfer learning approach in natural language processing that inserts trainable modules between the layers of a pre-trained language model. Beyond a few heuristics, however, there has been little analysis of how many adapter parameters downstream applications actually need. This paper proposes an adapter pruning approach based on the tropical characteristics of the trainable modules, cast as an optimization problem that prunes parameters from the adapter layers without changing the orientation of the underlying tropical hypersurfaces. Experiments on five NLP datasets show that tropical geometry tends to identify more relevant parameters to prune than the magnitude-based baseline, while a combined approach works best across tasks.
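The magnitude-based baseline mentioned above is straightforward to sketch: zero out the smallest-magnitude weights inside each adapter's projection layers. The adapter definition and pruning ratio below are illustrative assumptions, and the paper's tropical-geometry criterion is not reproduced here.

```python
import torch
import torch.nn as nn

# Sketch of the magnitude-based baseline for adapter pruning: zero out the
# smallest-magnitude weights in an adapter's down/up projections, in place.

class Adapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

def magnitude_prune(adapter: Adapter, ratio: float = 0.5) -> None:
    """Zero the `ratio` fraction of smallest-magnitude weights in place."""
    for layer in (adapter.down, adapter.up):
        w = layer.weight.data
        k = int(w.numel() * ratio)
        if k == 0:
            continue
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).float())

adapter = Adapter()
magnitude_prune(adapter, ratio=0.5)
sparsity = (adapter.down.weight == 0).float().mean().item()
print(f"down-projection sparsity after pruning: {sparsity:.2f}")
```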
LitCab: Lightweight Calibration of Language Models on Outputs of Varied Lengths
paper_authors: Xin Liu, Muhammad Khalifa, Lu Wang
for: This paper proposes a lightweight calibration mechanism to improve the calibration of language models (LMs).
methods: It uses a single linear layer that takes the input text representation and adjusts the LM's output logits.
results: LitCab improves LM calibration while adding fewer than 2% of the original model parameters, reducing the average ECE score by 20% across 7 text generation tasks. An evaluation of 7 popular open-source LMs from the GPT and LLaMA families yields the following key findings: (1) larger models within the same family are better calibrated on tasks with short generations, but not necessarily on longer ones; (2) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models despite having far fewer parameters; (3) fine-tuning a pretrained model (e.g., LLaMA) on samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of the fine-tuning setup for calibrating LMs.
Abstract
A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of LMs, as well as in building more trustworthy models. Yet, popular neural model calibration techniques are not well-suited for LMs due to their lack of flexibility in discerning answer correctness and their high computational costs. For instance, post-processing methods like temperature scaling are often unable to reorder the candidate generations. Moreover, training-based methods require finetuning the entire model, which is impractical due to the increasing sizes of modern LMs. In this paper, we present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and manipulates the LM output logits. LitCab improves model calibration while adding < 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of 7 text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by 20%. We further conduct a comprehensive evaluation with 7 popular open-sourced LMs from the GPT and LLaMA families, yielding the following key findings: (1) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily for longer ones. (2) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models despite having far fewer parameters. (3) Finetuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of finetuning setups for calibrating LMs.
Summary
A model is considered well calibrated when its probability estimates align with the actual likelihood that its output is correct. Calibrating language models (LMs) is crucial: it plays a central role in detecting and mitigating hallucinations, a common problem of LMs, and in building more trustworthy models. However, popular neural calibration techniques are not well suited to LMs because they lack the flexibility to judge answer correctness and have high computational cost. Post-processing methods such as temperature scaling often cannot reorder candidate generations, while training-based methods require fine-tuning the entire model, which is impractical given the growing size of modern LMs. This paper presents LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and adjusts the LM's output logits, improving calibration while adding less than 2% of the original model parameters. For evaluation, the authors construct CaT, a benchmark of text generation tasks covering responses from short phrases to paragraphs. With Llama2-7B, LitCab improves calibration on all tasks, reducing the average ECE score by 20%. A broader evaluation of 7 popular open-source LMs from the GPT and LLaMA families yields three key findings: (1) larger models within the same family are better calibrated on short-generation tasks, but not necessarily on longer ones; (2) GPT-family models are better calibrated than LLaMA, Llama2, and Vicuna models despite having far fewer parameters; and (3) fine-tuning a pretrained model (e.g., LLaMA) on narrow-purpose samples (e.g., conversations) can worsen calibration, underscoring the importance of the fine-tuning setup for calibrating LMs.
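The mechanism can be pictured with a minimal sketch: a single linear layer maps the input representation to a per-vocabulary logit offset added to the LM's logits, and calibration is measured with expected calibration error (ECE). The module shape, frozen-LM assumption, and binning below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LitCab-style calibration head plus an ECE computation.
# Shapes, training objective, and binning are illustrative assumptions.

class CalibrationHead(nn.Module):
    """Single linear layer mapping the input representation to a logit offset."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, lm_logits: torch.Tensor, input_repr: torch.Tensor) -> torch.Tensor:
        # lm_logits: (batch, vocab); input_repr: (batch, hidden); base LM stays frozen.
        return lm_logits + self.proj(input_repr)

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: weighted average |accuracy - confidence| over equal-width bins."""
    confidences = torch.as_tensor(confidences, dtype=torch.float)
    correct = torch.as_tensor(correct, dtype=torch.float)
    edges = torch.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = (correct[mask].mean() - confidences[mask].mean()).abs()
            ece += mask.float().mean().item() * gap.item()
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.55], [1, 0, 1, 0]))
```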