results: The study finds that XDLM outperforms both diffusion-based and Transformer-based baselines on machine translation benchmarks.
Abstract
Recently, diffusion models have excelled in image generation tasks and have also been applied to natural language processing (NLP) for controllable text generation. However, the application of diffusion models in a cross-lingual setting is largely unexplored. Additionally, while pretraining with diffusion models has been studied within a single language, the potential of cross-lingual pretraining remains understudied. To address these gaps, we propose XDLM, a novel cross-lingual diffusion model for machine translation, consisting of pretraining and fine-tuning stages. In the pretraining stage, we propose TLDM, a new training objective for mastering the mapping between different languages; in the fine-tuning stage, we build up the translation system based on the pretrained model. We evaluate our approach on several machine translation benchmarks and outperform both diffusion and Transformer baselines.
Holistic Exploration on Universal Decompositional Semantic Parsing: Architecture, Data Augmentation, and LLM Paradigm
results: Experiments show that, compared to prior models, the approach significantly reduces inference time while maintaining performance. Investigations of different data augmentation methods reveal that ChatGPT excels at attribute parsing but struggles with relation parsing, and that using ChatGPT for data augmentation yields suboptimal results.
Abstract
In this paper, we conduct a holistic exploration of Universal Decompositional Semantic (UDS) parsing. We first introduce a cascade model for UDS parsing that decomposes the complex parsing task into semantically appropriate subtasks. Our approach outperforms prior models while significantly reducing inference time. We also incorporate syntactic information and further optimize the architecture. In addition, we explore different data augmentation methods, which further improve UDS parsing. Lastly, we conduct experiments to investigate the efficacy of ChatGPT in handling the UDS task, revealing that it excels in attribute parsing but struggles in relation parsing, and that using ChatGPT for data augmentation yields suboptimal results. Our code is available at https://github.com/hexuandeng/HExp4UDS.
Towards Resolving Word Ambiguity with Word Embeddings
paper_authors: Matthias Thurnbauer, Johannes Reisinger, Christoph Goller, Andreas Fischer
for: This paper aims to address the problem of ambiguity in natural language processing, specifically in the context of word embeddings and information retrieval tasks.
methods: The authors propose using DBSCAN clustering to identify ambiguous words and evaluate their level of ambiguity in the latent space. They also propose an automatic parameter selection method for DBSCAN to ensure high-quality clusters.
results: The authors show that their approach can identify ambiguous words and evaluate their level of ambiguity, and that the resulting clusters are semantically coherent and correspond well to the perceived meanings of the words.
Abstract
Ambiguity is ubiquitous in natural language. Resolving ambiguous meanings is especially important in information retrieval tasks. While word embeddings carry semantic information, they fail to handle ambiguity well. Transformer models have been shown to handle word ambiguity for complex queries, but they cannot be used to identify ambiguous words, e.g. for a 1-word query. Furthermore, training these models is costly in terms of time, hardware resources, and training data, prohibiting their use in specialized environments with sensitive data. Word embeddings can be trained using moderate hardware resources. This paper shows that applying DBSCAN clustering to the latent space can identify ambiguous words and evaluate their level of ambiguity. An automatic DBSCAN parameter selection leads to high-quality clusters, which are semantically coherent and correspond well to the perceived meanings of a given word.
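A minimal sketch of the core idea, applying scikit-learn's DBSCAN to pre-computed context-embedding vectors of a single word and treating multiple clusters as a sign of ambiguity; the median-neighbor-distance `eps` heuristic and the toy "senses" below are illustrative assumptions, not the authors' exact parameter-selection method.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def ambiguity_of(vectors: np.ndarray, min_pts: int = 5) -> int:
    """Cluster embedding vectors of one word; >1 cluster suggests ambiguity."""
    # Crude automatic eps selection: median distance to the min_pts-th
    # neighbor (a stand-in for the paper's automatic parameter selection).
    nn = NearestNeighbors(n_neighbors=min_pts).fit(vectors)
    dists, _ = nn.kneighbors(vectors)
    eps = float(np.median(dists[:, -1]))
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(vectors)
    return len(set(labels) - {-1})  # number of clusters, noise excluded

# Toy demo: two artificial "senses" of a word in a 50-d latent space.
rng = np.random.default_rng(0)
sense_a = rng.normal(0.0, 0.1, size=(40, 50))
sense_b = rng.normal(1.0, 0.1, size=(40, 50))
print(ambiguity_of(np.vstack([sense_a, sense_b])))  # -> 2 clusters
```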
Embedding Models for Supervised Automatic Extraction and Classification of Named Entities in Scientific Acknowledgements
methods: We use the Flair NLP framework for the named entity recognition (NER) task, training three default Flair NER models on four differently-sized training corpora with different versions of the Flair NLP framework.
results: The Flair Embeddings model trained on the medium-sized corpus with the latest version of the Flair NLP framework performs best, with an accuracy of 0.79. Expanding the training corpus from very small to medium size massively increases the accuracy of all training algorithms, but further expansion brings no additional improvement. The model recognizes six entity types: funding agency, grant number, individual, university, corporation, and miscellaneous. It is more precise for some entity types than others; individuals and grant numbers both reach F1 scores above 0.9.
Abstract
Acknowledgments in scientific papers may give an insight into aspects of the scientific community, such as reward systems, collaboration patterns, and hidden research trends. The aim of the paper is to evaluate the performance of different embedding models for the task of automatic extraction and classification of acknowledged entities from the acknowledgment text in scientific papers. We trained and implemented a named entity recognition (NER) task using the Flair NLP framework. The training was conducted using three default Flair NER models with four differently-sized corpora and different versions of the Flair NLP framework. The Flair Embeddings model trained on the medium corpus with the latest FLAIR version showed the best accuracy of 0.79. Expanding the size of a training corpus from very small to medium size massively increased the accuracy of all training algorithms, but further expansion of the training corpus did not bring further improvement. Moreover, the performance of the model slightly deteriorated. Our model is able to recognize six entity types: funding agency, grant number, individuals, university, corporation, and miscellaneous. The model works more precisely for some entity types than for others; thus, individuals and grant numbers showed a very good F1-Score over 0.9. Most of the previous works on acknowledgment analysis were limited by the manual evaluation of data and therefore by the amount of processed data. This model can be applied for the comprehensive analysis of acknowledgment texts and may potentially make a great contribution to the field of automated acknowledgment analysis.
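As a rough illustration of the training recipe described above, the sketch below follows Flair's standard NER tutorial; the corpus paths and hyperparameters are placeholder assumptions, and exact API details vary slightly across Flair versions.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style files with one token per line: "token  ner-tag".
corpus = ColumnCorpus(
    "data/acknowledgements/", {0: "text", 1: "ner"},
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)
label_dict = corpus.make_label_dictionary(label_type="ner")

# Stacked Flair + classic word embeddings, as in the Flair Embeddings model.
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])
tagger = SequenceTagger(
    hidden_size=256, embeddings=embeddings,
    tag_dictionary=label_dict, tag_type="ner",
)
ModelTrainer(tagger, corpus).train(
    "models/ack-ner", learning_rate=0.1, mini_batch_size=32, max_epochs=150,
)
```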
Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers
paper_authors: Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, Michalis Vazirgiannis
for: This paper proposes Prot2Text, a new method that predicts a protein's function in free-text form.
methods: The paper combines Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework to generate textual predictions of protein function.
results: Experiments show that Prot2Text accurately predicts protein function and generates detailed textual descriptions. The results highlight the transformative potential of multimodal models, in particular the fusion of GNNs and LLMs, as powerful tools for protein function prediction.
Abstract
The complex nature of big biological systems has pushed some scientists to classify their understanding as an inconceivable mission. Challenges at different levels complicate this task, one of which is the prediction of a protein's function. In recent years, significant progress has been made in this field through the development of various machine learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types including proteins' sequences, structures, and textual annotations. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate prediction of proteins' functions. The code, the models and a demo will be publicly released.
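A highly simplified sketch of the encoder-decoder idea in plain PyTorch: a one-layer graph encoder pools residue-level node features into a protein embedding that conditions a Transformer decoder as a one-token "memory". The dimensions, toy graph convolution, and vocabulary are illustrative assumptions; the actual Prot2Text architecture is considerably more elaborate.

```python
import torch
import torch.nn as nn

class ProteinToTextSketch(nn.Module):
    def __init__(self, node_dim=64, d_model=128, vocab_size=1000):
        super().__init__()
        self.gnn = nn.Linear(node_dim, d_model)       # toy graph "convolution"
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, node_feats, adj, token_ids):
        # One round of neighborhood averaging, then mean-pool to one vector.
        h = torch.relu(self.gnn(adj @ node_feats))    # (nodes, d_model)
        protein = h.mean(dim=0, keepdim=True)         # (1, d_model)
        memory = protein.unsqueeze(0)                 # (1, 1, d_model) prefix
        tgt = self.embed(token_ids).unsqueeze(0)      # (1, seq, d_model)
        dec = self.decoder(tgt, memory)
        return self.out(dec)                          # next-token logits

# Toy usage: 10 residues, 5 decoded tokens.
model = ProteinToTextSketch()
nodes = torch.randn(10, 64)
adj = torch.eye(10)                                   # placeholder adjacency
logits = model(nodes, adj, torch.randint(0, 1000, (5,)))
print(logits.shape)                                   # torch.Size([1, 5, 1000])
```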
Improving the Generalization Ability in Essay Coherence Evaluation through Monotonic Constraints
results: The proposed model generalizes better to unseen data. It achieved third place in track 1 of NLPCC 2023 shared task 7; the paper also briefly introduces the solutions for the remaining tracks, which achieved second place in track 2 and first place in both track 3 and track 4.
Abstract
Coherence is a crucial aspect of evaluating text readability and can be assessed through two primary factors when evaluating an essay in a scoring scenario. The first factor is logical coherence, characterized by the appropriate use of discourse connectives and the establishment of logical relationships between sentences. The second factor is the appropriateness of punctuation, as inappropriate punctuation can lead to confused sentence structure. To address these concerns, we propose a coherence scoring model consisting of a regression model with two feature extractors: a local coherence discriminative model and a punctuation correction model. We employ gradient-boosting regression trees as the regression model and impose monotonicity constraints on the input features. The results show that our proposed model better generalizes unseen data. The model achieved third place in track 1 of NLPCC 2023 shared task 7. Additionally, we briefly introduce our solution for the remaining tracks, which achieves second place for track 2 and first place for both track 3 and track 4.
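The monotonicity idea can be sketched with scikit-learn's histogram gradient boosting (a GBRT variant, standing in here for whatever implementation the authors used), which exposes per-feature monotonic constraints; the two synthetic features below are assumptions representing the local-coherence and punctuation-correction scores.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor  # sklearn >= 1.0

rng = np.random.default_rng(0)
n = 2000
coherence = rng.uniform(0, 1, n)      # stand-in: local coherence score
punct = rng.uniform(0, 1, n)          # stand-in: punctuation correctness score
essay_score = 2 * coherence + punct + rng.normal(0, 0.1, n)

X = np.column_stack([coherence, punct])
# monotonic_cst: +1 forces the prediction to be non-decreasing in each
# feature -- better coherence or punctuation can never lower the score.
model = HistGradientBoostingRegressor(monotonic_cst=[1, 1]).fit(X, essay_score)
print(model.predict([[0.2, 0.5], [0.8, 0.5]]))  # second must be >= first
```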
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
results: Experiments show that the incoherence preprocessing improves several existing quantization algorithms and yields viable LLM quantization results using only two bits per weight. The code is available at https://github.com/jerry-chee/QuIP.
Abstract
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/jerry-chee/QuIP .
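A toy numpy sketch of the incoherence-processing step: rotate a weight matrix by random orthogonal matrices, round to a 2-bit grid, and rotate back. The uniform grid and plain round-to-nearest step stand in for QuIP's adaptive rounding and are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # QR of a Gaussian matrix yields a Haar-random orthogonal matrix.
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

def quantize_2bit(w):
    # Uniform 4-level (2-bit) grid spanning the weight range.
    levels = np.linspace(w.min(), w.max(), 4)
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

W = rng.normal(size=(64, 64))
U, V = random_orthogonal(64), random_orthogonal(64)

W_rot = U @ W @ V.T                      # incoherence processing
W_hat = U.T @ quantize_2bit(W_rot) @ V   # quantize, then undo the rotation

naive = quantize_2bit(W)
print("rotated-quant error:", np.linalg.norm(W - W_hat))
print("naive-quant error:  ", np.linalg.norm(W - naive))
```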
results: The study finds significant differences in user behavior and satisfaction under different search intents. Furthermore, the taxonomy can be applied to various downstream legal retrieval tasks, such as result ranking and satisfaction prediction.
Abstract
Legal case retrieval is a special Information Retrieval (IR) task focusing on legal case documents. Depending on the downstream tasks of the retrieved case documents, users' information needs in legal case retrieval could be significantly different from those in Web search and traditional ad-hoc retrieval tasks. While there are several studies that retrieve legal cases based on text similarity, the underlying search intents of legal retrieval users, as shown in this paper, are more complicated than that yet mostly unexplored. To this end, we present a novel hierarchical intent taxonomy of legal case retrieval. It consists of five intent types categorized by three criteria, i.e., search for Particular Case(s), Characterization, Penalty, Procedure, and Interest. The taxonomy was constructed transparently and evaluated extensively through interviews, editorial user studies, and query log analysis. Through a laboratory user study, we reveal significant differences in user behavior and satisfaction under different search intents in legal case retrieval. Furthermore, we apply the proposed taxonomy to various downstream legal retrieval tasks, e.g., result ranking and satisfaction prediction, and demonstrate its effectiveness. Our work provides important insights into the understanding of user intents in legal case retrieval and potentially leads to better retrieval techniques in the legal domain, such as intent-aware ranking strategies and evaluation methodologies.
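For downstream use, the five intent types could be represented as a simple enumeration, sketched below; the class name, values, and the feature-branching example are illustrative assumptions, not artifacts released with the paper.

```python
from enum import Enum

class LegalSearchIntent(Enum):
    """The five intent types of the proposed taxonomy."""
    PARTICULAR_CASE = "search for particular case(s)"
    CHARACTERIZATION = "characterization"
    PENALTY = "penalty"
    PROCEDURE = "procedure"
    INTEREST = "interest"

# E.g., an intent-aware ranker could branch on the predicted intent:
def ranking_features(intent: LegalSearchIntent) -> list[str]:
    if intent is LegalSearchIntent.PENALTY:
        return ["sentence_severity", "statute_overlap"]  # hypothetical features
    return ["text_similarity"]

print(ranking_features(LegalSearchIntent.PENALTY))
```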
Schema-Driven Actionable Insight Generation and Smart Recommendation
results: The method can rank the generated insights according to user feedback so that they align with user interests. The paper presents preliminary qualitative results of the insights generated with this technique.
Abstract
In natural language generation (NLG), insight mining is seen as a data-to-text task, where data is mined for interesting patterns and verbalised into 'insight' statements. An 'over-generate and rank' paradigm is intuitively used to generate such insights. The multidimensionality and subjectivity of this process make it challenging. This paper introduces a schema-driven method to generate actionable insights from data to drive growth and change. It also introduces a technique to rank the insights to align with user interests based on their feedback. We show preliminary qualitative results of the insights generated using our technique and demonstrate its ability to adapt to feedback.
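A minimal sketch of the 'over-generate and rank' loop described above: mine simple deviations from a table, verbalise them, and rank them with feedback-updated weights. The pattern miner, scoring rule, and column names are all illustrative assumptions.

```python
import pandas as pd

def generate_insights(df: pd.DataFrame) -> list[dict]:
    """Over-generate: one candidate insight per metric/segment deviation."""
    insights = []
    for metric in df.select_dtypes("number").columns:
        overall = df[metric].mean()
        for segment, grp in df.groupby("region"):
            delta = grp[metric].mean() - overall
            insights.append({
                "text": f"{segment}: {metric} is {delta:+.1f} vs. overall average",
                "schema": ("region", metric),
                "magnitude": abs(delta),
            })
    return insights

def rank(insights, schema_weights):
    """Rank: magnitude scaled by a per-schema weight learned from feedback."""
    return sorted(insights,
                  key=lambda i: i["magnitude"] * schema_weights.get(i["schema"], 1.0),
                  reverse=True)

df = pd.DataFrame({"region": ["N", "N", "S", "S"], "sales": [10, 12, 3, 5]})
weights = {("region", "sales"): 1.5}   # boosted after positive user feedback
for ins in rank(generate_insights(df), weights)[:2]:
    print(ins["text"])
```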
for: This study examines whether automatic math word problem solvers follow the semantic logic of the problem text.
methods: The study removes parts of the input and measures model performance on the perturbed dataset to determine whether the models merely match superficial patterns in the text.
results: The results show that removing many words from the input does not affect the models' ability to solve the problems, and that they can still produce correct answers for nonsense questions. This indicates that automatic math word problem solvers may match surface patterns of the text rather than follow its semantic logic.
Abstract
Automated math word problem solvers based on neural networks have successfully managed to obtain 70-80\% accuracy in solving arithmetic word problems. However, it has been shown that these solvers may rely on superficial patterns to obtain their equations. In order to determine what information math word problem solvers use to generate solutions, we remove parts of the input and measure the model's performance on the perturbed dataset. Our results show that the model is not sensitive to the removal of many words from the input and can still manage to find a correct answer when given a nonsense question. This indicates that automatic solvers do not follow the semantic logic of math word problems, and may be overfitting to the presence of specific words.
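The perturbation protocol can be sketched in a few lines: drop a fraction of input words and check whether a solver still returns the original answer. The `solve` callable is a placeholder for any trained solver; the drop rate and the toy solver are illustrative choices.

```python
import random
from typing import Callable

def perturbation_accuracy(problems: list[tuple[str, str]],
                          solve: Callable[[str], str],
                          drop_rate: float = 0.5,
                          seed: int = 0) -> float:
    """Fraction of problems still solved after randomly deleting words."""
    rng = random.Random(seed)
    correct = 0
    for question, answer in problems:
        words = question.split()
        kept = [w for w in words if rng.random() > drop_rate]
        perturbed = " ".join(kept) or words[0]   # never feed an empty string
        correct += solve(perturbed) == answer
    return correct / len(problems)

# Toy solver that keys on a single surface pattern, mimicking overfitting:
def toy_solve(q: str) -> str:
    return "8" if "altogether" in q else "?"

data = [("Tom has 5 apples and buys 3 more, how many altogether?", "8")]
print(perturbation_accuracy(data, toy_solve, drop_rate=0.5))
```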
Evaluating the Ripple Effects of Knowledge Editing in Language Models
paper_authors: Roi Cohen, Eden Biran, Ori Yoran, Amir Globerson, Mor Geva
for: This paper focuses on the problem of knowledge updating in modern language models.
methods: The paper proposes a new set of evaluation criteria for assessing the effect of editing methods on the model's knowledge.
results: The study finds that current editing methods usually fail to introduce consistent changes in the model's knowledge, while a simple in-context editing baseline achieves the best scores.
Abstract
Modern language models capture a large body of factual knowledge. However, some facts can be incorrectly induced or become obsolete over time, resulting in factually incorrect generations. This has led to the development of various editing methods that allow updating facts encoded by the model. Evaluation of these methods has primarily focused on testing whether an individual fact has been successfully injected, and if similar predictions for other subjects have not changed. Here we argue that such evaluation is limited, since injecting one fact (e.g. "Jack Depp is the son of Johnny Depp") introduces a "ripple effect" in the form of additional facts that the model needs to update (e.g. "Jack Depp is the sibling of Lily-Rose Depp"). To address this issue, we propose a novel set of evaluation criteria that consider the implications of an edit on related facts. Using these criteria, we then construct RippleEdits, a diagnostic benchmark of 5K factual edits, capturing a variety of types of ripple effects. We evaluate prominent editing methods on RippleEdits, showing that current methods fail to introduce consistent changes in the model's knowledge. In addition, we find that a simple in-context editing baseline obtains the best scores on our benchmark, suggesting a promising research direction for model editing.
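A small sketch of how a ripple-style test case might be encoded and checked, using the paper's Depp example; the record layout and the `query_model` placeholder are assumptions for illustration, not the benchmark's released format.

```python
from typing import Callable

# One benchmark-style record: an injected edit plus the facts it implies.
edit = {
    "edit": ("Jack Depp", "son of", "Johnny Depp"),
    "ripples": [
        # Facts implied by the edit (hypothetical record layout):
        ("Jack Depp", "sibling of", "Lily-Rose Depp"),
        ("Johnny Depp", "parent of", "Jack Depp"),
    ],
}

def ripple_consistency(query_model: Callable[[str, str], str], case: dict) -> float:
    """Share of implied facts the edited model answers consistently."""
    hits = sum(query_model(subj, rel) == obj
               for subj, rel, obj in case["ripples"])
    return hits / len(case["ripples"])

# `query_model(subject, relation) -> object` is a placeholder for probing
# the post-edit model, e.g. via a cloze prompt.
print(ripple_consistency(lambda s, r: {"sibling of": "Lily-Rose Depp",
                                       "parent of": "Jack Depp"}[r], edit))
```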
Leveraging Label Variation in Large Language Models for Zero-Shot Text Classification
results: The paper finds that while performance varies across tasks, data, and languages, aggregation techniques designed for human annotators perform substantially better than any individual model. However, LLMs still cannot rival human annotators, so they cannot yet fully replace human annotation.
Abstract
The zero-shot learning capabilities of large language models (LLMs) make them ideal for text classification without annotation or supervised training. Many studies have shown impressive results across multiple tasks. While tasks, data, and results differ widely, their similarities to human annotation can aid us in tackling new tasks with minimal expenses. We evaluate using 5 state-of-the-art LLMs as "annotators" on 5 different tasks (age, gender, topic, sentiment prediction, and hate speech detection), across 4 languages: English, French, German, and Spanish. No single model excels at all tasks, across languages, or across all labels within a task. However, aggregation techniques designed for human annotators perform substantially better than any one individual model. Overall, though, LLMs do not rival even simple supervised models, so they do not (yet) replace the need for human annotation. We also discuss the tradeoffs between speed, accuracy, cost, and bias when it comes to aggregated model labeling versus human annotation.
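Aggregation across model "annotators" can be as simple as a majority vote, sketched below; the model names and labels are made up for the example, and aggregators built for human annotators (e.g. MACE-style models) additionally weight annotators by estimated reliability.

```python
from collections import Counter

def majority_vote(labels_per_model: dict[str, list[str]]) -> list[str]:
    """Aggregate per-item labels from several LLM 'annotators'."""
    n_items = len(next(iter(labels_per_model.values())))
    aggregated = []
    for i in range(n_items):
        votes = Counter(labels[i] for labels in labels_per_model.values())
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated

# Hypothetical zero-shot sentiment labels from five models on three texts:
labels = {
    "model_a": ["pos", "neg", "neg"],
    "model_b": ["pos", "neg", "pos"],
    "model_c": ["neg", "neg", "neg"],
    "model_d": ["pos", "pos", "neg"],
    "model_e": ["pos", "neg", "neg"],
}
print(majority_vote(labels))  # ['pos', 'neg', 'neg']
```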
Aligning Large Language Models with Human: A Survey
for: This paper provides a comprehensive overview of alignment technologies for large language models (LLMs) to better suit human-oriented tasks and expectations.
methods: The paper reviews various training methodologies for LLM alignment, including supervised fine-tuning, online and offline human preference training, and parameter-efficient training mechanisms.
results: The paper evaluates the effectiveness of human-aligned LLMs using a multifaceted approach and highlights several promising future research avenues in the field.
Abstract
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks. Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or factually incorrect (hallucinated) information. Hence, aligning LLMs with human expectations has become an active area of interest within the research community. This survey presents a comprehensive overview of these alignment technologies, including the following aspects. (1) Data collection: the methods for effectively collecting high-quality instructions for LLM alignment, including the use of NLP benchmarks, human annotations, and leveraging strong LLMs. (2) Training methodologies: a detailed review of the prevailing training methods employed for LLM alignment. Our exploration encompasses Supervised Fine-tuning, both Online and Offline human preference training, along with parameter-efficient training mechanisms. (3) Model Evaluation: the methods for evaluating the effectiveness of these human-aligned LLMs, presenting a multifaceted approach towards their assessment. In conclusion, we collate and distill our findings, shedding light on several promising future research avenues in the field. This survey, therefore, serves as a valuable resource for anyone invested in understanding and advancing the alignment of LLMs to better suit human-oriented tasks and expectations. An associated GitHub link collecting the latest papers is available at https://github.com/GaryYufei/AlignLLMHumanSurvey.
Boosting Punctuation Restoration with Data Generation and Reinforcement Learning
results: Experiments show that the method achieves state-of-the-art performance on the ASR test sets of two benchmark datasets.
Abstract
Punctuation restoration is an important task in automatic speech recognition (ASR) which aims to restore the syntactic structure of generated ASR texts to improve readability. While punctuated texts are abundant from written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. The experiments show that our method achieves state-of-the-art performance on the ASR test set on two benchmark datasets for punctuation restoration.
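The data-generation side of such a pipeline can be sketched simply: strip punctuation and casing from written text to mimic raw ASR output, yielding (input, target) pairs for training a restorer. The tag scheme and the regex are illustrative assumptions.

```python
import re

PUNCT2TAG = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def make_training_pair(written: str) -> tuple[list[str], list[str]]:
    """Simulate ASR text from a punctuated sentence; label each word with
    the punctuation tag that should follow it (O = none)."""
    tokens = re.findall(r"\w+|[,.?]", written)
    words, tags = [], []
    for tok in tokens:
        if tok in PUNCT2TAG:
            if tags:
                tags[-1] = PUNCT2TAG[tok]   # attach mark to preceding word
        else:
            words.append(tok.lower())       # ASR output is typically uncased
            tags.append("O")
    return words, tags

print(make_training_pair("Well, that worked. Did it?"))
# (['well', 'that', 'worked', 'did', 'it'],
#  ['COMMA', 'O', 'PERIOD', 'O', 'QUESTION'])
```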
The potential of LLMs for coding with low-resource and domain-specific programming languages
results: The study finds that the LLM can be used to write, understand, improve, and document gretl code, including generating descriptive docstrings for functions and providing accurate explanations of econometric code. However, the LLM also shows limitations, such as an inability to improve certain sections of code and to write correct unit tests.
Abstract
This paper presents a study on the feasibility of using large language models (LLM) for coding with low-resource and domain-specific programming languages that typically lack the amount of data required for effective LLM processing techniques. This study focuses on the econometric scripting language named hansl of the open-source software gretl and employs a proprietary LLM based on GPT-3.5. Our findings suggest that LLMs can be a useful tool for writing, understanding, improving, and documenting gretl code, which includes generating descriptive docstrings for functions and providing precise explanations for abstract and poorly documented econometric code. While the LLM showcased promising docstring-to-code translation capability, we also identify some limitations, such as its inability to improve certain sections of code and to write accurate unit tests. This study is a step towards leveraging the power of LLMs to facilitate software development in low-resource programming languages and ultimately to lower barriers to entry for their adoption.
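A sketch of the docstring-generation use case: prompt an LLM with a hansl function and ask for a description. The `complete()` helper is a hypothetical stand-in for whatever GPT-3.5-based endpoint is used, and the hansl snippet is a made-up toy function, not code from the study.

```python
def complete(prompt: str) -> str:
    """Placeholder for a GPT-3.5-style completion call; wire up your
    provider's client here (hypothetical, not a specific API)."""
    raise NotImplementedError("connect an LLM backend")

HANSL_FUNC = """\
function scalar rmse (series y, series yhat)
    return sqrt(mean((y - yhat)^2))
end function
"""

prompt = (
    "You are an expert in gretl's hansl scripting language. "
    "Write a short descriptive docstring (purpose, arguments, return value) "
    "for the following hansl function:\n\n" + HANSL_FUNC
)
# print(complete(prompt))  # -> docstring text from the model
```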