methods: The paper proposes a new method, Intra-document Contrastive Learning with Distillation (ICLD), which exploits the distinctive structural characteristics of news articles for sampling and uses contrastive learning to strengthen the model.
results: Experiments show that ICLD effectively addresses sentence-level discourse classification in news articles and is more effective than conventional supervised learning approaches.
Abstract
News Discourse Profiling seeks to scrutinize the event-related role of each sentence in a news article and has been proven useful across various downstream applications. Specifically, within the context of a given news discourse, each sentence is assigned to a pre-defined category contingent upon its depiction of the news event structure. However, existing approaches suffer from an inadequacy of available human-annotated data, due to the laborious and time-intensive nature of generating discourse-level annotations. In this paper, we present a novel approach, denoted as Intra-document Contrastive Learning with Distillation (ICLD), for addressing the news discourse profiling task, capitalizing on its unique structural characteristics. Notably, we are the first to apply a semi-supervised methodology within this task paradigm, and evaluation demonstrates the effectiveness of the presented approach.
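The abstract does not detail the ICLD objective, but a minimal sketch of what an intra-document contrastive loss with a distillation term could look like is given below; the sampling indices, temperature, and both function names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def intra_document_contrastive_loss(sent_emb, pos_idx, neg_idx, temperature=0.07):
    """InfoNCE-style loss over sentence embeddings from a single article.
    sent_emb: (N, d) sentence embeddings; pos_idx: (N,) positive sentence per
    anchor; neg_idx: (N, k) negative sentences per anchor (sampling assumed)."""
    z = F.normalize(sent_emb, dim=-1)
    pos_sim = (z * z[pos_idx]).sum(-1, keepdim=True) / temperature        # (N, 1)
    neg_sim = torch.einsum('nd,nkd->nk', z, z[neg_idx]) / temperature     # (N, k)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(len(z), dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, labels)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label KL term, one plausible reading of the 'distillation' in ICLD."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction='batchmean') * T * T
```

In practice the two terms would typically be combined with the supervised classification loss through weighting coefficients.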
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
results: Built on LLaMA-2 as the base model, the approach achieves an average improvement of more than 12 BLEU and 12 COMET over zero-shot performance across 10 translation directions on the WMT'21 and WMT'22 test sets. It outperforms all prior work and even surpasses the NLLB-54B model and GPT-3.5-text-davinci-003 with only 7B or 13B parameters, establishing the foundation for a new training paradigm in machine translation.
Abstract
Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially those with moderate model sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of these moderate LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation.
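As a rough illustration of the two-stage recipe (monolingual fine-tuning followed by fine-tuning on a small set of high-quality parallel data), here is a sketch using Hugging Face transformers; the hyperparameters, output directories, and dataset arguments are placeholders, not the values used for ALMA.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_stage(model, tokenizer, dataset, output_dir, lr, epochs):
    """One causal-LM fine-tuning stage (data collation details omitted)."""
    args = TrainingArguments(output_dir=output_dir, learning_rate=lr,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=4)
    Trainer(model=model, args=args, train_dataset=dataset,
            tokenizer=tokenizer).train()
    return model

def two_stage_translation_finetune(base_model, monolingual_ds, parallel_ds):
    """Stage 1: continued training on monolingual text in the target languages.
    Stage 2: fine-tuning on a small, high-quality parallel corpus."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    model = finetune_stage(model, tokenizer, monolingual_ds,
                           "stage1_mono", lr=2e-5, epochs=1)
    model = finetune_stage(model, tokenizer, parallel_ds,
                           "stage2_parallel", lr=1e-5, epochs=2)
    return model
```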
Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation
paper_authors: Ali Mousavi, Xin Zhan, He Bai, Peng Shi, Theo Rekatsinas, Benjamin Han, Yunyao Li, Jeff Pound, Josh Susskind, Natalie Schluter, Ihab Ilyas, Navdeep Jaitly
for: The paper shows that KG-text datasets generated with different levels of noise can train forward and reverse neural models, but noisier datasets lead to more hallucination and poorer recall.
methods: The paper uses cyclic evaluation of generated text and KGs to assess model performance, comparing the manually created WebNLG with the automatically created TeKGen and T-REx.
results: The paper finds that the noise level of a dataset affects model performance and that the manually created WebNLG outperforms the automatically created TeKGen and T-REx. Datasets built with large language models (LLMs) train models that do well on cyclic text generation but worse on KG generation, probably because they lack a consistent underlying ontology.
Abstract
Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate source KG or text is a proxy for the equivalence between the KG and the text in the dataset. Using cyclic evaluation we find that manually created WebNLG is much better than automatically created TeKGen and T-REx. Guided by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text and show the impact of each of the heuristics on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs), and observe that these are conducive to models that perform significantly well on cyclic generation of text, but less so on cyclic generation of KGs, probably because of a lack of a consistent underlying ontology.
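The cyclic-evaluation idea translates into a short loop: run the forward (KG-to-text) and reverse (text-to-KG) models in sequence and score how well the source is regenerated. The sketch below treats kg_to_text, text_to_kg, and the similarity functions as abstract callables; they are assumed interfaces, not code from the paper.

```python
def cyclic_kg_score(kg, kg_to_text, text_to_kg, kg_similarity):
    """KG -> generated text -> reconstructed KG, scored against the source KG."""
    generated_text = kg_to_text(kg)
    reconstructed_kg = text_to_kg(generated_text)
    return kg_similarity(kg, reconstructed_kg)

def cyclic_text_score(text, text_to_kg, kg_to_text, text_similarity):
    """Text -> extracted KG -> regenerated text, scored against the source text."""
    extracted_kg = text_to_kg(text)
    regenerated_text = kg_to_text(extracted_kg)
    return text_similarity(text, regenerated_text)

def cyclic_evaluate(pairs, kg_to_text, text_to_kg, kg_sim, text_sim):
    """Average both cyclic scores over (kg, text) pairs; higher scores suggest
    better equivalence between the KG and the text in the dataset."""
    kg_scores = [cyclic_kg_score(kg, kg_to_text, text_to_kg, kg_sim)
                 for kg, _ in pairs]
    text_scores = [cyclic_text_score(text, text_to_kg, kg_to_text, text_sim)
                   for _, text in pairs]
    return sum(kg_scores) / len(kg_scores), sum(text_scores) / len(text_scores)
```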
Towards Effective Disambiguation for Machine Translation with Large Language Models
results: Experiments show that the proposed methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
Abstract
Resolving semantic ambiguity has long been recognised as a central challenge in the field of machine translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to capture many of these cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate ambiguous sentences containing polysemous words and rare word senses. We also propose two ways to improve the handling of such ambiguity through in-context learning and fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs for disambiguation during machine translation.
Hate speech detection in algerian dialect using deep learning
paper_authors: Dihia Lanasri, Juan Olano, Sifal Klioui, Sin Liang Lee, Lamia Sekkai
for: To address hate speech detection in Arabic, particularly in the Algerian dialect.
methods: Deep learning architectures are used to classify short messages from Algerian social media as hateful or non-hateful.
results: Experiments on a corpus of more than 13.5K Algerian social media messages yield a reliable hate speech detection approach with promising results.
Abstract
With the proliferation of hate speech on social networks under different formats, such as abusive language, cyberbullying, and violence, etc., people have experienced a significant increase in violence, putting them in uncomfortable situations and threats. Plenty of efforts have been dedicated in the last few years to overcome this phenomenon to detect hate speech in different structured languages like English, French, Arabic, and others. However, a reduced number of works deal with Arabic dialects like Tunisian, Egyptian, and Gulf, mainly the Algerian ones. To fill in the gap, we propose in this work a complete approach for detecting hate speech on online Algerian messages. Many deep learning architectures have been evaluated on the corpus we created from some Algerian social networks (Facebook, YouTube, and Twitter). This corpus contains more than 13.5K documents in Algerian dialect written in Arabic, labeled as hateful or non-hateful. Promising results are obtained, which show the efficiency of our approach.
SpeechAlign: a Framework for Speech Translation Alignment Evaluation
results: By releasing the SpeechAlign framework, the paper provides an accessible evaluation framework for speech models and uses it to benchmark open-source Speech Translation models.
Abstract
Speech-to-Speech and Speech-to-Text translation are currently dynamic areas of research. To contribute to these fields, we present SpeechAlign, a framework to evaluate the underexplored field of source-target alignment in speech models. Our framework has two core components. First, to tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset, built upon a English-German text translation gold alignment dataset. Secondly, we introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER), to evaluate alignment quality in speech models. By publishing SpeechAlign we provide an accessible evaluation framework for model assessment, and we employ it to benchmark open-source Speech Translation models.
Incorporating Singletons and Mention-based Features in Coreference Resolution via Multi-task Learning for Better Generalization
results: The approach achieves new state-of-the-art scores on the OntoGUM benchmark (+2.7 points) and improves robustness on multiple out-of-domain datasets (+2.3 points on average), likely thanks to better mention detection and the use of additional data from singleton mention spans.
Abstract
Previous attempts to incorporate a mention detection step into end-to-end neural coreference resolution for English have been hampered by the lack of singleton mention span data as well as other entity information. This paper presents a coreference model that learns singletons as well as features such as entity type and information status via a multi-task learning-based approach. This approach achieves new state-of-the-art scores on the OntoGUM benchmark (+2.7 points) and increases robustness on multiple out-of-domain datasets (+2.3 points on average), likely due to greater generalizability for mention detection and utilization of more data from singletons when compared to only coreferent mention pair matching.
Examining the Limitations of Computational Rumor Detection Models Trained on Static Datasets
results: The results show that context-based models are still overly dependent on information from the rumors' source posts and overlook the important role of contextual information. The paper also studies how data split strategies affect classifier performance and offers practical suggestions for mitigating temporal concept drift in static datasets.
Abstract
A crucial aspect of a rumor detection model is its ability to generalize, particularly its ability to detect emerging, previously unknown rumors. Past research has indicated that content-based (i.e., using solely source posts as input) rumor detection models tend to perform less effectively on unseen rumors. At the same time, the potential of context-based models remains largely untapped. The main contribution of this paper is in the in-depth evaluation of the performance gap between content and context-based models specifically on detecting new, unseen rumors. Our empirical findings demonstrate that context-based models are still overly dependent on the information derived from the rumors' source post and tend to overlook the significant role that contextual information can play. We also study the effect of data split strategies on classifier performance. Based on our experimental results, the paper also offers practical suggestions on how to minimize the effects of temporal concept drift in static datasets during the training of rumor detection methods.
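One of the paper's practical recommendations, limiting temporal concept drift when training on a static dataset, can be illustrated with a chronological split instead of a random one; the record fields below are assumptions about the data format, not the paper's schema.

```python
from datetime import datetime

def temporal_split(posts, train_frac=0.8, time_key="created_at"):
    """Split rumor posts chronologically so that test events occur strictly
    after training events, instead of shuffling events across time."""
    ordered = sorted(posts, key=lambda p: p[time_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Toy usage with made-up records.
posts = [
    {"id": 1, "text": "claim A", "created_at": datetime(2020, 1, 5)},
    {"id": 2, "text": "claim B", "created_at": datetime(2020, 3, 1)},
    {"id": 3, "text": "claim C", "created_at": datetime(2021, 2, 9)},
]
train, test = temporal_split(posts, train_frac=0.67)
```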
SignBank+: Multilingual Sign Language Translation Dataset
methods: Introduces SignBank+, a cleaned version of the SignBank dataset optimized for machine translation, and adopts a simple text-to-text translation approach.
results: Evaluation shows that models trained on SignBank+ surpass those trained on the original dataset, establishing a new benchmark and providing an open resource for future research.
Abstract
This work advances the field of sign language machine translation by focusing on dataset quality and simplification of the translation system. We introduce SignBank+, a clean version of the SignBank dataset, optimized for machine translation. Contrary to previous works that employ complex factorization techniques for translation, we advocate for a simplified text-to-text translation approach. Our evaluation shows that models trained on SignBank+ surpass those on the original dataset, establishing a new benchmark and providing an open resource for future research.
Hierarchical reinforcement learning with natural language subgoals
results: The method outperforms both agents that clone expert behavior and HRL trained without this supervised goal space, showing that it combines the strengths of human expert supervision and reinforcement learning.
Abstract
Hierarchical reinforcement learning has been a compelling approach for achieving goal directed behavior over long sequences of actions. However, it has been challenging to implement in realistic or open-ended environments. A main challenge has been to find the right space of sub-goals over which to instantiate a hierarchy. We present a novel approach where we use data from humans solving these tasks to softly supervise the goal space for a set of long range tasks in a 3D embodied environment. In particular, we use unconstrained natural language to parameterize this space. This has two advantages: first, it is easy to generate this data from naive human participants; second, it is flexible enough to represent a vast range of sub-goals in human-relevant tasks. Our approach outperforms agents that clone expert behavior on these tasks, as well as HRL from scratch without this supervised sub-goal space. Our work presents a novel approach to combining human expert supervision with the benefits and flexibility of reinforcement learning.
DreamLLM: Synergistic Multimodal Comprehension and Creation
results: DreamLLM yields a zero-shot multimodal generalist capable of free-form generation, achieving superior performance in multimodal experiments thanks to the enhanced learning synergy.
Abstract
This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy.
Controlled Generation with Prompt Insertion for Natural Language Explanations in Grammatical Error Correction
for: The paper proposes a method called controlled generation with Prompt Insertion (PI) so that Large Language Models (LLMs) can explain the reasons for grammatical error corrections directly in natural language.
methods: The paper uses Large Language Models (LLMs) with Prompt Insertion (PI) to generate natural-language explanations of grammatical error corrections.
results: The study finds that PI enables LLMs to explain corrections directly in natural language and improves performance in generating correction reasons.
Abstract
In Grammatical Error Correction (GEC), it is crucial to ensure the user's comprehension of a reason for correction. Existing studies present tokens, examples, and hints as to the basis for correction but do not directly explain the reasons for corrections. Although methods that use Large Language Models (LLMs) to provide direct explanations in natural language have been proposed for various tasks, no such method exists for GEC. Generating explanations for GEC corrections involves aligning input and output tokens, identifying correction points, and presenting corresponding explanations consistently. However, it is not straightforward to specify a complex format to generate explanations, because explicit control of generation is difficult with prompts. This study introduces a method called controlled generation with Prompt Insertion (PI) so that LLMs can explain the reasons for corrections in natural language. In PI, LLMs first correct the input text, and then we automatically extract the correction points based on the rules. The extracted correction points are sequentially inserted into the LLM's explanation output as prompts, guiding the LLMs to generate explanations for the correction points. We also create an Explainable GEC (XGEC) dataset of correction reasons by annotating NUCLE, CoNLL2013, and CoNLL2014. Although generations from GPT-3 and ChatGPT using original prompts miss some correction points, the generation control using PI can explicitly guide to describe explanations for all correction points, contributing to improved performance in generating correction reasons.
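A rough sketch of the Prompt Insertion workflow described above: obtain the corrected text, extract correction points by aligning source and correction, then feed each point back as a prompt asking the model to explain it. The diff-based extraction and the ask_llm callable are illustrative assumptions, not the paper's rule set or prompts.

```python
import difflib

def extract_correction_points(source, corrected):
    """Align source and corrected text; return (original, replacement) pairs."""
    src, tgt = source.split(), corrected.split()
    points = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, src, tgt).get_opcodes():
        if tag != "equal":
            points.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return points

def explain_corrections(source, corrected, ask_llm):
    """Insert each correction point into the prompt so the model explains it."""
    explanations = []
    for original, replacement in extract_correction_points(source, corrected):
        prompt = (f"Sentence: {source}\nCorrection: {corrected}\n"
                  f"Explain why '{original}' was changed to '{replacement}':")
        explanations.append(ask_llm(prompt))
    return explanations
```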
Kosmos-2.5: A Multimodal Literate Model
results: The model excels at end-to-end document-level text recognition and image-to-markdown text generation, can be adapted to a wide range of text-intensive image understanding tasks, and can be tailored to different application scenarios through fine-tuning.
Abstract
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
Safurai 001: New Qualitative Approach for Code LLM Evaluation
results: The study shows that Safurai-001 surpasses GPT-3.5 by 1.58% and WizardCoder by 18.78% on the Code Readability parameter.
Abstract
This paper presents Safurai-001, a new Large Language Model (LLM) with significant potential in the domain of coding assistance. Driven by recent advancements in coding LLMs, Safurai-001 competes in performance with the latest models like WizardCoder [Xu et al., 2023], PanguCoder [Shen et al., 2023] and Phi-1 [Gunasekar et al., 2023] but aims to deliver a more conversational interaction. By capitalizing on the progress in data engineering (including latest techniques of data transformation and prompt engineering) and instruction tuning, this new model promises to stand toe-to-toe with recent closed and open source developments. Recognizing the need for an efficacious evaluation metric for coding LLMs, this paper also introduces GPT4-based MultiParameters, an evaluation benchmark that harnesses varied parameters to present a comprehensive insight into the models functioning and performance. Our assessment shows that Safurai-001 can outperform GPT-3.5 by 1.58% and WizardCoder by 18.78% in the Code Readability parameter and more.
Studying Lobby Influence in the European Parliament
results: The results reveal interpretable links between lobbies and MEPs in the European Parliament's law-making process. An aggregate analysis of the links between groups of related lobbies and political groups of MEPs matches expectations from the groups' ideologies (e.g., center-left groups are associated with social causes). The study's methodology, datasets, and results are a step towards greater transparency of the intricate decision-making processes within democratic institutions.
Abstract
We present a method based on natural language processing (NLP), for studying the influence of interest groups (lobbies) in the law-making process in the European Parliament (EP). We collect and analyze novel datasets of lobbies' position papers and speeches made by members of the EP (MEPs). By comparing these texts on the basis of semantic similarity and entailment, we are able to discover interpretable links between MEPs and lobbies. In the absence of a ground-truth dataset of such links, we perform an indirect validation by comparing the discovered links with a dataset, which we curate, of retweet links between MEPs and lobbies, and with the publicly disclosed meetings of MEPs. Our best method achieves an AUC score of 0.77 and performs significantly better than several baselines. Moreover, an aggregate analysis of the discovered links, between groups of related lobbies and political groups of MEPs, correspond to the expectations from the ideology of the groups (e.g., center-left groups are associated with social causes). We believe that this work, which encompasses the methodology, datasets, and results, is a step towards enhancing the transparency of the intricate decision-making processes within democratic institutions.
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
paper_authors: Atakan Kara, Farrin Marouf Sofian, Andrew Bond, Gözde Gül Şahin
for: The paper presents a synthetic data generation pipeline that produces high-quality parallel data to address the data scarcity problem in Turkish natural language processing tasks.
methods: The pipeline implements more than 20 expert-curated grammar and spelling rules through complex transformation functions and derives 130,000 high-quality parallel sentences from professionally edited articles.
results: The paper establishes strong results with three baselines (neural machine translation, sequence tagging, and prefix tuning) and demonstrates the transferability and robustness of the approach through detailed experiments on out-of-domain data.
Abstract
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
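The synthetic-data idea, applying writing rules as transformation functions to clean sentences to obtain (noisy, clean) pairs, can be sketched generically; the two toy rules below are simplified stand-ins, not the 20+ expert-curated Turkish rules implemented in GECTurk.

```python
import random

def drop_apostrophe(sentence):
    """Toy rule: delete the apostrophe separating a proper noun from its suffix
    (e.g. "Ankara'da" -> "Ankarada"), a common Turkish spelling error pattern."""
    return sentence.replace("'", "")

def lowercase_sentence_start(sentence):
    """Toy rule: lowercase the sentence-initial letter."""
    return sentence[:1].lower() + sentence[1:] if sentence else sentence

RULES = [drop_apostrophe, lowercase_sentence_start]

def corrupt(sentence, rules=RULES, p=0.5, seed=None):
    """Apply each transformation with probability p to a clean sentence,
    producing a (noisy, clean) parallel pair for GEC training."""
    rng = random.Random(seed)
    noisy = sentence
    for rule in rules:
        if rng.random() < p:
            noisy = rule(noisy)
    return noisy, sentence
```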
Improving Article Classification with Edge-Heterogeneous Graph Neural Networks
results: The results show that edge-heterogeneous graphs improve the performance of GNN models and allow simple, shallow GNN pipelines to match more complex architectures. The approach achieves a top-15 result in the OGB competition on ogbn-arxiv (74.61% accuracy) and performs on par with state-of-the-art GNN architectures on the PubMed dataset (89.88% accuracy).
Abstract
Classifying research output into context-specific label taxonomies is a challenging and relevant downstream task, given the volume of existing and newly published articles. We propose a method to enhance the performance of article classification by enriching simple Graph Neural Networks (GNN) pipelines with edge-heterogeneous graph representations. SciBERT is used for node feature generation to capture higher-order semantics within the articles' textual metadata. Fully supervised transductive node classification experiments are conducted on the Open Graph Benchmark (OGB) ogbn-arxiv dataset and the PubMed diabetes dataset, augmented with additional metadata from Microsoft Academic Graph (MAG) and PubMed Central, respectively. The results demonstrate that edge-heterogeneous graphs consistently improve the performance of all GNN models compared to the edge-homogeneous graphs. The transformed data enable simple and shallow GNN pipelines to achieve results on par with more complex architectures. On ogbn-arxiv, we achieve a top-15 result in the OGB competition with a 2-layer GCN (accuracy 74.61%), being the highest-scoring solution with sub-1 million parameters. On PubMed, we closely trail SOTA GNN architectures using a 2-layer GraphSAGE by including additional co-authorship edges in the graph (accuracy 89.88%). The implementation is available at: $\href{https://github.com/lyvykhang/edgehetero-nodeproppred}{\text{https://github.com/lyvykhang/edgehetero-nodeproppred}$.
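For illustration, an edge-heterogeneous citation graph with SciBERT-style node features might be assembled in PyTorch Geometric roughly as follows; the relation names, feature dimension, and toy edges are assumptions, not the datasets used in the paper.

```python
import torch
from torch_geometric.data import HeteroData

num_papers, feat_dim = 4, 768            # e.g. SciBERT embeddings as node features

data = HeteroData()
data['paper'].x = torch.randn(num_papers, feat_dim)   # stand-in for SciBERT features
data['paper'].y = torch.tensor([0, 1, 0, 2])          # article class labels

# Several edge *types* between the same node type make the graph edge-heterogeneous.
data['paper', 'cites', 'paper'].edge_index = torch.tensor([[0, 1, 2],
                                                           [1, 2, 3]])
data['paper', 'shares_author', 'paper'].edge_index = torch.tensor([[0, 2],
                                                                   [3, 3]])
data['paper', 'same_venue', 'paper'].edge_index = torch.tensor([[1],
                                                                [3]])
```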
Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition
paper_authors: Ahmed Amine Ben Abdallah, Ata Kabboudi, Amir Kanoun, Salah Zaiem
for: This paper is written for the purpose of developing an effective Automatic Speech Recognition (ASR) solution for dialects, specifically focusing on the Tunisian dialect.
methods: The paper explores self-supervision, semi-supervision, and few-shot code-switching approaches to improve the state-of-the-art in ASR for Tunisian Arabic, English, and French.
results: The paper produces human evaluations of transcripts to avoid the noise coming from spelling inadequacies in testing references, and the models are able to transcribe audio samples in a linguistic mix involving Tunisian Arabic, English, and French. The data used during training and testing are released for public use and further improvements.
Abstract
Crafting an effective Automatic Speech Recognition (ASR) solution for dialects demands innovative approaches that not only address the data scarcity issue but also navigate the intricacies of linguistic diversity. In this paper, we address the aforementioned ASR challenge, focusing on the Tunisian dialect. First, textual and audio data is collected and in some cases annotated. Second, we explore self-supervision, semi-supervision and few-shot code-switching approaches to push the state-of-the-art on different Tunisian test sets; covering different acoustic, linguistic and prosodic conditions. Finally, and given the absence of conventional spelling, we produce a human evaluation of our transcripts to avoid the noise coming from spelling inadequacies in our testing references. Our models, allowing to transcribe audio samples in a linguistic mix involving Tunisian Arabic, English and French, and all the data used during training and testing are released for public use and further improvements.
DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services
results: Quantitative and qualitative evaluation on the DISC-Law-Eval benchmark demonstrates the system's effectiveness across diverse legal scenarios. Detailed resources are available at https://github.com/FudanDISC/DISC-LawLLM.
Abstract
We propose DISC-LawLLM, an intelligent legal system utilizing large language models (LLMs) to provide a wide range of legal services. We adopt legal syllogism prompting strategies to construct supervised fine-tuning datasets in the Chinese Judicial domain and fine-tune LLMs with legal reasoning capability. We augment LLMs with a retrieval module to enhance models' ability to access and utilize external legal knowledge. A comprehensive legal benchmark, DISC-Law-Eval, is presented to evaluate intelligent legal systems from both objective and subjective dimensions. Quantitative and qualitative results on DISC-Law-Eval demonstrate the effectiveness of our system in serving various users across diverse legal scenarios. The detailed resources are available at https://github.com/FudanDISC/DISC-LawLLM.
The Wizard of Curiosities: Enriching Dialogues with Fun Facts
results: An A/B test over 1,000 conversations shows that curiosities not only increase user engagement but also provide an average relative rating improvement of 9.7%.
Abstract
Introducing curiosities in a conversation is a way to teach something new to the person in a pleasant and enjoyable way. Enriching dialogues with contextualized curiosities can improve the users' perception of a dialog system and their overall user experience. In this paper, we introduce a set of curated curiosities, targeting dialogues in the cooking and DIY domains. In particular, we use real human-agent conversations collected in the context of the Amazon Alexa TaskBot challenge, a multimodal and multi-turn conversational setting. According to an A/B test with over 1000 conversations, curiosities not only increase user engagement, but provide an average relative rating improvement of 9.7%.
The Scenario Refiner: Grounding subjects in images at the morphological level
results: The study finds that the language models' predictions differ from human participants' judgements, in particular displaying a grammatical bias.
Abstract
Derivationally related words, such as "runner" and "running", exhibit semantic differences which also elicit different visual scenarios. In this paper, we ask whether Vision and Language (V\&L) models capture such distinctions at the morphological level, using a new methodology and dataset. We compare the results from V\&L models to human judgements and find that models' predictions differ from those of human participants, in particular displaying a grammatical bias. We further investigate whether the human-model misalignment is related to model architecture. Our methodology, developed on one specific morphological contrast, can be further extended for testing models on capturing other nuanced language features.
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
results: Experiments show that the OpenChat framework and the C-RLFT method improve the performance of open-source language models, achieving the highest average performance across three standard benchmarks.
Abstract
Nowadays, open-source large language models like LLaMA have emerged. Recent developments have incorporated supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to align these models with human goals. However, SFT methods treat all training data with mixed quality equally, while RLFT methods require high-quality pairwise or ranking-based preference data. In this study, we present a novel framework, named OpenChat, to advance open-source language models with mixed-quality data. Specifically, we consider the general SFT training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. Interestingly, the optimal policy in C-RLFT can be easily solved through single-stage, RL-free supervised learning, which is lightweight and avoids costly human preference labeling. Through extensive experiments on three standard benchmarks, our openchat-13b fine-tuned with C-RLFT achieves the highest average performance among all 13b open-source language models. Moreover, we use AGIEval to validate the model generalization performance, in which only openchat-13b surpasses the base model. Finally, we conduct a series of analyses to shed light on the effectiveness and robustness of OpenChat. Our code, data, and models are publicly available at https://github.com/imoneoi/openchat.
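The core of C-RLFT as described, treating data sources as coarse-grained reward labels and learning a class-conditioned policy with plain supervised learning, can be sketched as conditioning each example on a source tag and weighting its loss; the tags, weights, and Hugging Face-style model interface are assumptions, not the paper's exact scheme.

```python
import torch

# Assumed coarse-grained "reward" per data source (expert vs. sub-optimal).
SOURCE_WEIGHT = {"expert": 1.0, "suboptimal": 0.3}
SOURCE_TAG = {"expert": "<|expert|>", "suboptimal": "<|suboptimal|>"}

def c_rlft_step(model, tokenizer, batch, optimizer, device="cpu"):
    """One supervised step of a class-conditioned policy: prepend a source tag
    (the condition) and scale the token-level loss by the source's weight.
    `model`/`tokenizer` are assumed to follow the Hugging Face causal-LM API."""
    losses = []
    for ex in batch:
        text = SOURCE_TAG[ex["source"]] + ex["prompt"] + ex["response"]
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        out = model(ids, labels=ids)                  # standard causal-LM loss
        losses.append(SOURCE_WEIGHT[ex["source"]] * out.loss)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```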
Speak While You Think: Streaming Speech Synthesis During Text Generation
results: Experimental results show that LLM2Speech maintains the teacher model's quality while reducing latency to enable natural voice conversations.
Abstract
Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using Text-To-Speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. We propose LLM2Speech, an architecture to synthesize speech while text is being generated by an LLM which yields significant latency reduction. LLM2Speech mimics the predictions of a non-streaming teacher model while limiting the exposure to future context in order to enable streaming. It exploits the hidden embeddings of the LLM, a by-product of the text generation that contains informative semantic context. Experimental results show that LLM2Speech maintains the teacher's quality while reducing the latency to enable natural conversations.
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
paper_authors: Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag
for: This paper aims to provide a fair comparison of language modeling methods based on their empirical scaling trends, and to serve as a foundation for meaningful and reproducible research in the field.
methods: The paper introduces an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours, and uses a pre-processed dataset of books to evaluate the methods.
results: The paper shows that the LSTM baseline exhibits a predictable and more favourable scaling law than the GPT baseline, and that the two models intersect at roughly 50,000 accelerator hours.
Abstract
The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws leads of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.
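The protocol's bookkeeping, fixing a training-token budget from measured throughput and the chosen compute class, is simple arithmetic; the throughput numbers below are invented for illustration and are not the paper's measurements.

```python
def token_budget(tokens_per_second, accelerator_hours):
    """Tokens a model may train on in a given compute class:
    measured throughput (tokens/s on the reference accelerator) x allotted hours."""
    return int(tokens_per_second * accelerator_hours * 3600)

# Illustrative throughputs only.
gpt_like_throughput = 20_000    # tokens/s
lstm_like_throughput = 200_000  # ten-fold throughput, as described for the LSTM baseline

for hours in (6, 48, 384):
    print(hours, token_budget(gpt_like_throughput, hours),
          token_budget(lstm_like_throughput, hours))
```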
Assessment of Pre-Trained Models Across Languages and Grammars
paper_authors: Alberto Muñoz-Ortiz, David Vilares, Carlos Gómez-Rodríguez
for: The study assesses how multilingual large language models (LLMs) learn syntactic structure.
methods: The study casts parsing as sequence labeling to recover multi-formalism syntactic structures (constituent and dependency).
results: The study finds that: (i) the framework is consistent across encodings; (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies; (iii) sub-word tokenization, in contrast to character-based models, is needed to represent syntax; and (iv) how often a language occurs in the pretraining data matters more than the amount of task data when recovering syntax from the word vectors.
Abstract
We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
Prototype of a robotic system to assist the learning process of English language with text-generation through DNN
paper_authors: Carlos Morales-Torres, Mario Campos-Soberanis, Diego Campos-Sobrino
for: The paper aims to help self-directed learners of English improve their proficiency.
methods: The system uses a Long Short Term Memory (LSTM) neural network to generate text; learners interact with it through a graphical user interface, and the generated text is matched to the learner's English level.
results: Experimental results show an increase in the grammatical range of learners who interacted with the system.
Abstract
In recent years, there has been significant growth in the field of Natural Language Processing (NLP) for performing multiple tasks, including English Language Teaching (ELT). An effective strategy to favor the learning process uses interactive devices to engage learners in their self-learning process. In this work, we present a working prototype of a humanoid robotic system to assist English language self-learners through text generation using Long Short Term Memory (LSTM) Neural Networks. The learners interact with the system using a Graphic User Interface that generates text according to the English level of the user. The experimentation was conducted with English learners and the results were measured according to the International English Language Testing System (IELTS) rubric. Preliminary results show an increase in the Grammatical Range of learners who interacted with the system.
K-pop Lyric Translation: Dataset, Analysis, and Neural-Modelling
results: The study identifies distinctive characteristics of K-pop lyric translation that set it apart from other extensively studied genres and builds a neural lyric translation model, underscoring the importance of a dataset dedicated to singable lyric translation.
Abstract
Lyric translation, a field studied for over a century, is now attracting computational linguistics researchers. We identified two limitations in previous studies. Firstly, lyric translation studies have predominantly focused on Western genres and languages, with no previous study centering on K-pop despite its popularity. Second, the field of lyric translation suffers from a lack of publicly available datasets; to the best of our knowledge, no such dataset exists. To broaden the scope of genres and languages in lyric translation studies, we introduce a novel singable lyric translation dataset, approximately 89\% of which consists of K-pop song lyrics. This dataset aligns Korean and English lyrics line-by-line and section-by-section. We leveraged this dataset to unveil unique characteristics of K-pop lyric translation, distinguishing it from other extensively studied genres, and to construct a neural lyric translation model, thereby underscoring the importance of a dedicated dataset for singable lyric translations.
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
for: This paper focuses on improving text-video retrieval, which is essential for video filtering, recommendation, and search, due to the increasing amount of web videos.
methods: The paper proposes two novel techniques to improve contrastive learning for text-video retrieval: 1) Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs, and 2) Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples.
results: The proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo, and ActivityNet.
Abstract
In recent years, the explosion of web videos makes text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
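In the spirit of the negative-aware objective described above, a compact sketch of an InfoNCE loss that up-weights the hardest negatives is shown below; the weighting scheme and hyperparameters are assumptions, not the paper's NegNCE formulation.

```python
import torch
import torch.nn.functional as F

def negative_aware_infonce(text_emb, video_emb, temperature=0.05,
                           hard_weight=2.0, top_k=3):
    """Text-to-video InfoNCE where, for each text anchor, the top-k most similar
    non-matching videos (hard negatives) are up-weighted in the denominator."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    sim = t @ v.t() / temperature                       # (B, B) similarity logits
    b = sim.size(0)
    diag = torch.eye(b, dtype=torch.bool)

    # Per-row weights: 1 for every pair, hard_weight for the hardest negatives.
    weights = torch.ones_like(sim)
    neg_only = sim.masked_fill(diag, float('-inf'))
    hard_idx = neg_only.topk(min(top_k, b - 1), dim=1).indices
    weights.scatter_(1, hard_idx, hard_weight)

    # Weighted softmax denominator: logsumexp(log w + sim).
    log_denominator = torch.logsumexp(weights.log() + sim, dim=1)
    return (log_denominator - sim.diagonal()).mean()
```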
UniPCM: Universal Pre-trained Conversation Model with Task-aware Automatic Prompt
results: Using the high-quality prompts, the training corpus of the pre-trained conversation model is scaled to 122 datasets from 15 dialog-related tasks, yielding strong performance across a variety of dialog tasks and different dialog systems.
Abstract
Recent research has shown that multi-task pre-training greatly improves the model's robustness and transfer ability, which is crucial for building a high-quality dialog system. However, most previous works on multi-task pre-training rely heavily on human-defined input format or prompt, which is not optimal in quality and quantity. In this work, we propose to use Task-based Automatic Prompt generation (TAP) to automatically generate high-quality prompts. Using the high-quality prompts generated, we scale the corpus of the pre-trained conversation model to 122 datasets from 15 dialog-related tasks, resulting in Universal Pre-trained Conversation Model (UniPCM), a powerful foundation model for various conversational tasks and different dialog systems. Extensive experiments have shown that UniPCM is robust to input prompts and capable of various dialog-related tasks. Moreover, UniPCM has strong transfer ability and excels at low resource scenarios, achieving SOTA results on 9 different datasets ranging from task-oriented dialog to open-domain conversation. Furthermore, we are amazed to find that TAP can generate prompts on par with those collected with crowdsourcing. The code is released with the paper.
XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates
results: By evaluating existing open and closed large language models on the benchmark, the paper demonstrates the effectiveness of instruction tuning and the impact of the underlying architecture across editing tasks. Extensive experiments further show the importance of fine-grained explanations for text editing.Abstract
Text editing is a crucial task that involves modifying text to better align with user intents. However, existing text editing benchmark datasets have limitations in providing only coarse-grained instructions. Consequently, although the edited output may seem reasonable, it often deviates from the intended changes outlined in the gold reference, resulting in low evaluation scores. To comprehensively investigate the text editing capabilities of large language models, this paper introduces XATU, the first benchmark specifically designed for fine-grained instruction-based explainable text editing. XATU covers a wide range of topics and text types, incorporating lexical, syntactic, semantic, and knowledge-intensive edits. To enhance interpretability, we leverage high-quality data sources and human annotation, resulting in a benchmark that includes fine-grained instructions and gold-standard edit explanations. By evaluating existing open and closed large language models against our benchmark, we demonstrate the effectiveness of instruction tuning and the impact of underlying architecture across various editing tasks. Furthermore, extensive experimentation reveals the significant role of explanations in fine-tuning language models for text editing tasks. The benchmark will be open-sourced to support reproduction and facilitate future research.
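The sketch below illustrates, under assumed field names, what a fine-grained instruction-based editing instance with a gold explanation might look like and how it could be turned into an evaluation prompt; it is not XATU's actual schema.

```python
# Illustrative sketch only: a hypothetical record layout for fine-grained,
# instruction-based text editing with a gold explanation, plus a prompt builder
# for evaluating an LLM on it. Field names are assumptions, not XATU's schema.
from dataclasses import dataclass

@dataclass
class EditInstance:
    source: str        # original text
    instruction: str   # fine-grained edit instruction
    gold_edit: str     # reference edited text
    explanation: str   # gold rationale for the edit

def build_prompt(inst: EditInstance, with_explanation: bool = True) -> str:
    """Compose an editing prompt; optionally ask the model for a rationale too."""
    prompt = (f"Edit the text according to the instruction.\n"
              f"Instruction: {inst.instruction}\n"
              f"Text: {inst.source}\n")
    if with_explanation:
        prompt += "First explain which changes are needed, then give the edited text.\n"
    return prompt

inst = EditInstance(
    source="The experiment were conducted on two dataset.",
    instruction="Fix subject-verb agreement and pluralization errors only.",
    gold_edit="The experiments were conducted on two datasets.",
    explanation="'experiment were' violates agreement; 'two dataset' needs the plural form.",
)
print(build_prompt(inst))
```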
fakenewsbr: A Fake News Detection Platform for Brazilian Portuguese
results: The proposed approach achieves high accuracy and F1-score, demonstrating its effectiveness in detecting fake news. In addition, we developed a user-friendly web platform, fakenewsbr.com, that lets users analyze the veracity of news articles in real time.Abstract
The proliferation of fake news has become a significant concern in recent times due to its potential to spread misinformation and manipulate public opinion. This paper presents a comprehensive study on detecting fake news in Brazilian Portuguese, focusing on journalistic-type news. We propose a machine learning-based approach that leverages natural language processing techniques, including TF-IDF and Word2Vec, to extract features from textual data. We evaluate the performance of various classification algorithms, such as logistic regression, support vector machine, random forest, AdaBoost, and LightGBM, on a dataset containing both true and fake news articles. The proposed approach achieves high accuracy and F1-Score, demonstrating its effectiveness in identifying fake news. Additionally, we developed a user-friendly web platform, fakenewsbr.com, to facilitate the verification of news articles' veracity. Our platform provides real-time analysis, allowing users to assess the likelihood of fake news articles. Through empirical analysis and comparative studies, we demonstrate the potential of our approach to contribute to the fight against the spread of fake news and promote more informed media consumption.
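As a minimal sketch of the TF-IDF-based classification setup the paper evaluates, the following scikit-learn pipeline pairs a TF-IDF vectorizer with logistic regression on a toy placeholder corpus; the data and hyperparameters are illustrative only, and the actual study compares several classifiers on a Portuguese true/fake news dataset.

```python
# Illustrative sketch only: a TF-IDF + logistic regression baseline of the kind
# evaluated in the paper, using scikit-learn. The toy corpus and hyperparameters
# are placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = ["Governo anuncia novo programa de vacinacao em todo o pais",
         "Cientistas confirmam que a Terra e plana, diz site anonimo"] * 50   # placeholder corpus
labels = [0, 1] * 50                                                          # 0 = true, 1 = fake

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```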
Localize, Retrieve and Fuse: A Generalized Framework for Free-Form Question Answering over Tables
results: Experiments show that TAG-QA generates more faithful and complete answers than the baselines, in particular the pipeline-based baseline TAPAS and the end-to-end model T5. TAG-QA surpasses TAPAS by 17% and 14% in BLEU-4 and PARENT F-score, respectively, and outperforms T5 by 16% and 12% on the same metrics.Abstract
Question answering on tabular data (a.k.a. TableQA), which aims at generating answers to questions grounded on a provided table, has gained significant attention recently. Prior work primarily produces concise factual responses through information extraction from individual or limited table cells, lacking the ability to reason across diverse table cells. Yet, the realm of free-form TableQA, which demands intricate strategies for selecting relevant table cells and the sophisticated integration and inference of discrete data fragments, remains mostly unexplored. To this end, this paper proposes a generalized three-stage approach: Table-to-Graph conversion and cell localization, external knowledge retrieval, and the fusion of table and text (called TAG-QA), to address the challenge of inferring long free-form answers in generative TableQA. In particular, TAG-QA (1) locates relevant table cells using a graph neural network to gather intersecting cells between relevant rows and columns, (2) leverages external knowledge from Wikipedia, and (3) generates answers by integrating both tabular data and natural linguistic information. Experiments showcase the superior capabilities of TAG-QA in generating sentences that are both faithful and coherent, particularly when compared to several state-of-the-art baselines. Notably, TAG-QA surpasses the robust pipeline-based baseline TAPAS by 17% and 14% in terms of BLEU-4 and PARENT F-score, respectively. Furthermore, TAG-QA outperforms the end-to-end model T5 by 16% and 12% on BLEU-4 and PARENT F-score, respectively.
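The sketch below shows, under simplifying assumptions, how a table can be converted into a graph of row, column, and cell nodes and how intersecting cells are gathered once relevant rows and columns are known; in TAG-QA the relevance scores come from a graph neural network, whereas here they are supplied by hand.

```python
# Illustrative sketch only: building a table graph whose nodes are rows, columns,
# and cells, then keeping the cells at the intersection of relevant rows and
# columns. In TAG-QA the relevance scores come from a GNN; here they are fixed.
import networkx as nx

table = [["Player", "Team", "Goals"],
         ["Ana",    "Lions", "12"],
         ["Bruno",  "Bears", "7"]]

G = nx.Graph()
header, rows = table[0], table[1:]
for i, row in enumerate(rows):
    G.add_node(("row", i))
    for j, value in enumerate(row):
        G.add_node(("col", j), name=header[j])
        G.add_node(("cell", i, j), value=value)
        # each cell is linked to its row and its column
        G.add_edge(("cell", i, j), ("row", i))
        G.add_edge(("cell", i, j), ("col", j))

# Pretend the GNN scored row 0 ("Ana") and columns 0 and 2 ("Player", "Goals")
# as relevant to the question "How many goals did Ana score?"
relevant_rows, relevant_cols = {0}, {0, 2}
selected = [G.nodes[("cell", i, j)]["value"]
            for i in relevant_rows for j in relevant_cols]
print(selected)   # ['Ana', '12'] -> passed on to the fusion/generation stage
```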
Heterogeneous Entity Matching with Complex Attribute Associations using BERT and Neural Networks
for: Addressing the challenges of entity matching in heterogeneous data with complex attribute relationships.
methods: Utilizing a novel entity matching model, EMM-CCAR, built upon pre-trained models, with attention mechanisms to capture complex relationships between attributes.
results: Achieving improvements of approximately 4% and 1% in F1 scores compared to prevalent DER-SSM and Ditto approaches, respectively, demonstrating the effectiveness of the proposed model in handling complex attribute relationships.Abstract
Across various domains, data from different sources such as Baidu Baike and Wikipedia often manifest in distinct forms. Current entity matching methodologies predominantly focus on homogeneous data, characterized by attributes that share the same structure and concise attribute values. However, this orientation poses challenges in handling data with diverse formats. Moreover, prevailing approaches aggregate the similarity of attribute values between corresponding attributes to ascertain entity similarity. Yet, they often overlook the intricate interrelationships between attributes, where one attribute may have multiple associations. The simplistic approach of pairwise attribute comparison fails to harness the wealth of information encapsulated within entities. To address these challenges, we introduce a novel entity matching model, dubbed Entity Matching Model for Capturing Complex Attribute Relationships (EMM-CCAR), built upon pre-trained models. Specifically, this model transforms the matching task into a sequence matching problem to mitigate the impact of varying data formats. Moreover, by introducing attention mechanisms, it identifies complex relationships between attributes, emphasizing the degree of matching among multiple attributes rather than one-to-one correspondences. Through the integration of the EMM-CCAR model, we adeptly surmount the challenges posed by data heterogeneity and intricate attribute interdependencies. In comparison with the prevalent DER-SSM and Ditto approaches, our model achieves improvements of approximately 4% and 1% in F1 scores, respectively. This furnishes a robust solution for addressing the intricacies of attribute complexity in entity matching.
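As a hedged illustration of the sequence-matching reformulation, the sketch below serializes two heterogeneous records into one marked-up sequence that a pre-trained cross-encoder could consume; the [COL]/[VAL] marker scheme and the example records are assumptions, not EMM-CCAR's exact input format.

```python
# Illustrative sketch only: serializing two heterogeneous entity records into one
# marked-up sequence so a pre-trained cross-encoder can attend across all
# attribute pairs at once, rather than comparing attributes one-to-one. The
# [COL]/[VAL] marker scheme is an assumption for illustration.
def serialize(record: dict) -> str:
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

left = {"title": "iPhone 13 Pro 128GB graphite",
        "brand": "Apple",
        "specs": "6.1-inch display; A15 chip"}
right = {"name": "Apple iPhone 13 Pro (128 GB)",
         "description": "Smartphone with 6.1-inch screen and A15 Bionic"}

pair = f"{serialize(left)} [SEP] {serialize(right)}"
print(pair)
# A BERT-style model would then consume the two serialized strings as a sentence
# pair, with a classification head predicting match / non-match and attention
# weights indicating which attributes align with which.
```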
Named Entity Recognition via Machine Reading Comprehension: A Multi-Task Learning Approach
results: Experiments on both nested NER and flat NER datasets show that Multi-NER achieves better performance on all datasets.Abstract
Named Entity Recognition (NER) aims to extract and classify entity mentions in text into pre-defined types (e.g., organization or person name). Recently, many works have proposed to formulate NER as a machine reading comprehension problem (also termed MRC-based NER), in which entity recognition is achieved by answering questions related to pre-defined entity types through MRC, based on the context. However, these works ignore the label dependencies among entity types, which are critical for precisely recognizing named entities. In this paper, we propose to incorporate the label dependencies among entity types into a multi-task learning framework for better MRC-based NER. We decompose MRC-based NER into multiple tasks and use a self-attention module to capture label dependencies. Comprehensive experiments on both nested NER and flat NER datasets are conducted to validate the effectiveness of the proposed Multi-NER. Experimental results show that Multi-NER can achieve better performance on all datasets.
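The sketch below illustrates the MRC reformulation: the same context is paired with one natural-language question per entity type, and a span-extraction model answers each question. The question wording is a hypothetical choice, and the label-dependency self-attention of Multi-NER is not shown.

```python
# Illustrative sketch only: recasting NER as machine reading comprehension by
# pairing one context with one question per entity type. The question wording
# is a hypothetical choice; Multi-NER additionally treats each type as a
# separate task and models label dependencies with self-attention.
ENTITY_QUESTIONS = {
    "PER": "Which words refer to a person?",
    "ORG": "Which words refer to an organization?",
    "LOC": "Which words refer to a location?",
}

def build_mrc_inputs(context: str) -> list[tuple[str, str, str]]:
    """Return (entity_type, question, context) triples, one per entity type.
    A span-extraction model answers each question; the answered spans become
    the predicted entities of that type."""
    return [(etype, question, context) for etype, question in ENTITY_QUESTIONS.items()]

for etype, q, ctx in build_mrc_inputs("Tim Cook visited the Apple campus in Cupertino."):
    print(f"{etype}: {q} | {ctx}")
```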
Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model
results: Experimental results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.Abstract
This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously, which more closely aligns with the human speech production process compared to the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules. We hypothesize that Large Language Models (LLMs) with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We conduct two sets of experiments: 1) Prosodic structure prediction, a typical front-end task in TTS, demonstrating the speech understanding ability of LLMs, and 2) Further integrating dialogue response and a wide array of linguistic features using a unified encoding format. Our results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.
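As a loose illustration of a unified encoding, the sketch below interleaves response tokens with prosodic boundary tags so that a single decoder could generate both at once; the tag inventory and bracket format are assumptions for illustration, not the paper's actual encoding.

```python
# Illustrative sketch only: one way to linearize a dialogue response together
# with prosodic-structure labels into a single target sequence for an LLM.
# The tag inventory (1 = prosodic word, 2 = phrase, 3 = intonational phrase
# boundary) and the <B*> format are assumptions, not the paper's encoding.
response_words = ["Sure", ",", "I", "can", "book", "that", "table", "for", "you", "."]
# Hypothetical prosodic boundary label after each word (0 = no boundary).
boundaries =     [3,      0,   0,   0,     1,      0,      2,       0,     3,     0]

def encode_with_prosody(words, labels) -> str:
    """Interleave words with boundary tags so response text and prosodic
    structure are predicted jointly by a single decoder."""
    pieces = []
    for w, b in zip(words, labels):
        pieces.append(w)
        if b:
            pieces.append(f"<B{b}>")
    return " ".join(pieces)

print(encode_with_prosody(response_words, boundaries))
# -> "Sure <B3> , I can book <B1> that table <B2> for you <B3> ."
```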