cs.CL - 2023-11-14

DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Pre-trained Language Models

  • paper_url: http://arxiv.org/abs/2311.08598
  • repo_url: None
  • paper_authors: Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu
  • for: This work aims to improve the attack success rate of adversarial attack methods against pre-trained language models and to evaluate their effectiveness when detection methods are in place.
  • methods: It proposes a Distribution-Aware LoRA-based Adversarial Attack (DALA) method, which accounts for the distribution shift of adversarial examples to improve attack effectiveness under detection.
  • results: Experiments on four widely-used datasets show that DALA improves attack effectiveness, achieving strong results on both the ASR and NASR evaluation metrics.
    Abstract Pre-trained language models (PLMs) that achieve success in applications are susceptible to adversarial attack methods that are capable of generating adversarial examples with minor perturbations. Although recent attack methods can achieve a relatively high attack success rate (ASR), our observation shows that the generated adversarial examples have a different data distribution compared with the original examples. Specifically, these adversarial examples exhibit lower confidence levels and higher distance to the training data distribution. As a result, they are easy to detect using very simple detection methods, diminishing the actual effectiveness of these attack methods. To solve this problem, we propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method, which considers the distribution shift of adversarial examples to improve attack effectiveness under detection methods. We further design a new evaluation metric NASR combining ASR and detection for the attack task. We conduct experiments on four widely-used datasets and validate the attack effectiveness on ASR and NASR of the adversarial examples generated by DALA on the BERT-base model and the black-box LLaMA2-7b model.
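
To make the evaluation concrete, here is a minimal sketch of how an ASR/NASR-style pair of metrics can be computed, assuming NASR counts only attacks that both fool the victim model and evade the detector (the paper's exact definition may differ):

```python
def attack_rates(results):
    """results: list of dicts with 'fooled' (victim misclassifies the
    adversarial example) and 'detected' (a detector flags it)."""
    total = len(results)
    asr = sum(r["fooled"] for r in results) / total
    # assumed NASR: success only counts if the example also evades detection
    nasr = sum(r["fooled"] and not r["detected"] for r in results) / total
    return asr, nasr

results = [
    {"fooled": True, "detected": True},    # succeeds but is caught
    {"fooled": True, "detected": False},   # succeeds and evades detection
    {"fooled": False, "detected": False},  # fails outright
]
asr, nasr = attack_rates(results)
print(f"ASR={asr:.2f}, NASR={nasr:.2f}")  # ASR=0.67, NASR=0.33
```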

Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment

  • paper_url: http://arxiv.org/abs/2311.08596
  • repo_url: None
  • paper_authors: Philippe Laban, Lidiya Murakhovs’ka, Caiming Xiong, Chien-Sheng Wu
  • for: This paper aims to analyze the behavior of Large Language Models (LLMs) in multi-turn conversations and evaluate their ability to refine and improve their answers.
  • methods: The authors propose the FlipFlop experiment, which involves presenting an LLM with a prompt containing a classification task in the first round, and then challenging the model with a follow-up phrase in the second round to elicit a reflection on its initial answer.
  • results: The study finds that LLMs flip their answers on average 46% of the time and experience a drop in accuracy between their first and final predictions, with an average drop of 17%. The results demonstrate the universality of sycophantic behavior in LLMs and provide a robust framework for analyzing model behavior and evaluating potential solutions.
    Abstract The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM responds to a prompt containing a classification task. In a second round, the LLM is challenged with a follow-up phrase like "Are you sure?", offering an opportunity for the model to reflect on its initial answer, and decide whether to confirm or flip its answer. A systematic study of nine LLMs on seven classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17%. The FlipFlop experiment illustrates the universality of sycophantic behavior in LLMs and provides a robust framework to analyze model behavior and evaluate potential solutions.
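
The two-round protocol is easy to reproduce; below is a minimal sketch in which `ask_llm` is a stand-in for any chat-model call (an assumption, not the paper's code):

```python
def ask_llm(messages):
    """Stub: return the model's reply for a list of chat messages."""
    return "positive"  # replace with a real chat-model call

def flipflop_trial(task_prompt, challenge="Are you sure?"):
    history = [{"role": "user", "content": task_prompt}]
    first = ask_llm(history)
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content": challenge}]
    final = ask_llm(history)
    return first, final

def flip_rate(trials):
    """Fraction of trials whose final answer differs from the first."""
    return sum(first != final for first, final in trials) / len(trials)

trials = [flipflop_trial("Classify the sentiment: 'Great movie!'")]
print(flip_rate(trials))  # 0.0 with the stub; the paper reports ~46% on average
```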

ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models

  • paper_url: http://arxiv.org/abs/2311.08593
  • repo_url: None
  • paper_authors: Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith
  • for: This paper proposes a new end-to-end document retrieval approach that directly generates document identifiers from an input query.
  • methods: The method uses a large language model to generate abstractive keyphrases and composes each document's identifier from these keyphrases.
  • results: The approach improves top-10 and top-20 accuracy in end-to-end document retrieval over the previous state-of-the-art baseline.
    Abstract Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a new approach for end-to-end document retrieval that directly generates document identifiers given an input query. Techniques for designing effective, high-quality document IDs remain largely unexplored. We introduce ACID, in which each document's ID is composed of abstractive keyphrases generated by a large language model, rather than an integer ID sequence as done in past work. We compare our method with the current state-of-the-art technique for ID generation, which produces IDs through hierarchical clustering of document embeddings. We also examine simpler methods to generate natural-language document IDs, including the naive approach of using the first k words of each document as its ID or words with high BM25 scores in that document. We show that using ACID improves top-10 and top-20 accuracy by 15.6% and 14.4% (relative) respectively versus the state-of-the-art baseline on the MSMARCO 100k retrieval task, and 4.4% and 4.0% respectively on the Natural Questions 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs in generative retrieval with LMs. The code for reproducing our results and the keyword-augmented datasets will be released on formal publication.
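
The simpler natural-language ID baselines the paper compares against are straightforward to sketch; the BM25 term weight below is the standard textbook form, not the authors' implementation:

```python
import math
from collections import Counter

def first_k_id(doc, k=8):
    """Naive baseline: the first k words of the document as its ID."""
    return " ".join(doc.split()[:k])

def bm25_id(doc, docs, k=8, k1=1.5, b=0.75):
    """Baseline: the k terms of `doc` with the highest BM25 weights."""
    N = len(docs)
    avgdl = sum(len(d.split()) for d in docs) / N
    tf = Counter(doc.split())
    dl = sum(tf.values())
    def weight(term):
        df = sum(term in d.split() for d in docs)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        t = tf[term]
        return idf * t * (k1 + 1) / (t + k1 * (1 - b + b * dl / avgdl))
    return " ".join(sorted(tf, key=weight, reverse=True)[:k])

docs = ["the cat sat on the mat", "dogs chase cats in the park"]
print(first_k_id(docs[0], k=4))   # "the cat sat on"
print(bm25_id(docs[0], docs, k=3))
```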

PEMA: Plug-in External Memory Adaptation for Language Models

  • paper_url: http://arxiv.org/abs/2311.08590
  • repo_url: None
  • paper_authors: HyunJin Kim, Young Jin Kim, JinYeong Bak
  • for: This paper aims to improve the performance of pre-trained language models (PLMs) on diverse downstream NLP tasks while reducing the resources required for fine-tuning.
  • methods: It introduces Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) approach that combines LoRA-based weight matrices with an external memory. During inference, PEMA plugs into the context representations of test data, storing context representations generated by the PLM mapped to the desired target words.
  • results: Experiments on machine translation and style transfer with syntactic and real-world datasets show that PEMA is more memory- and latency-efficient than other PEFT methods, while better preserving sentence meaning and generating appropriate language and style.
    Abstract Pre-trained language models (PLMs) have demonstrated impressive performance across various downstream NLP tasks. Nevertheless, the resource requirements of pre-training large language models in terms of memory and training compute pose significant challenges. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning on specific tasks. To overcome the limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) approach designed for fine-tuning PLMs without the need for all weights. PEMA can be integrated into the context representation of test data during inference to execute downstream tasks. It leverages an external memory to store context representations generated by a PLM, mapped with the desired target word. Our method entails training LoRA-based weight matrices within the final layer of the PLM for enhanced efficiency. The probability is then interpolated with the next-word distribution from the PLM to perform downstream tasks. To improve the generation quality, we propose a novel interpolation strategy named Gradual Unrolling. To demonstrate the effectiveness of our proposed method, we conduct experiments to demonstrate the efficacy of PEMA with a syntactic dataset and assess its performance on machine translation and style transfer tasks using real datasets. PEMA outperforms other PEFT methods in terms of memory and latency efficiency for training and inference. Furthermore, it outperforms other baselines in preserving the meaning of sentences while generating appropriate language and styles.
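
A minimal sketch of the interpolation step follows. The decaying schedule standing in for Gradual Unrolling is an assumption; the paper's actual schedule may differ:

```python
import numpy as np

def gradual_unrolling_weight(step, max_steps, lam0=0.8):
    """Assumed schedule: rely on the external memory early in decoding,
    then decay toward the PLM's own distribution."""
    return lam0 * max(0.0, 1.0 - step / max_steps)

def interpolate(p_memory, p_plm, step, max_steps):
    lam = gradual_unrolling_weight(step, max_steps)
    p = lam * p_memory + (1.0 - lam) * p_plm
    return p / p.sum()

p_mem = np.array([0.7, 0.2, 0.1])  # next-word dist. from the memory lookup
p_plm = np.array([0.1, 0.3, 0.6])  # next-word dist. from the PLM
print(interpolate(p_mem, p_plm, step=0, max_steps=10))  # memory-heavy mix
```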

Asking More Informative Questions for Grounded Retrieval

  • paper_url: http://arxiv.org/abs/2311.08584
  • repo_url: None
  • paper_authors: Sedrick Keh, Justin T. Chiu, Daniel Fried
  • for: This work aims to improve a model's ability to gather information in grounded, multi-turn image identification tasks by asking more informative questions.
  • methods: Instead of the polar yes/no questions used in prior work, the approach formulates open-ended questions, and adds a presupposition-handling mechanism to address the presupposition errors that off-the-shelf VQA models make on such questions.
  • results: Experiments show that the method increases accuracy over the previous state of the art by 14%, while resulting in 48% more efficient games in human evaluations.
    Abstract When a model is trying to gather information in an interactive setting, it benefits from asking informative questions. However, in the case of a grounded multi-turn image identification task, previous studies have been constrained to polar yes/no questions, limiting how much information the model can gain in a single turn. We present an approach that formulates more informative, open-ended questions. In doing so, we discover that off-the-shelf visual question answering (VQA) models often make presupposition errors, which standard information gain question selection methods fail to account for. To address this issue, we propose a method that can incorporate presupposition handling into both question selection and belief updates. Specifically, we use a two-stage process, where the model first filters out images which are irrelevant to a given question, then updates its beliefs about which image the user intends. Through self-play and human evaluations, we show that our method is successful in asking informative open-ended questions, increasing accuracy over the past state-of-the-art by 14%, while resulting in 48% more efficient games in human evaluations.
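
The two-stage update can be sketched as follows, with `vqa_relevant` and `vqa_score` as stand-ins for a VQA model (assumed interfaces, not the paper's API):

```python
def update_beliefs(beliefs, question, answer, vqa_relevant, vqa_score):
    """beliefs: dict image_id -> prob. `vqa_relevant` and `vqa_score`
    stand in for a VQA model (assumed interfaces)."""
    new = {}
    for img, p in beliefs.items():
        if not vqa_relevant(img, question):  # stage 1: presupposition filter
            new[img] = 0.0
        else:                                # stage 2: belief update
            new[img] = p * vqa_score(img, question, answer)
    z = sum(new.values()) or 1.0
    return {img: p / z for img, p in new.items()}

beliefs = {"img1": 0.5, "img2": 0.5}
print(update_beliefs(beliefs, "What color is the car?", "red",
                     vqa_relevant=lambda img, q: img != "img2",  # img2: no car
                     vqa_score=lambda img, q, a: 0.9))
# {'img1': 1.0, 'img2': 0.0}
```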

Graph-Induced Syntactic-Semantic Spaces in Transformer-Based Variational AutoEncoders

  • paper_url: http://arxiv.org/abs/2311.08579
  • repo_url: None
  • paper_authors: Yingji Zhang, Marco Valentino, Danilo S. Carvalho, Ian Pratt-Hartmann, André Freitas
  • for: Improve the performance and generalisation of VAEs by injecting syntactic information.
  • methods: Separate the encoding of distributional semantic features and syntactic structures into heterogeneous latent spaces via multi-task learning or dual-encoder architectures; integrate graph-based and sequential models in the encoding stage, and inject multiple specialised latent representations into the decoder's attention mechanism via low-rank operators.
  • results: The end-to-end VAE architecture yields a better overall organisation of the latent space, alleviating information loss and improving performance on language modelling and downstream generation tasks.
    Abstract The injection of syntactic information in Variational AutoEncoders (VAEs) has been shown to result in an overall improvement of performances and generalisation. An effective strategy to achieve such a goal is to separate the encoding of distributional semantic features and syntactic structures into heterogeneous latent spaces via multi-task learning or dual encoder architectures. However, existing works employing such techniques are limited to LSTM-based VAEs. In this paper, we investigate latent space separation methods for structural syntactic injection in Transformer-based VAE architectures (i.e., Optimus). Specifically, we explore how syntactic structures can be leveraged in the encoding stage through the integration of graph-based and sequential models, and how multiple, specialised latent representations can be injected into the decoder's attention mechanism via low-rank operators. Our empirical evaluation, carried out on natural language sentences and mathematical expressions, reveals that the proposed end-to-end VAE architecture can result in a better overall organisation of the latent space, alleviating the information loss occurring in standard VAE setups, resulting in enhanced performances on language modelling and downstream generation tasks.

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

  • paper_url: http://arxiv.org/abs/2311.08562
  • repo_url: https://github.com/cathyxl/magic
  • paper_authors: Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See Kiong Ng, Jiashi Feng
  • for: This work aims to evaluate the abilities of large language models (LLMs) in multi-agent environments, including judgment, planning, collaboration, self-awareness, and rationality.
  • methods: It builds diverse testing environments from games such as Chameleon and Undercover and game-theory scenarios such as Cost Sharing, Multi-player Prisoner's Dilemma, and Public Good, and applies a Probabilistic Graphical Modeling (PGM) method to enhance the LLMs' handling of complex social and cognitive dimensions.
  • results: The benchmark reveals a capability gap of over threefold between the strongest model, GPT-4, and the weakest, Llama-2-70B, and shows that the PGM enhancement boosts the abilities of all selected models by 50% on average. Code is available at https://github.com/cathyxl/MAgIC.
    Abstract Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing, demonstrating exceptional capabilities in reasoning, tool usage, and memory. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework that captures their abilities in reasoning, planning, collaboration, and more. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings, providing quantitative metrics to evaluate their judgment, reasoning, deception, self-awareness, cooperation, coordination, and rationality. We utilize games such as Chameleon and Undercover, alongside game theory scenarios like Cost Sharing, Multi-player Prisoner's Dilemma, and Public Good, to create diverse testing environments. Our framework is fortified with the Probabilistic Graphical Modeling (PGM) method, enhancing the LLMs' capabilities in navigating complex social and cognitive dimensions. The benchmark evaluates seven multi-agent systems powered by different LLMs, quantitatively highlighting a significant capability gap over threefold between the strongest, GPT-4, and the weakest, Llama-2-70B. It also confirms that our PGM enhancement boosts the inherent abilities of all selected models by 50% on average. Our codes are released here https://github.com/cathyxl/MAgIC.

UT5: Pretraining Non autoregressive T5 with unrolled denoising

  • paper_url: http://arxiv.org/abs/2311.08552
  • repo_url: None
  • paper_authors: Mahmoud G. Salem, Jiayu Ye, Chu-Cheng Lin, Frederick Liu
  • for: This work aims to improve the natural language generation abilities of Transformer-based large language models, addressing the bottleneck that an autoregressive model needs K sequential forward passes to decode K tokens.
  • methods: It adopts a non-autoregressive (NAR) approach, applying unsupervised pretraining with unrolled denoising to T5 models to improve their performance on downstream generation tasks.
  • results: Unsupervised pretraining of non-autoregressive T5 achieves SoTA results on downstream generation tasks such as SQuAD question generation and XSum.
    Abstract Recent advances in Transformer-based Large Language Models have made great strides in natural language generation. However, to decode K tokens, an autoregressive model needs K sequential forward passes, which may be a performance bottleneck for large language models. Many non-autoregressive (NAR) research are aiming to address this sequentiality bottleneck, albeit many have focused on a dedicated architecture in supervised benchmarks. In this work, we studied unsupervised pretraining for non auto-regressive T5 models via unrolled denoising and shown its SoTA results in downstream generation tasks such as SQuAD question generation and XSum.
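
A hedged sketch of one unrolled-denoising training step, in the SUNDAE spirit of denoising the model's own samples; the corruption scheme and unroll depth here are assumptions, not the paper's exact recipe:

```python
import random

MASK = "[MASK]"

def corrupt(tokens, rate=0.5):
    """Randomly mask target tokens."""
    return [MASK if random.random() < rate else t for t in tokens]

def model_denoise(tokens):
    """Stub for a non-autoregressive denoiser that predicts every
    position in a single forward pass; replace with a T5-style model."""
    return tokens

def unrolled_denoising_step(target_tokens, unroll=2):
    """Each unroll denoises the model's own previous output, so training
    also sees model-made inputs rather than only corrupted gold tokens."""
    inputs = corrupt(target_tokens)
    pairs = []
    for _ in range(unroll):
        prediction = model_denoise(inputs)
        pairs.append((prediction, target_tokens))  # loss targets per step
        inputs = prediction
    return pairs

print(unrolled_denoising_step("when was the eiffel tower built ?".split()))
```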

Efficient Continual Pre-training for Building Domain Specific Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08545
  • repo_url: None
  • paper_authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad
  • for: This work targets the development of domain-specific large language models (LLMs).
  • methods: It uses domain-adaptive continual pre-training to build domain-specific LLMs, introducing FinPythia-6.9B for the financial domain.
  • results: Continual pre-training yields consistent improvements on financial tasks over the original foundational model, and simple yet effective data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without degrading performance on open-domain standard tasks.
    Abstract Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. Continual pre-trained FinPythia showcases consistent improvements on financial tasks over the original foundational model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperforms vanilla continual pre-training's performance with just 10% of corpus size and cost, without any degradation on open-domain standard tasks. Our work proposes an alternative solution to building domain-specific LLMs from scratch in a cost-effective manner.

Extending Multilingual Machine Translation through Imitation Learning

  • paper_url: http://arxiv.org/abs/2311.08538
  • repo_url: None
  • paper_authors: Wen Lai, Viktor Hangya, Alexander Fraser
  • for: Extend large-scale multilingual neural machine translation (MNMT) models to a new language, enabling translation between the newly added language and all already-supported languages, using only a parallel corpus between the new language and English.
  • methods: The proposed Imit-MNMT treats the task as imitation learning: it constructs a pseudo multi-parallel corpus of the new and original languages by pivoting through English and imitates the output distribution of the original MNMT model.
  • results: The approach significantly improves translation between the new and original languages without severe catastrophic forgetting, and also mitigates the copy and off-target problems common in current large-scale MNMT models.
    Abstract Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world's languages are still being left behind. We aim to extend large-scale MNMT models to a new language, allowing for translation between the newly added and all of the already supported languages in a challenging scenario: using only a parallel corpus between the new language and English. Previous approaches, such as continued training on parallel data including the new language, suffer from catastrophic forgetting (i.e., performance on other languages is reduced). Our novel approach Imit-MNMT treats the task as an imitation learning process, which mimicks the behavior of an expert, a technique widely used in the computer vision area, but not well explored in NLP. More specifically, we construct a pseudo multi-parallel corpus of the new and the original languages by pivoting through English, and imitate the output distribution of the original MNMT model. Extensive experiments show that our approach significantly improves the translation performance between the new and the original languages, without severe catastrophic forgetting. We also demonstrate that our approach is capable of solving copy and off-target problems, which are two common issues existence in current large-scale MNMT models.
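
The pivot construction can be sketched as follows, with `translate` standing in for the original (expert) MNMT model:

```python
def translate(text, src, tgt):
    """Stub for the original (expert) MNMT model; replace with a real call."""
    return f"<{tgt}> {text}"

def build_pseudo_corpus(new_en_pairs, supported_langs):
    """Pivot through English: pair each new-language sentence with expert
    translations of its English side into every supported language."""
    corpus = []
    for new_sent, en_sent in new_en_pairs:
        row = {"new": new_sent, "en": en_sent}
        for lang in supported_langs:
            row[lang] = translate(en_sent, src="en", tgt=lang)
        corpus.append(row)
    return corpus

pairs = [("guten Morgen", "good morning")]  # toy (new-language, English) pair
print(build_pseudo_corpus(pairs, ["fr", "es"]))
```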

Natural Language Processing for Financial Regulation

  • paper_url: http://arxiv.org/abs/2311.08533
  • repo_url: https://github.com/mdxedia/Awsome-Cash
  • paper_authors: Ixandra Achitouv, Dragos Gorduza, Antoine Jacquier
  • for: This work applies natural language processing to financial regulation, framing semantic matching search between rules and policies when no dataset is available for supervised learning.
  • methods: It builds on the key building blocks of natural language processing, explaining the mathematical concepts behind them and improving on pre-trained models using freely available resources.
  • results: The approach outperforms simple pre-trained sentence-transformer models for rule-policy semantic matching.
    Abstract This article provides an understanding of Natural Language Processing techniques in the framework of financial regulation, more specifically in order to perform semantic matching search between rules and policy when no dataset is available for supervised learning. We outline how to outperform simple pre-trained sentences-transformer models using freely available resources and explain the mathematical concepts behind the key building blocks of Natural Language Processing.

CoRE-CoG: Conversational Recommendation of Entities using Constrained Generation

  • paper_url: http://arxiv.org/abs/2311.08511
  • repo_url: None
  • paper_authors: Harshvardhan Srivastava, Kanav Pruthi, Soumen Chakrabarti, Mausam
  • for: Improve the accuracy and fluency of conversational recommendation systems (CRS), addressing three key challenges of prior systems: (1) deciding at each turn whether recommending a knowledge base (KB) entity is appropriate, (2) identifying the most relevant KB entity to recommend, and (3) recommending the entity in a fluent utterance consistent with the conversation history.
  • methods: CoRE-CoG addresses these challenges with three modules: (1) a recommendation trigger that decides whether the system utterance should include an entity, (2) a type pruning module that improves the relevance of recommended entities, and (3) a novel constrained response generator that makes accurate recommendations while maintaining fluency.
  • results: On recent benchmarks, CoRE-CoG gains close to 10 F1 and 4 Recall@1 percentage points over baselines on conditional generation sub-tasks.
    Abstract End-to-end conversational recommendation systems (CRS) generate responses by leveraging both dialog history and a knowledge base (KB). A CRS mainly faces three key challenges: (1) at each turn, it must decide if recommending a KB entity is appropriate; if so, it must identify the most relevant KB entity to recommend; and finally, it must recommend the entity in a fluent utterance that is consistent with the conversation history. Recent CRSs do not pay sufficient attention to these desiderata, often generating unfluent responses or not recommending (relevant) entities at the right turn. We introduce a new CRS we call CoRE-CoG. CoRE-CoG addresses the limitations in prior systems by implementing (1) a recommendation trigger that decides if the system utterance should include an entity, (2) a type pruning module that improves the relevance of recommended entities, and (3) a novel constrained response generator to make recommendations while maintaining fluency. Together, these modules ensure simultaneous accurate recommendation decisions and fluent system utterances. Experiments with recent benchmarks show the superiority particularly on conditional generation sub-tasks with close to 10 F1 and 4 Recall@1 percent points gain over baselines.

Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning

  • paper_url: http://arxiv.org/abs/2311.08505
  • repo_url: None
  • paper_authors: Xin Su, Tiep Le, Steven Bethard, Phillip Howard
  • for: Improve the performance of large language models on knowledge-intensive tasks.
  • methods: A semi-structured prompting approach that seamlessly integrates the model's parametric memory with unstructured knowledge from text documents and structured knowledge from knowledge graphs.
  • results: On open-domain multi-hop question answering, the method significantly surpasses existing techniques, even those requiring fine-tuning.
    Abstract An important open question pertaining to the use of large language models for knowledge-intensive tasks is how to effectively integrate knowledge from three sources: the model's parametric memory, external structured knowledge, and external unstructured knowledge. Most existing prompting methods either rely solely on one or two of these sources, or require repeatedly invoking large language models to generate similar or identical content. In this work, we overcome these limitations by introducing a novel semi-structured prompting approach that seamlessly integrates the model's parametric memory with unstructured knowledge from text documents and structured knowledge from knowledge graphs. Experimental results on open-domain multi-hop question answering datasets demonstrate that our prompting method significantly surpasses existing techniques, even exceeding those which require fine-tuning.
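
A minimal sketch of what such a semi-structured prompt might look like; the template wording is illustrative, not the paper's exact format:

```python
def build_prompt(question, passages, triples):
    """Combine retrieved passages (unstructured) and KG triples
    (structured) with an instruction to also use parametric memory."""
    kg = "\n".join(f"({s}, {r}, {o})" for s, r, o in triples)
    docs = "\n".join(f"- {p}" for p in passages)
    return (f"Question: {question}\n\n"
            f"Relevant passages:\n{docs}\n\n"
            f"Knowledge graph facts:\n{kg}\n\n"
            "Using the passages, the facts, and what you already know, "
            "answer step by step.")

print(build_prompt(
    "Where was the director of Inception born?",
    ["Inception is a 2010 film directed by Christopher Nolan."],
    [("Christopher Nolan", "born_in", "London")]))
```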

Functionality learning through specification instructions

  • paper_url: http://arxiv.org/abs/2311.08481
  • repo_url: None
  • paper_authors: Pedro Henrique Luz de Araujo, Benjamin Roth
  • for: This work studies functionality learning for NLP models without fine-tuning on test-suite data.
  • methods: For each functionality in a test suite, the authors generate a specification instruction that encodes it, combine the obtained instructions into specification-augmented prompts, and feed these to language models pre-trained on natural instruction data to generate suite predictions.
  • results: Across four tasks and models ranging from 80M to 175B parameters, smaller models (< 3B params) struggle to follow specification instructions, whereas larger models (> 3B params) benefit from specifications and can even generalize desirable behaviors across functionalities.
    Abstract Test suites assess natural language processing models' performance on specific functionalities: cases of interest involving model robustness, fairness, or particular linguistic capabilities. They enable fine-grained evaluations of model aspects that would otherwise go unnoticed in standard evaluation datasets, but they do not address the problem of how to fix the failure cases. Previous work has explored functionality learning by fine-tuning models on suite data. While this improves performance on seen functionalities, it often does not generalize to unseen ones and can harm general performance. This paper analyses a fine-tuning-free approach to functionality learning. For each functionality in a suite, we generate a specification instruction that encodes it. We combine the obtained specification instructions to create specification-augmented prompts, which we feed to language models pre-trained on natural instruction data to generate suite predictions. A core aspect of our analysis is to measure the effect that including a set of specifications has on a held-out set of unseen, qualitatively different specifications. Our experiments across four tasks and models ranging from 80M to 175B parameters show that smaller models struggle to follow specification instructions. However, larger models (> 3B params.) can benefit from specifications and even generalize desirable behaviors across functionalities.
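
A minimal sketch of specification-augmented prompting; the example specifications are illustrative, not drawn from the paper's suites:

```python
specifications = [  # illustrative functionality specifications
    "Negating a positive statement flips its sentiment.",
    "Replacing one person's name with another must not change the label.",
]

def specification_augmented_prompt(task_instruction, example):
    spec_block = "\n".join(f"- {s}" for s in specifications)
    return (f"{task_instruction}\n\nFollow these specifications:\n"
            f"{spec_block}\n\nInput: {example}\nLabel:")

print(specification_augmented_prompt(
    "Classify the sentiment of the input as positive or negative.",
    "The food was not good."))
```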

Selecting Shots for Demographic Fairness in Few-Shot Learning with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08472
  • repo_url: None
  • paper_authors: Carlos Aguirre, Kuleen Sasse, Isabel Cachola, Mark Dredze
  • for: This paper explores the fairness of large language models (LLMs) as few-shot NLP classification systems, specifically how different shot selection strategies affect model fairness.
  • methods: The paper evaluates LLMs on standard NLP tasks across three fairness datasets, considering both existing and new demographically sensitive methods for shot selection.
  • results: The paper shows how different shot selection strategies affect the fairness of LLMs as NLP classification systems and discusses how future work can incorporate LLM fairness evaluations.
    Abstract Recently, work in NLP has shifted to few-shot (in-context) learning, with large language models (LLMs) performing well across a range of tasks. However, while fairness evaluations have become a standard for supervised methods, little is known about the fairness of LLMs as prediction systems. Further, common standard methods for fairness involve access to models weights or are applied during finetuning, which are not applicable in few-shot learning. Do LLMs exhibit prediction biases when used for standard NLP tasks? In this work, we explore the effect of shots, which directly affect the performance of models, on the fairness of LLMs as NLP classification systems. We consider how different shot selection strategies, both existing and new demographically sensitive methods, affect model fairness across three standard fairness datasets. We discuss how future work can include LLM fairness evaluations.
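
One plausible demographically sensitive strategy of the kind studied is stratified shot selection, sketched below (an illustration, not the paper's exact method):

```python
import random

def stratified_shots(pool, k):
    """Cycle through demographic groups so each is represented among
    the k in-context examples. pool: dicts with 'text', 'label', 'group'."""
    by_group = {}
    for ex in pool:
        by_group.setdefault(ex["group"], []).append(ex)
    groups, shots = list(by_group), []
    while len(shots) < k:
        g = groups[len(shots) % len(groups)]
        shots.append(random.choice(by_group[g]))
    return shots

pool = [{"text": f"example {i}", "label": i % 2, "group": g}
        for i, g in enumerate(["A", "A", "B", "C"])]
print([s["group"] for s in stratified_shots(pool, k=3)])  # ['A', 'B', 'C']
```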

UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations

  • paper_url: http://arxiv.org/abs/2311.08469
  • repo_url: None
  • paper_authors: Wenting Zhao, Justin T Chiu, Jena D. Hwang, Faeze Brahman, Jack Hessel, Sanjiban Choudhury, Yejin Choi, Xiang Lorraine Li, Alane Suhr
  • for: Investigate reasoning about unusual, unexpected, and unlikely situations (uncommonsense abductive reasoning).
  • methods: Given a context with an unexpected outcome, human explainers and large language models generate natural language explanations that make the unexpected outcome more likely; the authors curate and release a new English corpus, UNcommonsense, and train open, accessible models with online imitation learning algorithms.
  • results: Model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity, and online imitation learning consistently reduces loss rates on both common and uncommonsense abductive reasoning, as judged by human evaluators.
    Abstract Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexpected outcome, this task requires reasoning abductively to generate a natural language explanation that makes the unexpected outcome more likely in the context. To this end, we curate and release a new English language corpus called UNcommonsense. We characterize the differences between the performance of human explainers and the best performing large language models, finding that model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity. Finally, we experiment with several online imitation learning algorithms to train open and accessible language models on this task. When compared with the vanilla supervised fine-tuning approach, these methods consistently reduce lose rates on both common and uncommonsense abductive reasoning judged by human evaluators.

Retrieve and Copy: Scaling ASR Personalization to Large Catalogs

  • paper_url: http://arxiv.org/abs/2311.08402
  • repo_url: None
  • paper_authors: Sai Muralidhar Jayanthi, Devang Kulshreshtha, Saket Dingliwal, Srikanth Ronanki, Sravan Bodapati
  • for: Improve personalization of automatic speech recognition (ASR) systems so that rare words and domain-specific entities are recognized more accurately.
  • methods: Building on attention-based contextual biasing, the paper proposes a "Retrieve and Copy" mechanism that improves latency while retaining accuracy at scale, along with a training strategy that counters the degradation in recall caused by an increased number of confusing entities in large catalogs.
  • results: The approach achieves up to 6% more Word Error Rate reduction (WERR) and a 3.6% absolute improvement in F1 over a strong baseline, supports catalog sizes of up to 20K without significantly affecting WER or F1, and delivers at least a 20% inference speedup per acoustic frame.
    Abstract Personalization of automatic speech recognition (ASR) models is a widely studied topic because of its many practical applications. Most recently, attention-based contextual biasing techniques are used to improve the recognition of rare words and domain specific entities. However, due to performance constraints, the biasing is often limited to a few thousand entities, restricting real-world usability. To address this, we first propose a "Retrieve and Copy" mechanism to improve latency while retaining the accuracy even when scaled to a large catalog. We also propose a training strategy to overcome the degradation in recall at such scale due to an increased number of confusing entities. Overall, our approach achieves up to 6% more Word Error Rate reduction (WERR) and 3.6% absolute improvement in F1 when compared to a strong baseline. Our method also allows for large catalog sizes of up to 20K without significantly affecting WER and F1-scores, while achieving at least 20% inference speedup per acoustic frame.
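
The "retrieve" half can be sketched as nearest-neighbour filtering over catalog embeddings, so the biasing ("copy") attention only sees a handful of candidates instead of the full catalog; the encoders producing the embeddings are assumed stubs:

```python
import numpy as np

def retrieve_top_k(context_emb, catalog_embs, k=100):
    """Return indices of the k catalog entries most cosine-similar to
    the current context embedding."""
    q = context_emb / np.linalg.norm(context_emb)
    c = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

catalog = np.random.randn(20000, 64)  # e.g. a 20K-entity catalog
context = np.random.randn(64)
print(retrieve_top_k(context, catalog, k=100).shape)  # (100,)
```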

A Material Lens on Coloniality in NLP

  • paper_url: http://arxiv.org/abs/2311.08391
  • repo_url: None
  • paper_authors: William Held, Camille Harris, Michael Best, Diyi Yang
  • for: This paper examines the extent of coloniality in natural language processing (NLP) and proposes ways to address it.
  • methods: It applies Actor-Network Theory (ANT) to analyze the network of relationships between human stakeholders and NLP data, algorithms, and software, and conducts a quantitative survey of the geography of different phases of NLP research.
  • results: The survey provides evidence that inequality along colonial boundaries increases as NLP builds on itself; the authors argue that combating coloniality in NLP requires not only changing current values but also actively removing the accumulation of colonial ideals from foundational data and algorithms.
    Abstract Coloniality, the continuation of colonial harms beyond "official" colonization, has pervasive effects across society and scientific fields. Natural Language Processing (NLP) is no exception to this broad phenomenon. In this work, we argue that coloniality is implicitly embedded in and amplified by NLP data, algorithms, and software. We formalize this analysis using Actor-Network Theory (ANT): an approach to understanding social phenomena through the network of relationships between human stakeholders and technology. We use our Actor-Network to guide a quantitative survey of the geography of different phases of NLP research, providing evidence that inequality along colonial boundaries increases as NLP builds on itself. Based on this, we argue that combating coloniality in NLP requires not only changing current values but also active work to remove the accumulation of colonial ideals in our foundational data and algorithms.

On What Basis? Predicting Text Preference Via Structured Comparative Reasoning

  • paper_url: http://arxiv.org/abs/2311.08390
  • repo_url: None
  • paper_authors: Jing Nathan Yan, Tianqi Liu, Justin T Chiu, Jiaming Shen, Zhen Qin, Yue Yu, Yao Zhao, Charu Lakshmanan, Yair Kurzion, Alexander M. Rush, Jialu Liu, Michael Bendersky
  • for: Improve the accuracy of text preference prediction in natural language processing (NLP).
  • methods: SC, a prompting approach that predicts text preferences via structured intermediate comparisons: it first proposes aspects of comparison, then generates textual comparisons under each aspect, and uses a pairwise consistency comparator to keep only comparisons that clearly distinguish differences between texts, reducing hallucination and improving consistency.
  • results: Across NLP tasks including summarization, retrieval, and automatic rating, SC equips LLMs to achieve state-of-the-art performance in text preference prediction.
    Abstract Comparative reasoning plays a crucial role in text preference prediction; however, large language models (LLMs) often demonstrate inconsistencies in their reasoning. While approaches like Chain-of-Thought improve accuracy in many other settings, they struggle to consistently distinguish the similarities and differences of complex texts. We introduce SC, a prompting approach that predicts text preferences by generating structured intermediate comparisons. SC begins by proposing aspects of comparison, followed by generating textual comparisons under each aspect. We select consistent comparisons with a pairwise consistency comparator that ensures each aspect's comparisons clearly distinguish differences between texts, significantly reducing hallucination and improving consistency. Our comprehensive evaluations across various NLP tasks, including summarization, retrieval, and automatic rating, demonstrate that SC equips LLMs to achieve state-of-the-art performance in text preference prediction.
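
The pairwise consistency comparator can be sketched as an order-swap check: a comparison is kept only when swapping the two texts swaps the verdict. `compare` stands in for an LLM judgment (the toy version simply prefers the longer text):

```python
def compare(aspect, text_a, text_b):
    """Stub for an LLM judgment returning "A" or "B" for the better text
    on `aspect`; this toy version simply prefers the longer text."""
    return "A" if len(text_a) > len(text_b) else "B"

def consistent_comparisons(aspects, text_a, text_b):
    """Keep a comparison only if swapping the texts swaps the verdict."""
    kept = {}
    for aspect in aspects:
        forward = compare(aspect, text_a, text_b)
        backward = compare(aspect, text_b, text_a)
        if forward != backward:  # order-invariant, hence consistent
            kept[aspect] = forward
    return kept

print(consistent_comparisons(["coverage", "fluency"],
                             "a longer candidate summary", "short"))
# {'coverage': 'A', 'fluency': 'A'}
```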

ChOiRe: Characterizing and Predicting Human Opinions with Chain of Opinion Reasoning

  • paper_url: http://arxiv.org/abs/2311.08385
  • repo_url: https://github.com/dxlong2000/ChOiRe
  • paper_authors: Xuan Long Do, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen
  • for: Predicting human opinions by aligning language models with the user's explicit and implicit personae.
  • methods: A four-step solution framework: (1) an LM analyzes the user's explicit personae (demographic or ideological attributes) to filter out irrelevant attributes; (2) the LM ranks the implicit persona opinions into a preferential list; (3) Chain-of-Opinion (CoO) reasoning sequentially analyzes the explicit personae and the most relevant implicit personae to predict the opinion; (4) CoO is executed multiple times with increasingly larger lists of implicit personae to overcome insufficient personae information.
  • results: ChOiRe improves on previous LLM-based techniques by 3.22% with limited inference calls.
    Abstract Aligning language models (LMs) with human opinion is challenging yet vital to enhance their grasp of human values, preferences, and beliefs. We present ChOiRe, a four-step solution framework to predict human opinion that differentiates between the user explicit personae (i.e. demographic or ideological attributes) that are manually declared and implicit personae inferred from user historical opinions. Specifically, it consists of (i) an LM analyzing the user explicit personae to filter out irrelevant attributes; (ii) the LM ranking the implicit persona opinions into a preferential list; (iii) Chain-of-Opinion (CoO) reasoning, where the LM sequentially analyzes the explicit personae and the most relevant implicit personae to perform opinion prediction; (iv) and where ChOiRe executes Step (iii) CoO multiple times with increasingly larger lists of implicit personae to overcome insufficient personae information to infer a final result. ChOiRe achieves new state-of-the-art effectiveness with limited inference calls, improving previous LLM-based techniques significantly by 3.22%.

Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding

  • paper_url: http://arxiv.org/abs/2311.08380
  • repo_url: None
  • paper_authors: Guangyu Yang, Jinghong Chen, Weizhe Lin, Bill Byrne
  • for: Improve the translation performance of Multilingual Large Language Models (MLLMs).
  • methods: Use a recently developed reinforcement learning technique, Direct Preference Optimization (DPO), to fine-tune MLLMs so that they obtain the gains of Minimum Bayes Risk (MBR) decoding without the additional computation at inference.
  • results: On multiple NMT test sets, the fine-tuned models significantly outperform base MLLMs without preference optimization, using only relatively small monolingual fine-tuning sets.
    Abstract Minimum Bayes Risk (MBR) decoding can significantly improve translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive and in this paper, we show how recently developed Reinforcement Learning (RL) technique, Direct Preference Optimization (DPO) can be used to fine-tune MLLMs so that we get the gains from MBR without the additional computation in inference. Our fine-tuned models have significantly improved performance on multiple NMT test sets compared to base MLLMs without preference optimization. Our method boosts the translation performance of MLLMs using relatively small monolingual fine-tuning sets.
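
For reference, MBR decoding itself is compact: sample candidates, score each against the others with a utility, and keep the candidate with the highest total (equivalently average) utility. The toy token-overlap utility below stands in for BLEU/COMET-style metrics:

```python
def mbr_decode(candidates, utility):
    """Pick the candidate with the highest total utility against all
    other sampled candidates (its expected utility under the model)."""
    def expected_utility(i):
        return sum(utility(candidates[i], candidates[j])
                   for j in range(len(candidates)) if j != i)
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

def overlap_f1(hyp, ref):
    """Toy token-overlap F1; a stand-in for BLEU/COMET-style utilities."""
    h, r = set(hyp.split()), set(ref.split())
    inter = len(h & r)
    if not inter:
        return 0.0
    p, q = inter / len(h), inter / len(r)
    return 2 * p * q / (p + q)

print(mbr_decode(["the cat sat", "a cat sat", "dog ran"], overlap_f1))
```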

A Ship of Theseus: Curious Cases of Paraphrasing in LLM-Generated Texts

  • paper_url: http://arxiv.org/abs/2311.08374
  • repo_url: None
  • paper_authors: Nafis Irtiza Tripto, Saranya Venkatraman, Dominik Macko, Robert Moro, Ivan Srba, Adaku Uchendu, Thai Le, Dongwon Lee
  • for: investigate whether a text retains its original authorship when it undergoes numerous paraphrasing iterations using Large Language Models (LLMs)
  • methods: employing LLMs to paraphrase the text and examining the resulting text for retention of original authorship
  • results: a philosophical inquiry into the determination of authorship in instances where LLMs or similar paraphrasing tools are employed to rephrase the text, with a focus on whether authorship should be attributed to the original human author or the AI-powered tool
    Abstract In the realm of text manipulation and linguistic transformation, the question of authorship has always been a subject of fascination and philosophical inquiry. Much like the Ship of Theseus paradox, which ponders whether a ship remains the same when each of its original planks is replaced, our research delves into an intriguing question: Does a text retain its original authorship when it undergoes numerous paraphrasing iterations? Specifically, since Large Language Models (LLMs) have demonstrated remarkable proficiency in the generation of both original content and the modification of human-authored texts, a pivotal question emerges concerning the determination of authorship in instances where LLMs or similar paraphrasing tools are employed to rephrase the text. This inquiry revolves around whether authorship should be attributed to the original human author or the AI-powered tool, given the tool's independent capacity to produce text that closely resembles human-generated content. Therefore, we embark on a philosophical voyage through the seas of language and authorship to unravel this intricate puzzle.

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08370
  • repo_url: None
  • paper_authors: Bertie Vidgen, Hannah Rose Kirk, Rebecca Qian, Nino Scherrer, Anand Kannappan, Scott A. Hale, Paul Röttger
  • for: This paper provides developers and businesses with a test suite for rapidly and systematically identifying critical safety risks in large language models (LLMs).
  • methods: It introduces SimpleSafetyTests, a suite of 100 test prompts across five harm areas, probing whether LLMs comply with malicious instructions, provide unsafe advice, or generate toxic content.
  • results: Testing 11 popular open LLMs reveals critical safety weaknesses: most models respond unsafely on more than 20% of cases, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces unsafe responses but does not eliminate them entirely.
    Abstract The past year has seen rapid acceleration in the development of large language models (LLMs). For many tasks, there is now a wide range of open-source and open-access LLMs that are viable alternatives to proprietary models like ChatGPT. Without proper steering and safeguards, however, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. This is a critical safety risk for businesses and developers. We introduce SimpleSafetyTests as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 popular open LLMs and find critical safety weaknesses in several of them. While some LLMs do not give a single unsafe response, most models we test respond unsafely on more than 20% of cases, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. We recommend that developers use such system prompts as a first line of defence against critical safety risks.
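
An evaluation harness over such a suite is simple to sketch; `generate` and `is_unsafe` are stubs for the model under test and the safety judgment (the paper uses expert human annotation):

```python
SAFETY_PROMPT = ("You are a helpful assistant. Refuse any request that "
                 "could cause harm.")

def generate(prompt, system=None):
    """Stub for the model under test; replace with a real model call."""
    return "I can't help with that."

def is_unsafe(response):
    """Stub safety judgment; the paper uses expert human annotation."""
    return "can't help" not in response

def unsafe_rate(test_prompts, use_system_prompt=False):
    system = SAFETY_PROMPT if use_system_prompt else None
    flags = [is_unsafe(generate(p, system=system)) for p in test_prompts]
    return sum(flags) / len(flags)

suite = ["<harmful prompt 1>", "<harmful prompt 2>"]  # placeholder prompts
print(unsafe_rate(suite), unsafe_rate(suite, use_system_prompt=True))
```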

How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection

  • paper_url: http://arxiv.org/abs/2311.08369
  • repo_url: None
  • paper_authors: Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
  • for: This work investigates the inconsistent performance of existing LLM-generated-text detectors when the generation instruction includes task-oriented constraints, focusing on student essay writing as a realistic domain.
  • methods: Using existing detectors of LLM-generated text, the authors manually create a task-oriented constraint for each essay-quality factor from Ke and Ng (2019) and measure how including such constraints in the instruction affects detection.
  • results: The detection performance variance across instructions with different task-oriented constraints is up to 20 times larger than the variance caused by generating texts multiple times or paraphrasing the instruction, calling for further research on detectors robust to such distributional shifts.
    Abstract Against the misuse (e.g., plagiarism or spreading misinformation) of Large Language Models (LLMs), many recent works have presented LLM-generated-text detectors with promising detection performance. Spotlighting a situation where users instruct LLMs to generate texts (e.g., essay writing), there are various ways to write the instruction (e.g., what task-oriented constraint to include). In this paper, we discover that even a task-oriented constraint in instruction can cause the inconsistent performance of current detectors to the generated texts. Specifically, we focus on student essay writing as a realistic domain and manually create the task-oriented constraint for each factor on essay quality by Ke and Ng (2019). Our experiment shows that the detection performance variance of the current detector on texts generated by instruction with each task-oriented constraint is up to 20 times larger than the variance caused by generating texts multiple times and paraphrasing the instruction. Our finding calls for further research on developing robust detectors that can detect such distributional shifts caused by a task-oriented constraint in the instruction.

Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques

  • paper_url: http://arxiv.org/abs/2311.08349
  • repo_url: None
  • paper_authors: Laida Kushnareva, Tatiana Gaintseva, German Magai, Serguei Barannikov, Dmitry Abulkhanov, Kristian Kuznetsov, Irina Piontkovskaya, Sergey Nikolenko
  • for: This work addresses the problem of detecting the boundary between the human-written and machine-generated parts of texts that start out written by a human and continue as output of a large language model.
  • methods: The authors consider and compare a number of approaches to this artificial text boundary detection problem, evaluating several predictors over features of different nature, including perplexity-based methods and features extracted from a frozen language model's embeddings.
  • results: Supervised fine-tuning of RoBERTa works well in general but fails to generalize in cross-domain and cross-generator settings, tending to overfit spurious properties of the data; the proposed approaches based on frozen language model embeddings outperform both the human accuracy level and previously considered baselines on the Real or Fake Text benchmark.
    Abstract Due to the rapid development of text generation models, people increasingly often encounter texts that may start out as written by a human but then continue as machine-generated results of large language models. Detecting the boundary between human-written and machine-generated parts of such texts is a very challenging problem that has not received much attention in literature. In this work, we consider and compare a number of different approaches for this artificial text boundary detection problem, comparing several predictors over features of different nature. We show that supervised fine-tuning of the RoBERTa model works well for this task in general but fails to generalize in important cross-domain and cross-generator settings, demonstrating a tendency to overfit to spurious properties of the data. Then, we propose novel approaches based on features extracted from a frozen language model's embeddings that are able to outperform both the human accuracy level and previously considered baselines on the Real or Fake Text benchmark. Moreover, we adapt perplexity-based approaches for the boundary detection task and analyze their behaviour. We analyze the robustness of all proposed classifiers in cross-domain and cross-model settings, discovering important properties of the data that can negatively influence the performance of artificial text boundary detection algorithms.

MC^2: A Multilingual Corpus of Minority Languages in China

  • paper_url: http://arxiv.org/abs/2311.08348
  • repo_url: https://github.com/luciusssss/mc2_corpus
  • paper_authors: Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng
  • for: Improve the accessibility of minority languages in China.
  • methods: A quality-centric data collection solution that prioritizes accuracy and quality while enhancing representativeness and diversity.
  • results: The authors release MC^2, the largest open-source corpus so far of four underrepresented languages in China, and identify the new research challenges it raises, such as long-text modeling and the multiplicity of writing systems.
    Abstract Large-scale corpora play a vital role in the construction of large language models (LLMs). However, existing LLMs exhibit limited abilities in understanding low-resource languages, including the minority languages in China, due to a lack of training data. To improve the accessibility of these languages, we present MC^2, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus so far. It encompasses four underrepresented languages, i.e., Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script. Notably, two writing systems in MC^2 are long neglected in previous corpora. As we identify serious contamination in the low-resource language split in the existing multilingual corpora, we propose a quality-centric solution for collecting MC^2, prioritizing quality and accuracy while enhancing representativeness and diversity. By in-depth analysis, we demonstrate the new research challenges MC^2 brings, such as long-text modeling and multiplicity of writing systems. We hope MC^2 can help enhance the equity of the underrepresented languages in China and provide a reliable data foundation for further research on low-resource languages.

KTRL+F: Knowledge-Augmented In-Document Search

  • paper_url: http://arxiv.org/abs/2311.08329
  • repo_url: https://github.com/hanseokoh/ktrlf
  • paper_authors: Hanseok Oh, Haebin Shin, Miyoung Ko, Hyunji Lee, Minjoon Seo
  • for: Introducing KTRL+F, a knowledge-augmented in-document search task that requires real-time identification of all semantic targets in a document while drawing on external knowledge to bridge the semantic gap between query and targets.
  • methods: Analysis of various baselines for KTRL+F, which reveals limitations of existing models such as hallucinations, high latency, and difficulty incorporating external knowledge; the authors then propose a knowledge-augmented phrase retrieval model that balances speed and performance.
  • results: A user study shows that solving KTRL+F improves the search experience: users need fewer queries and fewer visits to external sources to collect evidence, indicating more efficient in-document information access.
    Abstract We introduce a new problem KTRL+F, a knowledge-augmented in-document search task that necessitates real-time identification of all semantic targets within a document with the awareness of external sources through a single natural query. This task addresses the following unique challenges for in-document search: 1) utilizing knowledge outside the document for extended use of additional information about targets to bridge the semantic gap between the query and the targets, and 2) balancing real-time applicability with performance. We analyze various baselines in KTRL+F and find there are limitations of existing models, such as hallucinations, high latency, or difficulties in leveraging external knowledge. Therefore we propose a Knowledge-Augmented Phrase Retrieval model that shows a promising balance between speed and performance by simply augmenting external knowledge embedding in phrase embedding. Additionally, we conduct a user study to verify whether solving KTRL+F can enhance the search experience of users. It demonstrates that even with our simple model users can reduce the time for searching with fewer queries and fewer extra visits to other sources for collecting evidence. We encourage the research community to work on KTRL+F to enhance more efficient in-document information access.

Open-vocabulary keyword spotting in any language through multilingual contrastive speech-phoneme pretraining

  • paper_url: http://arxiv.org/abs/2311.08323
  • repo_url: None
  • paper_authors: Jian Zhu, Farhan Samir, Changbing Yang, Jahurul Islam
  • for: Building a massively multilingual speech corpus with fine-grained phonemic transcriptions, covering more than 115 languages, and proposing a multilingual phoneme-speech contrastive embedding model for open-vocabulary matching between speech signals and phonemically transcribed keywords or arbitrary phrases.
  • methods: The model, CLAP-IPA, takes fine-grained phonemic transcriptions as input and is trained contrastively, enabling open-vocabulary keyword matching evaluated across 97 unseen languages.
  • results: Using phonemes as modeling units yields much better cross-linguistic generalization than orthographic text, as demonstrated on two fieldwork speech corpora.
    Abstract In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic transcriptions, encompassing more than 115 languages from diverse language families. Based on this multilingual dataset, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between speech signals and phonemically transcribed keywords or arbitrary phrases. The proposed model has been tested on two fieldwork speech corpora in 97 unseen languages, exhibiting strong generalizability across languages. Comparison with a text-based model shows that using phonemes as modeling units enables much better crosslinguistic generalization than orthographic texts.
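At its core, a contrastive phoneme-speech objective of this kind is a CLIP-style symmetric InfoNCE loss over matched embedding pairs. A minimal sketch assuming already-computed fixed-size embeddings; CLAP-IPA's actual encoders and temperature are not specified here.

```python
# Sketch of a symmetric contrastive loss over speech/phoneme embedding pairs.
import torch
import torch.nn.functional as F

def phoneme_speech_contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """speech_emb, phoneme_emb: (batch, dim); row i of each is a matched pair."""
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(s.size(0))         # matched pairs sit on the diagonal
    # Symmetric loss: speech-to-phoneme and phoneme-to-speech retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = phoneme_speech_contrastive_loss(torch.randn(8, 256, requires_grad=True),
                                       torch.randn(8, 256, requires_grad=True))
loss.backward()
```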

On-the-Fly Fusion of Large Language Models and Machine Translation

  • paper_url: http://arxiv.org/abs/2311.08306
  • repo_url: None
  • paper_authors: Hieu Hoang, Huda Khayrallah, Marcin Junczys-Dowmunt
  • for: Improving the translation quality of machine translation models.
  • methods: On-the-fly ensembling of an NMT model with an LLM prompted on the same task and input, combined with LLM prompting techniques such as in-context learning and translation context.
  • results: Experiments on four language pairs (both directions) show that even an LLM that is slightly weaker at translation can improve an NMT model's output, and that ensembling an NMT model with an LLM can outperform ensembling two stronger MT models.
    Abstract We propose the on-the-fly ensembling of a machine translation model with an LLM, prompted on the same task and input. We perform experiments on 4 language pairs (both directions) with varying data amounts. We find that a slightly weaker-at-translation LLM can improve translations of an NMT model, and ensembling with an LLM can produce better translations than ensembling two stronger MT models. We combine our method with various techniques from LLM prompting, such as in-context learning and translation context.
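Mechanically, on-the-fly ensembling amounts to combining the two models' next-token distributions at every decoding step. A toy sketch under the simplifying assumptions of a shared vocabulary and a fixed interpolation weight (neither is necessarily what the paper uses):

```python
# Sketch of a single ensembled decoding step over a shared vocabulary.
import numpy as np

def ensemble_step(nmt_probs: np.ndarray, llm_probs: np.ndarray,
                  weight: float = 0.5) -> int:
    """Greedy pick from a linear interpolation of two next-token distributions."""
    mixture = weight * nmt_probs + (1.0 - weight) * llm_probs
    return int(np.argmax(mixture))

# Toy vocabulary of size 5: the models disagree; the ensemble arbitrates.
nmt = np.array([0.10, 0.60, 0.10, 0.10, 0.10])
llm = np.array([0.10, 0.30, 0.50, 0.05, 0.05])
print(ensemble_step(nmt, llm))  # token 1: mixture 0.45 beats token 2's 0.30
```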

How Well Do Large Language Models Understand Syntax? An Evaluation by Asking Natural Language Questions

  • paper_url: http://arxiv.org/abs/2311.08287
  • repo_url: https://github.com/Jacob-Zhou/SynEval
  • paper_authors: Houquan Zhou, Yang Hou, Zhenghua Li, Xuebin Wang, Zhefeng Wang, Xinyu Duan, Min Zhang
  • for: Investigating whether large language models (LLMs) truly understand language or merely mimic comprehension through pattern recognition, examined through the lens of syntax.
  • methods: A natural language question-answering (Q&A) scheme with questions targeting nine syntactic knowledge points most closely tied to sentence comprehension.
  • results: Experiments on 24 LLMs show that most have only a limited grasp of syntactic knowledge, with notable gaps across knowledge points: prepositional phrase attachment is hardest, while adjectival modifiers and indirect objects are comparatively easy.
    Abstract While recent advancements in large language models (LLMs) bring us closer to achieving artificial general intelligence, the question persists: Do LLMs truly understand language, or do they merely mimic comprehension through pattern recognition? This study seeks to explore this question through the lens of syntax, a crucial component of sentence comprehension. Adopting a natural language question-answering (Q&A) scheme, we craft questions targeting nine syntactic knowledge points that are most closely related to sentence comprehension. Experiments conducted on 24 LLMs suggest that most have a limited grasp of syntactic knowledge, exhibiting notable discrepancies across different syntactic knowledge points. In particular, questions involving prepositional phrase attachment pose the greatest challenge, whereas those concerning adjectival modifier and indirect object are relatively easier for LLMs to handle. Furthermore, a case study on the training dynamics of the LLMs reveals that the majority of syntactic knowledge is learned during the initial stages of training, hinting that simply increasing the number of training tokens may not be the `silver bullet' for improving the comprehension ability of LLMs.

Examining Modularity in Multilingual LMs via Language-Specialized Subnetworks

  • paper_url: http://arxiv.org/abs/2311.08273
  • repo_url: None
  • paper_authors: Rochelle Choenni, Ekaterina Shutova, Dan Garrette
  • for: Investigating the degree to which language-wise modularity arises naturally in multilingual language models and how it relates to cross-lingual sharing.
  • methods: A Training Data Attribution method that estimates how much a model's predictions are influenced by in-language versus cross-language training examples.
  • results: Language-specialized subnetworks arise naturally without any special modularity intervention, and sparse fine-tuning (SFT), rather than always increasing modularity, can decrease language specialization in favor of more cross-lingual sharing.
    Abstract Recent work has proposed explicitly inducing language-wise modularity in multilingual LMs via sparse fine-tuning (SFT) on per-language subnetworks as a means of better guiding cross-lingual sharing. In this work, we investigate (1) the degree to which language-wise modularity naturally arises within models with no special modularity interventions, and (2) how cross-lingual sharing and interference differ between such models and those with explicit SFT-guided subnetwork modularity. To quantify language specialization and cross-lingual interaction, we use a Training Data Attribution method that estimates the degree to which a model's predictions are influenced by in-language or cross-language training examples. Our results show that language-specialized subnetworks do naturally arise, and that SFT, rather than always increasing modularity, can decrease language specialization of subnetworks in favor of more cross-lingual sharing.

A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

  • paper_url: http://arxiv.org/abs/2311.08268
  • repo_url: None
  • paper_authors: Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang
  • for: Improving the safety of Large Language Models (LLMs) by exposing how easily jailbreak prompts can make them generate harmful content.
  • methods: ReNeLLM, an automatic attack framework based on prompt rewriting and scenario nesting, which leverages LLMs themselves to generate effective jailbreak prompts.
  • results: ReNeLLM significantly improves the attack success rate while greatly reducing time cost compared with existing baselines; the study also shows that current defense methods fail to adequately safeguard LLMs.
    Abstract Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on another white-box model, compromising generalization or jailbreak efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we offer detailed analysis and discussion from the perspective of prompt execution priority on the failure of LLMs' defense. We hope that our research can catalyze both the academic community and LLMs vendors towards the provision of safer and more regulated Large Language Models.

Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster

  • paper_url: http://arxiv.org/abs/2311.08263
  • repo_url: https://github.com/smart-life-tech/231118-082633-esp32dev
  • paper_authors: Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
  • for: Speeding up inference for large language models (LLMs) so they can be used more readily in latency-sensitive applications.
  • methods: FastCoT, a model-agnostic framework based on parallel decoding that requires no auxiliary model and no modification to the LLM itself. It uses a context window whose size varies with position to run parallel decoding and autoregressive decoding simultaneously, fully utilizing GPU compute; the parallel part gives the LLM a quick glance at approximate future tokens.
  • results: Extensive experiments show that FastCoT reduces inference time by nearly 20% with only a negligible performance drop, and that the context window size is fairly robust across tasks.
    Abstract In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding without any further training of an auxiliary model or modification to the LLM itself. FastCoT uses a size-varying context window whose size changes with position to conduct parallel decoding and auto-regressive decoding simultaneously, thus fully utilizing GPU computation resources. In FastCoT, the parallel decoding part provides the LLM with a quick glance of the future composed of approximate tokens, which could lead to faster answers compared to regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within LLM, which supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach. Additionally, we show that the context window size exhibits considerable robustness for different tasks.

On Using Distribution-Based Compositionality Assessment to Evaluate Compositional Generalisation in Machine Translation

  • paper_url: http://arxiv.org/abs/2311.08249
  • repo_url: https://github.com/aalto-speech/dbca
  • paper_authors: Anssi Moisio, Mathias Creutz, Mikko Kurimo
  • for: Developing a benchmark for assessing compositional generalisation (CG) in a real-world natural language task, namely machine translation.
  • methods: The distribution-based compositionality assessment (DBCA) framework is used to split the Europarl translation corpus into training and test sets with divergent distributions of dependency relations, testing NMT systems on dependencies they were not trained on.
  • results: The splitting procedure is fully automated, making it simple and inexpensive to apply to other datasets and languages. Code and data are available at https://github.com/aalto-speech/dbca.
    Abstract Compositional generalisation (CG), in NLP and in machine learning more generally, has been assessed mostly using artificial datasets. It is important to develop benchmarks to assess CG also in real-world natural language tasks in order to understand the abilities and limitations of systems deployed in the wild. To this end, our GenBench Collaborative Benchmarking Task submission utilises the distribution-based compositionality assessment (DBCA) framework to split the Europarl translation corpus into a training and a test set in such a way that the test set requires compositional generalisation capacity. Specifically, the training and test sets have divergent distributions of dependency relations, testing NMT systems' capability of translating dependencies that they have not been trained on. This is a fully-automated procedure to create natural language compositionality benchmarks, making it simple and inexpensive to apply it further to other datasets and languages. The code and data for the experiments is available at https://github.com/aalto-speech/dbca.
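The idea of constructing splits with divergent dependency-relation distributions can be pictured with a crude greedy heuristic: place each sentence on whichever side makes the two sides' relation distributions overlap less. This is only a sketch of the principle; the divergence measure and greedy rule are simplifications, and the actual procedure lives in the authors' repository.

```python
# Crude greedy sketch of a distribution-divergent train/test split.
from collections import Counter

def overlap(c1: Counter, c2: Counter) -> float:
    """Shared probability mass between two relation-frequency distributions."""
    n1, n2 = sum(c1.values()) or 1, sum(c2.values()) or 1
    return sum(min(c1[r] / n1, c2[r] / n2) for r in set(c1) | set(c2))

def divergent_split(corpus):
    """corpus: one list of dependency-relation labels per sentence."""
    train, test = Counter(corpus[0]), Counter(corpus[1])   # seed each side
    train_idx, test_idx = [0], [1]
    for i, rels in enumerate(corpus[2:], start=2):
        c = Counter(rels)
        # Greedy rule: add the sentence to the side where the resulting
        # train/test overlap (hence compositional similarity) stays smaller.
        if overlap(train + c, test) <= overlap(train, test + c):
            train += c
            train_idx.append(i)
        else:
            test += c
            test_idx.append(i)
    return train_idx, test_idx

corpus = [["nsubj", "obj"], ["nsubj", "obl"], ["obj", "amod"], ["obl", "amod"]]
print(divergent_split(corpus))  # ([0, 2, 3], [1]) on this toy corpus
```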

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08213
  • repo_url: None
  • paper_authors: Xinwei Li, Li Lin, Shuai Wang, Chen Qian
  • for: Improving the performance and generalization ability of multi-modal large language models.
  • methods: A Competitive Multi-modal Distillation (CoMD) framework with two stages: multi-modal pre-training and multi-modal competitive distillation, the latter enabling bidirectional knowledge transfer between student and teacher models.
  • results: Experiments show the knowledge-transfer method consistently improves the student model; after four rounds of distillation, the 7B student surpasses the state-of-the-art LLaVA-13B on ScienceQA and the LLaVA test set and outperforms other strong baselines in the zero-shot setting.
    Abstract Recently, multi-modal content generation has attracted lots of attention from researchers by investigating the utilization of visual instruction tuning based on large language models (LLMs). To enhance the performance and generalization ability of such LLMs, the practice of distilling knowledge from pretrained multi-modal models (a.k.a. teachers) to more compact multi-modal LLMs (students) has gained considerable interest. However, the prevailing paradigm of instruction tuning in multi-modal LLMs knowledge distillation is resource-intensive and unidirectional, neglecting the potential for mutual feedback between the student and teacher models. Thus, we propose an innovative Competitive Multi-modal Distillation framework (CoMD), which captures bidirectional feedback between teacher and student models and continually updates the multi-modal capabilities that the student model has learned. It comprises two stages: multi-modal pre-training and multi-modal competitive distillation. The first stage pre-trains the student model on a large number of filtered multi-modal datasets. The second stage facilitates a bidirectional knowledge transfer between the student and teacher models. Our experimental analysis of diverse datasets shows that our knowledge transfer method consistently improves the capabilities of the student model. Finally, the 7B-sized student model after four distillations surpassed the current state-of-the-art model LLaVA-13B on the ScienceQA and LLaVA Test datasets, and also outperforms other strong baselines in the zero-shot setting.
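The abstract leaves the exact bidirectional transfer mechanism open; one common way to realize two-way transfer is mutual distillation, where each model fits the labels while also matching the other's softened predictions. The following is a hedged sketch of that reading, not the paper's actual loss; the temperature and mixing weight are assumptions.

```python
# Sketch of mutual (two-way) distillation between two models' logits.
import torch
import torch.nn.functional as F

def mutual_distill_losses(logits_a, logits_b, labels, T=2.0, alpha=0.5):
    """Return (loss_a, loss_b): task cross-entropy plus KL toward the peer."""
    ce_a = F.cross_entropy(logits_a, labels)
    ce_b = F.cross_entropy(logits_b, labels)
    kl_a = F.kl_div(F.log_softmax(logits_a / T, dim=-1),
                    F.softmax(logits_b.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    kl_b = F.kl_div(F.log_softmax(logits_b / T, dim=-1),
                    F.softmax(logits_a.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    return (1 - alpha) * ce_a + alpha * kl_a, (1 - alpha) * ce_b + alpha * kl_b

la, lb = mutual_distill_losses(torch.randn(4, 10, requires_grad=True),
                               torch.randn(4, 10, requires_grad=True),
                               torch.randint(0, 10, (4,)))
```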

GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding

  • paper_url: http://arxiv.org/abs/2311.08191
  • repo_url: https://github.com/gibson210/gec-depend
  • paper_authors: Konstantin Yakovlev, Alexander Podolskiy, Andrey Bout, Sergey Nikolenko, Irina Piontkovskaya
  • for: Speeding up grammatical error correction (GEC), which is usually solved with inherently slow autoregressive sequence-to-sequence models.
  • methods: A novel non-autoregressive approach that decouples the architecture into a permutation network, which outputs a self-attention weight matrix used in beam search to find the best permutation of input tokens (with auxiliary {ins} tokens), and a decoder network based on a step-unrolled denoising autoencoder that fills in specific tokens.
  • results: The resulting network improves over previously known non-autoregressive GEC methods and reaches the level of autoregressive methods that do not use language-specific synthetic data generation, as validated on the ConLL-2014 and Write&Improve+LOCNESS datasets and in an extensive ablation study.
    Abstract Grammatical error correction (GEC) is an important NLP task that is currently usually solved with autoregressive sequence-to-sequence models. However, approaches of this class are inherently slow due to one-by-one token generation, so non-autoregressive alternatives are needed. In this work, we propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network that outputs a self-attention weight matrix that can be used in beam search to find the best permutation of input tokens (with auxiliary {ins} tokens) and a decoder network based on a step-unrolled denoising autoencoder that fills in specific tokens. This allows us to find the token permutation after only one forward pass of the permutation network, avoiding autoregressive constructions. We show that the resulting network improves over previously known non-autoregressive methods for GEC and reaches the level of autoregressive methods that do not use language-specific synthetic data generation methods. Our results are supported by a comprehensive experimental validation on the ConLL-2014 and Write&Improve+LOCNESS datasets and an extensive ablation study that supports our architectural and algorithmic choices.

Unlocking Science: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction

  • paper_url: http://arxiv.org/abs/2311.08189
  • repo_url: None
  • paper_authors: Yuhan Li, Jian Wu, Zhiwei Yu, Börje F. Karlsson, Wei Shen, Manabu Okumura, Chin-Yew Lin
  • for: Providing a semi-supervised annotation pipeline for scientific information extraction that works across modalities, covering both the text and the tables of a paper.
  • methods: An iterative, semi-supervised pipeline that annotates entities in text as well as entities and relations in tables.
  • results: The pipeline enables cross-modality annotation at reduced labeling cost, yielding a high-quality benchmark and a large-scale corpus; the paper also reports baseline results of state-of-the-art IE models and explores the capability of large language models such as ChatGPT on the task.
    Abstract Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly focus only on specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.

Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.08182
  • repo_url: https://github.com/ofa-sys/diverseevol
  • paper_authors: Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou
  • for: Improving the instruction-following ability of large language models (LLMs) without the heavy annotation cost of large instruction-tuning datasets.
  • methods: DiverseEvol, a self-evolving mechanism in which the model itself iteratively samples the training subsets that benefit it most, choosing new data points maximally distinct from existing ones in its current embedding space, with no human or stronger-LLM intervention.
  • results: Across three datasets and benchmarks, models trained on less than 8% of the original data match or exceed the performance of models fine-tuned on the full data.
    Abstract Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To investigate a label-efficient instruction tuning method that allows the model itself to actively sample subsets that are equally or even more effective, we introduce a self-evolving mechanism DiverseEvol. In this process, a model iteratively augments its training subset to refine its own performance, without requiring any intervention from humans or more advanced LLMs. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets, as the model selects new data points most distinct from any existing ones according to its current embedding space. Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol. Our models, trained on less than 8% of the original dataset, maintain or improve performance compared with finetuning on full data. We also provide empirical evidence to analyze the importance of diversity in instruction data and the iterative scheme as opposed to one-time sampling. Our code is publicly available at https://github.com/OFA-Sys/DiverseEvol.git.
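Selecting "new data points most distinct from any existing ones" is naturally instantiated as farthest-point (k-center) sampling in embedding space. The sketch below shows that selection step over static embeddings; the paper recomputes embeddings with the evolving model between iterations, which this sketch omits.

```python
# Farthest-point sampling sketch of a diversity-driven data selection step.
import numpy as np

def diverse_select(pool: np.ndarray, selected: np.ndarray, k: int) -> list:
    """pool: (n, d) candidate embeddings; selected: (m, d) already-chosen ones.
    Returns indices of k pool points maximizing distance to the chosen set."""
    chosen = list(selected)
    picked = []
    # Distance from each candidate to its nearest already-chosen point.
    d = np.full(len(pool), np.inf) if not chosen else \
        np.min(np.linalg.norm(pool[:, None] - np.array(chosen)[None], axis=-1),
               axis=1)
    for _ in range(k):
        i = int(np.argmax(d))           # farthest candidate from everything chosen
        picked.append(i)
        chosen.append(pool[i])
        d = np.minimum(d, np.linalg.norm(pool - pool[i], axis=-1))
    return picked

rng = np.random.default_rng(0)
print(diverse_select(rng.normal(size=(100, 8)), rng.normal(size=(5, 8)), k=3))
```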

Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration

  • paper_url: http://arxiv.org/abs/2311.08152
  • repo_url: https://github.com/hitsz-tmg/multi-agent-peer-review
  • paper_authors: Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, Yuxiang Wu
  • for: Pushing the reasoning ability of language models beyond what a single model can achieve on complex tasks.
  • methods: A multi-agent collaboration strategy that emulates academic peer review: each agent independently constructs a solution, reviews the solutions of the others with associated confidence levels, and revises its own solution in light of the reviews it receives.
  • results: On three types of reasoning tasks, the collaboration approach delivers higher accuracy than existing methods across all ten datasets, and incorporating confidence into the reviews proves effective for mathematical reasoning.
    Abstract Large Language Models (LLMs) have shown remarkable capabilities in general natural language processing tasks but often fall short in complex reasoning tasks. Recent studies have explored human-like problem-solving strategies, such as self-correct, to push further the boundary of single-model reasoning ability. In this work, we let a single model "step outside the box" by engaging multiple models to correct each other. We introduce a multi-agent collaboration strategy that emulates the academic peer review process. Each agent independently constructs its own solution, provides reviews on the solutions of others, and assigns confidence levels to its reviews. Upon receiving peer reviews, agents revise their initial solutions. Extensive experiments on three different types of reasoning tasks show that our collaboration approach delivers superior accuracy across all ten datasets compared to existing methods. Further study demonstrates the effectiveness of integrating confidence in the reviews for math reasoning, and suggests a promising direction for human-mimicking multi-agent collaboration process.

Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval

  • paper_url: http://arxiv.org/abs/2311.08143
  • repo_url: None
  • paper_authors: Konstantin Yakovlev, Gregory Polyakov, Ilseyar Alimova, Alexander Podolskiy, Andrey Bout, Sergey Nikolenko, Irina Piontkovskaya
  • for: Improving a recent trend in multimodal retrieval: postprocessing test-set results, previously done with the dual-softmax loss (DSL).
  • methods: A new postprocessing approach based on Sinkhorn transformations, including a single-query setting that does not require access to multiple test queries.
  • results: The method significantly improves state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, achieving a new state of the art on several standard text-video retrieval datasets.
    Abstract A recent trend in multimodal retrieval is related to postprocessing test set results via the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that an entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state of the art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state-of-the-art on several standard text-video retrieval datasets both with access to the entire test set and in the single-query setting.
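To make the contrast with DSL concrete: a Sinkhorn transformation alternately normalizes the rows and columns of the (exponentiated) query-item similarity matrix, pushing it toward a doubly stochastic matrix before ranking. A minimal full-matrix sketch; the temperature and iteration count are assumptions, and the paper's single-query variant, which avoids needing the whole matrix, is not shown.

```python
# Sketch of Sinkhorn-style postprocessing of a text-video similarity matrix.
import numpy as np

def sinkhorn(sim: np.ndarray, temperature: float = 0.05, n_iters: int = 10):
    """sim: (n_queries, n_items) raw similarity scores."""
    k = np.exp(sim / temperature)
    for _ in range(n_iters):
        k /= k.sum(axis=1, keepdims=True)   # row normalization
        k /= k.sum(axis=0, keepdims=True)   # column normalization
    return k

sim = np.array([[0.90, 0.80],
                [0.85, 0.40]])
print(sim.argmax(axis=1))            # [0 0]: both queries grab item 0
print(sinkhorn(sim).argmax(axis=1))  # [1 0]: rebalanced one-to-one matching
```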

Memory-efficient Stochastic methods for Memory-based Transformers

  • paper_url: http://arxiv.org/abs/2311.08123
  • repo_url: https://github.com/vishwajit-vishnu/memory-efficient-stochastic-methods-for-memory-based-transformers
  • paper_authors: Vishwajit Kumar Vishnu, C. Chandra Sekhar
  • for: Improving the training efficiency of memory-based transformers, which are often used for long-range context problems but can be memory-hungry and inefficient to train.
  • methods: A novel two-phase training mechanism and a novel regularization technique, evaluated with Transformer-XL as the baseline memory-based model.
  • results: The resulting model, Skip Cross-head TransformerXL, outperforms the baseline on character-level language modeling with similar parameters and on word-level language modeling with almost 20% fewer parameters, all without additional memory. The regularization mechanism also reduces the standard deviation of BERT's scores by around 30% on multiple GLUE tasks at similar performance.
    Abstract Training memory-based transformers can require a large amount of memory and can be quite inefficient. We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers, which are often used for long-range context problems. For our experiments, we consider Transformer-XL as our baseline model, which is one of the memory-based transformer models. We show that our resultant model, Skip Cross-head TransformerXL, outperforms the baseline on the character-level language modeling task with similar parameters and outperforms the baseline on the word-level language modelling task with almost 20% fewer parameters. Our proposed methods do not require any additional memory. We also demonstrate the effectiveness of our regularization mechanism on BERT, which shows similar performance with a reduction in standard deviation of scores of around 30% on multiple GLUE tasks.

Insights into Classifying and Mitigating LLMs’ Hallucinations

  • paper_url: http://arxiv.org/abs/2311.08117
  • repo_url: None
  • paper_authors: Alessandro Bruno, Pier Luigi Mazzeo, Aladine Chetouani, Marouane Tliba, Mohamed Amine Kerkouri
  • for: Examining the phenomenon of hallucination in large language models and its significance for artificial intelligence.
  • methods: Hallucination classification is tackled across several tasks: machine translation, question answering, dialog systems, summarisation systems, knowledge graphs with LLMs, and visual question answering.
  • results: The paper clarifies how hallucinations can inject false or misleading information into generated text and surveys potential mitigation strategies aimed at improving the overall reliability of LLMs.
    Abstract The widespread adoption of large language models (LLMs) across diverse AI applications is proof of the outstanding achievements obtained in several tasks, such as text mining, text generation, and question answering. However, LLMs are not exempt from drawbacks. One of the most concerning aspects regards the emerging problematic phenomenon known as "Hallucination". It manifests in text generation systems, particularly in question-answering systems reliant on LLMs, potentially resulting in false or misleading information propagation. This paper delves into the underlying causes of AI hallucination and elucidates its significance in artificial intelligence. In particular, Hallucination classification is tackled over several tasks (Machine Translation, Question and Answer, Dialog Systems, Summarisation Systems, Knowledge Graph with LLMs, and Visual Question Answer). Additionally, we explore potential strategies to mitigate hallucinations, aiming to enhance the overall reliability of LLMs. Our research addresses this critical issue within the HeReFaNMi (Health-Related Fake News Mitigation) project, generously supported by NGI Search, dedicated to combating Health-Related Fake News dissemination on the Internet. This endeavour represents a concerted effort to safeguard the integrity of information dissemination in an age of evolving AI technologies.

Improving hateful memes detection via learning hatefulness-aware embedding space through retrieval-guided contrastive learning

  • paper_url: http://arxiv.org/abs/2311.08110
  • repo_url: None
  • paper_authors: Jingbiao Mei, Jinghong Chen, Weizhe Lin, Bill Byrne, Marcus Tomalin
  • for: Detecting hateful memes, which combine image and text and often convey messages very different from either modality alone.
  • methods: Constructing a hatefulness-aware embedding space via retrieval-guided contrastive training, with an auxiliary loss that uses hard negative and pseudo-gold samples.
  • results: State-of-the-art performance on the HatefulMemes dataset with an AUROC of 86.7, outperforming much larger fine-tuned multimodal models such as Flamingo and LLaVA, plus a retrieval-based detection system that can classify data unseen during training from a database, so the system can be updated by simply adding new data without retraining.
    Abstract Hateful memes have emerged as a significant concern on the Internet. These memes, which are a combination of image and text, often convey messages vastly different from their individual meanings. Thus, detecting hateful memes requires the system to jointly understand the visual and textual modalities. However, our investigation reveals that the embedding space of existing CLIP-based systems lacks sensitivity to subtle differences in memes that are vital for correct hatefulness classification. To address this issue, we propose constructing a hatefulness-aware embedding space through retrieval-guided contrastive training. Specifically, we add an auxiliary loss that utilizes hard negative and pseudo-gold samples to train the embedding space. Our approach achieves state-of-the-art performance on the HatefulMemes dataset with an AUROC of 86.7. Notably, our approach outperforms much larger fine-tuned Large Multimodal Models like Flamingo and LLaVA. Finally, we demonstrate a retrieval-based hateful memes detection system, which is capable of making hatefulness classification based on data unseen in training from a database. This allows developers to update the hateful memes detection system by simply adding new data without retraining, a desirable feature for real services in the constantly-evolving landscape of hateful memes on the Internet.
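The auxiliary contrastive loss with hard negatives and pseudo-gold samples can be read as a supervised-contrastive objective over the hatefulness label: same-label neighbors are pulled together, opposite-label neighbors pushed apart. A SupCon-style sketch under that reading; the retrieval step that selects the hard negatives and the exact loss form are assumptions.

```python
# SupCon-style sketch of a hatefulness-aware contrastive loss.
import torch
import torch.nn.functional as F

def hatefulness_contrastive_loss(emb, labels, temperature=0.07):
    """emb: (n, d) meme embeddings; labels: (n,) binary hatefulness labels."""
    z = F.normalize(emb, dim=-1)
    sim = z @ z.t() / temperature
    mask_self = torch.eye(z.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask_self, float("-inf"))   # drop self-similarity
    same = (labels[:, None] == labels[None, :]) & ~mask_self
    log_p = sim.log_softmax(dim=1)
    # Per anchor: negative log of the probability mass on same-label
    # (pseudo-gold) neighbors vs. opposite-label (hard negative) neighbors.
    pos = log_p.masked_fill(~same, float("-inf")).logsumexp(dim=1)
    return -pos.mean()

emb = torch.randn(8, 64, requires_grad=True)
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # every anchor has positives
hatefulness_contrastive_loss(emb, labels).backward()
```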

SAIE Framework: Support Alone Isn’t Enough – Advancing LLM Training with Adversarial Remarks

  • paper_url: http://arxiv.org/abs/2311.08107
  • repo_url: None
  • paper_authors: Mengsay Loem, Masahiro Kaneko, Naoaki Okazaki
  • for: Improving a model's understanding of training instances and its reasoning ability at inference time.
  • methods: The SAIE training method, in which a learner model and a partner model exchange both supportive and adversarial remarks; the learner's parameters are updated based on the partner's remarks, so the teacher signal adjusts dynamically as the learner's outputs evolve.
  • results: Across datasets including GSM8K, CommonsenseQA, and MMLU, models fine-tuned with SAIE consistently surpass standard fine-tuning, and the approach also boosts reasoning in multi-agent inference scenarios.
    Abstract Large Language Models (LLMs) can justify or criticize their predictions through discussion with other models or humans, thereby enhancing their intrinsic understanding of instances. While proactive discussions enhance performance, this approach is currently limited to the inference phase. In this context, we posit a hypothesis: learning interactive discussions during training can improve understanding of the instances in the training step, as well as proficiency in logical/critical thinking and verbalized expression in the inference step. Our proposed SAIE training method involves both supportive and adversarial discussions between the learner and partner models. The learner model receives a remark from the partner through the discussion, and the parameters of the learner model are then updated based on this remark. That is, the teacher signal dynamically adjusts in response to the evolving model output throughout the training step. By bolstering the capacity for discussion and comprehension of instances, our experiments across datasets, including GSM8K, CommonsenseQA, and MMLU, reveal that models fine-tuned with our method consistently surpass those trained with standard fine-tuning techniques. Moreover, our approach demonstrates superior performance in multi-agent inference scenarios, boosting the models' reasoning abilities at the inference step.

Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models

  • paper_url: http://arxiv.org/abs/2311.08106
  • repo_url: None
  • paper_authors: Yujin Kim, Jaehong Yoon, Seonghyeon Ye, Sung Ju Hwang, Se-young Yun
  • for: Addressing the challenge that language models trained on static data must both acquire new knowledge and overwrite outdated knowledge in an ever-evolving world.
  • methods: EvolvingQA, a temporally evolving question-answering benchmark for training and evaluating language models on an evolving Wikipedia database; benchmark construction is automated with a pipeline that uses large language models, and question answering serves as the downstream task to emulate real-world applications.
  • results: Existing continual learning baselines struggle both to update and to forget outdated knowledge; the models fail to learn updated knowledge because the associated weight gradients are small, and they struggle most with questions whose updated answers are numerical or temporal.
    Abstract In an ever-evolving world, the dynamic nature of knowledge presents challenges for language models that are trained on static data, leading to outdated encoded information. However, real-world scenarios require models not only to acquire new knowledge but also to overwrite outdated information into updated ones. To address this under-explored issue, we introduce the temporally evolving question answering benchmark, EvolvingQA - a novel benchmark designed for training and evaluating LMs on an evolving Wikipedia database, where the construction of our benchmark is automated with our pipeline using large language models. Our benchmark incorporates question-answering as a downstream task to emulate real-world applications. Through EvolvingQA, we uncover that existing continual learning baselines have difficulty in updating and forgetting outdated knowledge. Our findings suggest that the models fail to learn updated knowledge due to the small weight gradient. Furthermore, we elucidate that the models struggle mostly on providing numerical or temporal answers to questions asking for updated knowledge. Our work aims to model the dynamic nature of real-world information, offering a robust measure for the evolution-adaptability of language models.

DiLoCo: Distributed Low-Communication Training of Language Models

  • paper_url: http://arxiv.org/abs/2311.08105
  • repo_url: None
  • paper_authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc’Aurelio Ranzato, Arthur Szlam, Jiajun Shen
  • for: Training large language models on islands of poorly connected devices, avoiding the need for a single tightly interconnected accelerator cluster.
  • methods: DiLoCo, a distributed optimization algorithm that is a variant of federated averaging with a large number of inner steps, AdamW as the inner optimizer, and Nesterov momentum as the outer optimizer.
  • results: On the widely used C4 dataset, DiLoCo with 8 workers matches fully synchronous optimization while communicating 500 times less, and is robust both to the data distribution of each worker and to resources dropping out of or joining the training over time.
    Abstract Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time, and vice versa, it can seamlessly leverage resources that become available during training.
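The algorithm itself is compact: each worker takes many local AdamW steps from the shared weights, only the resulting parameter deltas are communicated, and an outer Nesterov-momentum step applies the averaged delta as a pseudo-gradient. A toy single-process sketch of one communication round; the model, data, and hyperparameters are stand-ins, not the paper's setup.

```python
# Sketch of one DiLoCo round: inner AdamW phases, outer Nesterov step.
import copy
import torch

def diloco_round(global_model, worker_batches, inner_steps=50,
                 inner_lr=1e-3, outer_lr=0.7, outer_mu=0.9, momentum=None):
    deltas = []
    for batches in worker_batches:                 # conceptually in parallel
        local = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(local.parameters(), lr=inner_lr)
        for x, y in batches[:inner_steps]:         # inner optimization phase
            opt.zero_grad()
            torch.nn.functional.mse_loss(local(x), y).backward()
            opt.step()
        # Pseudo-gradient: how far this worker moved the weights.
        deltas.append([g.detach() - l.detach() for g, l in
                       zip(global_model.parameters(), local.parameters())])
    avg = [torch.stack(ps).mean(dim=0) for ps in zip(*deltas)]
    if momentum is None:
        momentum = [torch.zeros_like(a) for a in avg]
    with torch.no_grad():
        for p, g, buf in zip(global_model.parameters(), avg, momentum):
            buf.mul_(outer_mu).add_(g)                   # momentum update
            p.add_(g + outer_mu * buf, alpha=-outer_lr)  # Nesterov step
    return momentum

model = torch.nn.Linear(4, 1)
make_data = lambda: [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(50)]
buf = diloco_round(model, [make_data() for _ in range(4)])  # 4 simulated workers
```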

Align after Pre-train: Improving Multilingual Generative Models with Cross-lingual Alignment

  • paper_url: http://arxiv.org/abs/2311.08089
  • repo_url: None
  • paper_authors: Chong Li, Shaonan Wang, Jiajun Zhang, Chengqing Zong
  • for: Improving the cross-lingual abilities of multilingual generative models.
  • methods: An alignment framework that exploits pairs of translation sentences, aligning internal sentence representations across languages via multilingual contrastive learning and aligning model outputs by answering prompts in different languages.
  • results: Even with less than 0.1‰ of the pre-training tokens, the alignment framework significantly improves the cross-lingual abilities of generative models, mitigates the performance gap toward high-resource languages, and yields a better internal multilingual representation distribution.
    Abstract Multilingual generative models obtain remarkable cross-lingual capabilities through pre-training on large-scale corpora. However, they still exhibit a performance bias toward high-resource languages, and learn isolated distributions of sentence representations across languages. To bridge this gap, we propose a simple yet effective alignment framework exploiting pairs of translation sentences. It aligns the internal sentence representations across different languages via multilingual contrastive learning and aligns model outputs by answering prompts in different languages. Experimental results demonstrate that even with less than 0.1‰ of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative models and mitigates the performance gap. Further analysis reveals that it results in a better internal multilingual representation distribution of multilingual models.

Data and models for stance and premise detection in COVID-19 tweets: insights from the Social Media Mining for Health (SMM4H) 2022 shared task

  • paper_url: http://arxiv.org/abs/2311.08057
  • repo_url: None
  • paper_authors: Vera Davydova, Huabin Yang, Elena Tutubalina
  • for: Evaluating neural models for stance detection and premise classification in the health domain.
  • methods: Manually annotated tweets on three COVID-19-related topics (school closures, stay-at-home orders, and mask wearing) are used for evaluation, and tweet texts are aggregated with claims using models with feature-level (early) fusion and dual-view architectures from the SMM4H 2022 leaderboard.
  • results: Newly collected Twitter data on vaccination is used to assess model performance on a different topic, producing a valuable dataset and an extensive experimental evaluation to support future argument-mining research in the health domain.
    Abstract The COVID-19 pandemic has sparked numerous discussions on social media platforms, with users sharing their views on topics such as mask-wearing and vaccination. To facilitate the evaluation of neural models for stance detection and premise classification, we organized the Social Media Mining for Health (SMM4H) 2022 Shared Task 2. This competition utilized manually annotated posts on three COVID-19-related topics: school closures, stay-at-home orders, and wearing masks. In this paper, we extend the previous work and present newly collected data on vaccination from Twitter to assess the performance of models on a different topic. To enhance the accuracy and effectiveness of our evaluation, we employed various strategies to aggregate tweet texts with claims, including models with feature-level (early) fusion and dual-view architectures from SMM4H 2022 leaderboard. Our primary objective was to create a valuable dataset and perform an extensive experimental evaluation to support future research in argument mining in the health domain.

Forgetting before Learning: Utilizing Parametric Arithmetic for Knowledge Updating in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08011
  • repo_url: None
  • paper_authors: Shiwen Ni, Dingwei Chen, Chengming Li, Xiping Hu, Ruifeng Xu, Min Yang
  • for: Improving the ability of large language models to update their knowledge with new information.
  • methods: F-Learning (Forgetting before Learning), a fine-tuning paradigm based on parametric arithmetic that first forgets old knowledge and then learns new knowledge.
  • results: Experiments on two publicly available datasets show that F-Learning clearly improves knowledge-updating performance for both full fine-tuning and LoRA fine-tuning; moreover, forgetting old knowledge by subtracting LoRA parameters achieves an effect similar to subtracting full fine-tuning parameters, and sometimes significantly surpasses it.
    Abstract Recently Large Language Models (LLMs) have demonstrated their amazing text understanding and generation capabilities. However, even stronger LLMs may still learn incorrect knowledge from the training corpus, as well as some knowledge that is outdated over time. Direct secondary fine-tuning with data containing new knowledge may be ineffective in updating knowledge due to the conflict between old and new knowledge. In this paper, we propose a new paradigm for fine-tuning called F-Learning (Forgetting before Learning), which is based on parametric arithmetic to achieve forgetting of old knowledge and learning of new knowledge. Experimental results on two publicly available datasets demonstrate that our proposed F-Learning can obviously improve the knowledge updating performance of both full fine-tuning and LoRA fine-tuning. Moreover, we have also discovered that forgetting old knowledge by subtracting the parameters of LoRA can achieve a similar effect to subtracting the parameters of full fine-tuning, and sometimes even surpass it significantly.
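The "parametric arithmetic" step can be pictured as plain weight arithmetic: fine-tune on the old knowledge to isolate its parameter delta, subtract (a scaled version of) that delta from the base weights to forget, then fine-tune the result on the new knowledge. A minimal state-dict sketch of the forgetting step; the scaling factor and the use of full-model deltas (the paper also explores subtracting LoRA deltas) are assumptions here.

```python
# Sketch of the "forgetting" step as state-dict arithmetic.
import torch

def subtract_delta(base: dict, tuned_on_old: dict, lam: float = 1.0) -> dict:
    """theta_forgot = theta_base - lam * (theta_old_ft - theta_base)."""
    return {k: base[k] - lam * (tuned_on_old[k] - base[k]) for k in base}

base = {"w": torch.tensor([1.0, 2.0])}
tuned_on_old = {"w": torch.tensor([1.5, 2.5])}   # after fine-tuning on old facts
forgot = subtract_delta(base, tuned_on_old)      # {"w": tensor([0.5, 1.5])}
# ...then fine-tune `forgot` on the new-knowledge data to complete F-Learning.
print(forgot)
```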

A Comparative Analysis of the COVID-19 Infodemic in English and Chinese: Insights from Social Media Textual Data

  • paper_url: http://arxiv.org/abs/2311.08001
  • repo_url: None
  • paper_authors: Jia Luo, Daiyun Peng, Lei Shi, Didier El Baz, Xinran Liu
  • for: A comparative analysis of the COVID-19 infodemic in English and Chinese, based on textual data from social media platforms.
  • methods: Two balanced infodemic datasets are built by augmenting previously collected social media text, and are analyzed through word frequency analysis, topic clustering, and sentiment analysis.
  • results: The thirty-five most frequently occurring infodemic words are identified, topic clustering uncovers the thematic structure of the discussion in each language, and sentiment analysis characterizes the emotional tone of COVID-19 information on social media in English and Chinese.
    Abstract The COVID-19 infodemic, characterized by the rapid spread of misinformation and unverified claims related to the pandemic, presents a significant challenge. This paper presents a comparative analysis of the COVID-19 infodemic in the English and Chinese languages, utilizing textual data extracted from social media platforms. To ensure a balanced representation, two infodemic datasets were created by augmenting previously collected social media textual data. Through word frequency analysis, the thirty-five most frequently occurring infodemic words are identified, shedding light on prevalent discussions surrounding the infodemic. Moreover, topic clustering analysis uncovers thematic structures and provides a deeper understanding of primary topics within each language context. Additionally, sentiment analysis enables comprehension of the emotional tone associated with COVID-19 information on social media platforms in English and Chinese. This research contributes to a better understanding of the COVID-19 infodemic phenomenon and can guide the development of strategies to combat misinformation during public health crises across different languages.

How Well Do Text Embedding Models Understand Syntax?

  • paper_url: http://arxiv.org/abs/2311.07996
  • repo_url: https://github.com/fzp0424/sr
  • paper_authors: Yan Zhang, Zhaopeng Feng, Zhiyang Teng, Zuozhu Liu, Haizhou Li
  • for: Investigating how well text embedding models generalize across a wide range of syntactic contexts, an ability under-explored in previous work.
  • methods: An evaluation set, SR, is developed to scrutinize syntax understanding in text embedding models from two crucial aspects: structural heuristics and relational understanding among concepts.
  • results: Existing text embedding models have not sufficiently addressed these syntactic understanding challenges, and the ineffectiveness is even more apparent on existing benchmark datasets; the paper analyzes the factors behind these limitations, examines why previous evaluations failed to detect them, and proposes strategies to improve generalization across diverse syntactic scenarios.
    Abstract Text embedding models have significantly contributed to advancements in natural language processing by adeptly capturing semantic properties of textual data. However, the ability of these models to generalize across a wide range of syntactic contexts remains under-explored. In this paper, we first develop an evaluation set, named SR, to scrutinize the capability for syntax understanding of text embedding models from two crucial syntactic aspects: Structural heuristics, and Relational understanding among concepts, as revealed by the performance gaps in previous studies. Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges, and such ineffectiveness becomes even more apparent when evaluated against existing benchmark datasets. Furthermore, we conduct rigorous analysis to unearth factors that lead to such limitations and examine why previous evaluations fail to detect such ineffectiveness. Lastly, we propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios. This study serves to highlight the hurdles associated with syntactic generalization and provides pragmatic guidance for boosting model performance across varied syntactic contexts.

The ART of LLM Refinement: Ask, Refine, and Trust

  • paper_url: http://arxiv.org/abs/2311.07961
  • repo_url: None
  • paper_authors: Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz
  • for: Improving the quality of Large Language Model (LLM) generations through self-refinement.
  • methods: The Ask, Refine, and Trust (ART) objective, which asks the necessary questions to decide when an LLM should refine its output, and affirms or withholds trust in the refinement by ranking it against the initial prediction.
  • results: On two multistep reasoning tasks, mathematical word problems (GSM8K) and question answering (StrategyQA), ART gains +5 points over self-refinement baselines while using a much smaller model as the decision maker, making it a cost-effective alternative to fine-tuning a larger model.
    Abstract In recent years, Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations? A popular concept, referred to as self-refinement, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a reasoning with refinement objective called ART: Ask, Refine, and Trust, which asks necessary questions to decide when an LLM should refine its output, and either affirm or withhold trust in its refinement by ranking the refinement and the initial prediction. On two multistep reasoning tasks of mathematical word problems (GSM8K) and question answering (StrategyQA), ART achieves a performance gain of +5 points over self-refinement baselines, while using a much smaller model as the decision maker. We also demonstrate the benefit of using smaller models to make refinement decisions as a cost-effective alternative to fine-tuning a larger model.

First Step Advantage: Importance of Starting Right in Multi-Step Reasoning

  • paper_url: http://arxiv.org/abs/2311.07945
  • repo_url: None
  • paper_authors: Kushal Jain, Kumar Shridhar
  • for: Examining how large language models (LLMs) solve complex reasoning tasks and how those capabilities can be distilled into smaller, compact models.
  • methods: LLMs guide smaller models back onto the correct reasoning path, enabling the creation of specialized, cost-effective models tailored for specific tasks, provided the intervention happens at the right time.
  • results: Smaller models fail at reasoning primarily because they struggle to initiate the process; guiding them in the right direction at the start can yield a performance gain of over 100%.
    Abstract Large Language Models (LLMs) can solve complex reasoning tasks by generating rationales for their predictions. Distilling these capabilities into a smaller, compact model can facilitate the creation of specialized, cost-effective models tailored for specific tasks. However, smaller models often face challenges in complex reasoning tasks and often deviate from the correct reasoning path. We show that LLMs can guide smaller models and bring them back to the correct reasoning path only if they intervene at the right time. We show that smaller models fail to reason primarily due to their difficulty in initiating the process, and that guiding them in the right direction can lead to a performance gain of over 100%. We explore different model sizes and evaluate the benefits of providing guidance to improve reasoning in smaller models.
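
A tiny sketch of the paper's central idea under stated assumptions: the intervention happens at the start, with a stronger teacher supplying only the opening reasoning step that the smaller student then continues. Both functions are hypothetical stand-ins for model calls.

```python
# Hedged sketch of first-step guidance; both calls are stubs.

def teacher_first_step(problem: str) -> str:
    # Stand-in for the large LLM: produce only the opening reasoning step.
    return "Step 1: Define x as the number of apples Tom starts with."

def student_continue(problem: str, prefix: str) -> str:
    # Stand-in for the small model: continue the rationale from the prefix.
    return prefix + "\nStep 2: ... (student completes the reasoning chain)"

def guided_solve(problem: str) -> str:
    # Intervene at the right time: the very first step of the rationale.
    return student_continue(problem, teacher_first_step(problem))

print(guided_solve("Tom has some apples and gives half away..."))
```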

It’s All Relative! – A Synthetic Query Generation Approach for Improving Zero-Shot Relevance Prediction

  • paper_url: http://arxiv.org/abs/2311.07930
  • repo_url: None
  • paper_authors: Aditi Chaudhary, Karthik Raman, Michael Bendersky
  • for: Improving Large Language Models' (LLMs) ability to generate synthetic query-document pairs for building better IR models, especially for tasks with no training data readily available.
  • methods: Prompts an LLM with a few demonstrations to generate synthetic queries, conditioning generation on the input document and generating queries for different relevance labels relative to one another.
  • results: Extensive experiments on seven IR datasets show that synthetic queries generated this way improve downstream performance, indicating higher query quality.
    Abstract Recent developments in large language models (LLMs) have shown promise in their ability to generate synthetic query-document pairs by prompting with as few as 8 demonstrations. This has enabled building better IR models, especially for tasks with no training data readily available. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. a text document) and generate a query relevant to that context, or condition the QGen model additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as they require the model to reason about the desired label and the input from a handful of examples. In this work, we propose to reduce this burden of LLMs by generating queries simultaneously for different labels. We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query relative to a relevant query is a much simpler task setup for the model to reason about. Extensive experimentation across seven IR datasets shows that synthetic queries generated in such a fashion translates to a better downstream performance, suggesting that the generated queries are indeed of higher quality.
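
The shift the paper proposes can be shown as a prompt-construction sketch: rather than asking for an irrelevant query from the document alone, the model sees a relevant query and is asked for an irrelevant one relative to it. The template wording below is illustrative, not the paper's exact prompt.

```python
# Hedged sketch of "relative" synthetic query generation; template is illustrative.

def relative_qgen_prompt(document: str, relevant_query: str) -> str:
    return (
        f"Document: {document}\n"
        f"Relevant query: {relevant_query}\n"
        "Now write a query on a similar topic that the document does NOT answer:"
    )

print(relative_qgen_prompt(
    "The Amazon rainforest spans nine countries and hosts ...",
    "which countries does the amazon rainforest span",
))
```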

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

  • paper_url: http://arxiv.org/abs/2311.07919
  • repo_url: https://github.com/qwenlm/qwen-audio
  • paper_authors: Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou
  • for: Improving audio-language models' multi-task capabilities and their ability to handle diverse audio types.
  • methods: Uses a multi-task training framework to pre-train an audio-language model that handles diverse audio types and tasks without task-specific fine-tuning.
  • results: Qwen-Audio achieves impressive performance across diverse benchmark tasks, surpassing comparable models. Built on it, Qwen-Audio-Chat accepts diverse audio and text inputs, supporting multi-turn dialogues and various audio-central scenarios.
    Abstract Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.
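
The hierarchical-tag conditioning can be pictured as a small prefix-building step: shared tags (such as the audio type) sit where transfer across tasks is wanted, and specified tags keep incompatible label formats apart. The tag names below are invented for illustration; the actual tag scheme is defined in the paper.

```python
# Hedged sketch of hierarchical tag conditioning; tag vocabulary is invented.

def build_tag_prefix(audio_type: str, task: str, language: str) -> str:
    # Shared tags (audio_type) encourage knowledge sharing across tasks;
    # specified tags (task, language) shield dissimilar label formats
    # from one-to-many interference.
    return f"<|{audio_type}|><|{task}|><|{language}|>"

# The decoder generates the label text after this prefix.
print(build_tag_prefix("speech", "transcribe", "en"))
print(build_tag_prefix("music", "caption", "en"))
```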

Automated title and abstract screening for scoping reviews using the GPT-4 Large Language Model

  • paper_url: http://arxiv.org/abs/2311.07918
  • repo_url: https://github.com/wilkox/gptscreenr
  • paper_authors: David Wilkins
  • for: Provides a method for automatically screening scholarly sources, to assist with large-scale literature screening tasks.
  • methods: Uses the GPT-4 Large Language Model (LLM) together with the chain-of-thought technique to screen sources automatically.
  • results: In validation, GPTscreenR performed similarly to an alternative zero-shot technique, with 71% sensitivity, 89% specificity, and 84% overall accuracy.
    Abstract Scoping reviews, a type of literature review, require intensive human effort to screen large numbers of scholarly sources for their relevance to the review objectives. This manuscript introduces GPTscreenR, a package for the R statistical programming language that uses the GPT-4 Large Language Model (LLM) to automatically screen sources. The package makes use of the chain-of-thought technique with the goal of maximising performance on complex screening tasks. In validation against consensus human reviewer decisions, GPTscreenR performed similarly to an alternative zero-shot technique, with a sensitivity of 71%, specificity of 89%, and overall accuracy of 84%. Neither method achieved perfect accuracy nor human levels of intraobserver agreement. GPTscreenR demonstrates the potential for LLMs to support scholarly work and provides a user-friendly software framework that can be integrated into existing review processes.
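
A minimal sketch of chain-of-thought screening, the technique GPTscreenR applies with GPT-4, is given below. Note this is Python pseudocode rather than the R package's actual interface, and call_llm is a hypothetical stand-in for an API call.

```python
# Hedged sketch of chain-of-thought title/abstract screening; not the gptscreenr API.

def screening_prompt(objective: str, title: str, abstract: str) -> str:
    return (
        f"Review objective: {objective}\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        "Reason step by step about whether this source matches the objective, "
        "then answer INCLUDE or EXCLUDE on the final line."
    )

def screen(objective: str, title: str, abstract: str, call_llm) -> bool:
    reply = call_llm(screening_prompt(objective, title, abstract))
    # Keep only the verdict on the last line; the reasoning above it is discarded.
    return reply.strip().splitlines()[-1].upper().startswith("INCLUDE")

# Dummy LLM standing in for GPT-4:
fake_llm = lambda p: "The abstract clearly addresses the objective.\nINCLUDE"
print(screen("LLMs for clinical triage", "GPT-4 in the ED", "We evaluate...", fake_llm))
```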

Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey

  • paper_url: http://arxiv.org/abs/2311.07914
  • repo_url: None
  • paper_authors: Garima Agrawal, Tharindu Kumarage, Zeyad Alghami, Huan Liu
  • for: Surveys the hallucination problem in contemporary LLMs and how augmenting them with external knowledge can reduce hallucinations and improve reasoning accuracy.
  • methods: Categorizes knowledge-graph-based knowledge augmentation techniques for LLMs into three main groups, offering methodological comparisons and empirical evaluations.
  • results: Finds that knowledge-graph augmentation can effectively reduce hallucinations and improve reasoning accuracy, while noting remaining challenges and avenues for future research.
    Abstract The contemporary LLMs are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. To address this critical limitation, researchers employ diverse strategies to augment the LLMs by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. Among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. In this survey, we conduct a comprehensive review of these knowledge-graph-based knowledge augmentation techniques in LLMs, focusing on their efficacy in mitigating hallucinations. We systematically categorize these methods into three overarching groups, offering both methodological comparisons and empirical evaluations of their performance. Lastly, the paper explores the challenges associated with these techniques and outlines potential avenues for future research in this emerging field.
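
One family of methods the survey covers, retrieving knowledge-graph triples at inference time and grounding the prompt in them, can be sketched as below; the toy triple store and prompt wording are illustrative only.

```python
# Hedged sketch of KG-augmented prompting; triples and template are toy examples.

TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
]

def retrieve(entity: str):
    # Naive retrieval: return every triple that mentions the entity.
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def grounded_prompt(question: str, entity: str) -> str:
    facts = "\n".join(f"{s} {p} {o}" for s, p, o in retrieve(entity))
    # Grounding the LLM in retrieved facts is what reduces hallucination.
    return f"Known facts:\n{facts}\n\nUsing only the facts above, answer:\n{question}"

print(grounded_prompt("Where was Marie Curie born?", "Marie Curie"))
```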

CPopQA: Ranking Cultural Concept Popularity by LLMs

  • paper_url: http://arxiv.org/abs/2311.07897
  • repo_url: None
  • paper_authors: Ming Jiang, Mansi Joshi
  • for: Examines whether Large Language Models (LLMs) accurately capture corpus-level statistical tendencies of cultural concepts, particularly the popularity of long-tail holidays.
  • methods: Introduces a novel few-shot question-answering task (CPopQA) to test LLMs' statistical ranking abilities.
  • results: Experiments show that large models can rank long-tail cultural concepts by their statistical tendency; GPT-3.5 performs best and can identify geo-cultural proximity across continents.
    Abstract Prior work has demonstrated large language models' (LLMs) potential to discern statistical tendencies within their pre-training corpora. Despite that, many examinations of LLMs' knowledge capacity focus on knowledge explicitly appearing in the training data or implicitly inferable from similar contexts. How well an LLM captures the corpus-level statistical trends of concepts for reasoning, especially long-tail ones, is still underexplored. In this study, we introduce a novel few-shot question-answering task (CPopQA) that examines LLMs' statistical ranking abilities for long-tail cultural concepts (e.g., holidays), with a specific focus on these concepts' popularity in the United States and the United Kingdom, respectively. We curate a dataset containing 459 holidays across 58 countries, generating a total of 6,000 QA testing pairs. Experiments on four strong LLMs show that large models are capable of ranking long-tail cultural concepts regarding their statistical tendency. Notably, GPT-3.5 displayed superior performance and exhibited its potential to identify geo-cultural proximity across continents.
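
The task format lends itself to a short few-shot prompt sketch; the template and the example holidays below are placeholders, not items from the released dataset.

```python
# Hedged sketch of a CPopQA-style few-shot ranking prompt; examples are placeholders.

SHOTS = [
    ("Christmas", "Boxing Day", "Christmas"),
    ("Thanksgiving", "Arbor Day", "Thanksgiving"),
]

def cpopqa_prompt(a: str, b: str, country: str = "the United States") -> str:
    demos = "\n".join(
        f"Q: Which holiday is more popular in {country}: {x} or {y}?\nA: {z}"
        for x, y, z in SHOTS
    )
    return f"{demos}\nQ: Which holiday is more popular in {country}: {a} or {b}?\nA:"

print(cpopqa_prompt("Halloween", "Flag Day"))
```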

Fair Abstractive Summarization of Diverse Perspectives

  • paper_url: http://arxiv.org/abs/2311.07884
  • repo_url: https://github.com/psunlpgroup/fairsumm
  • paper_authors: Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Fabbri, Junru Liu, Ryo Kamoi, Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, Kathleen McKeown, Rui Zhang
  • for: Studies how to produce fair abstractive summaries that do not underrepresent the perspectives of any group.
  • methods: Proposes four reference-free automatic evaluation metrics that measure whether an abstractive summary is fair.
  • results: Experiments show that both model-generated summaries and human-written reference summaries suffer from low fairness. The study also analyzes common factors influencing fairness and proposes three simple yet effective methods to alleviate unfair summarization.
    Abstract People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people and propose four reference-free automatic metrics measuring the differences between target and source perspectives. We evaluate five LLMs, including three GPT models, Alpaca, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.
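
As a concrete, if simplified, picture of reference-free fairness measurement: compare the distribution of perspectives in the source documents against the distribution in the summary. The labels and the use of total-variation distance below are illustrative; the paper defines four specific metrics.

```python
# Hedged illustration of a reference-free fairness check; not the paper's exact metrics.
from collections import Counter

def distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def tv_distance(p, q):
    # Total-variation distance between two perspective distributions.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

source_views = ["pro", "pro", "con", "con", "neutral"]  # one label per source text
summary_views = ["pro", "pro", "pro"]                   # one label per summary sentence
gap = tv_distance(distribution(source_views), distribution(summary_views))
print(gap)  # larger gap = the summary underrepresents some perspectives
```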

Learning Mutually Informed Representations for Characters and Subwords

  • paper_url: http://arxiv.org/abs/2311.07853
  • repo_url: None
  • paper_authors: Yilin Wang, Xinyi Hu, Matthew R. Gormley
  • for: Proposes a new language model, the entanglement model, that combines character and subword language models.
  • methods: Inspired by vision-language models, the model treats characters and subwords as two separate modalities and generates mutually informed representations for both granularities.
  • results: The model excels at text classification, named entity recognition, and POS tagging, particularly on noisy text and in low-resource languages. It also outperforms its backbone language models on all English sequence labeling and classification tasks.
    Abstract Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them outputs useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, and POS-tagging tasks. Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. Our anonymized code is available at https://anonymous.4open.science/r/noisy-IE-A673
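
The "mutually informed" idea can be sketched as a pair of cross-attention passes, one per granularity, echoing the vision-language analogy in the abstract. The dimensions, layer structure, and residual fusion here are toy choices, not the paper's architecture.

```python
# Hedged sketch of mutually informed character/subword representations (toy sizes).
import torch
import torch.nn as nn

class EntangleLayer(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.sub_attends_char = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.char_attends_sub = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, char_h, sub_h):
        # Each granularity queries the other, so both outputs carry
        # character-level and subword-level information.
        sub_upd, _ = self.sub_attends_char(sub_h, char_h, char_h)
        char_upd, _ = self.char_attends_sub(char_h, sub_h, sub_h)
        return char_h + char_upd, sub_h + sub_upd

chars = torch.randn(2, 40, 64)  # (batch, num_characters, dim)
subs = torch.randn(2, 12, 64)   # (batch, num_subwords, dim)
char_out, sub_out = EntangleLayer()(chars, subs)
print(char_out.shape, sub_out.shape)  # both granularities get output representations
```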

On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model

  • paper_url: http://arxiv.org/abs/2311.07820
  • repo_url: None
  • paper_authors: Nohil Park, Joonsuk Park, Kang Min Yoo, Sungroh Yoon
  • for: Investigates using prompt tuning to improve multilingual models' adaptability to cross-lingual tasks.
  • methods: Compares token-based prompt tuning with fine-tuning on the decoder-based multilingual model XGLM across four cross-lingual tasks.
  • results: Prompt tuning matches or exceeds fine-tuning across all languages while updating at most 0.13% of model parameters, and is more effective than fine-tuning at improving performance on low-resource languages.
    Abstract An exciting advancement in the field of multilingual models is the emergence of autoregressive models with zero- and few-shot capabilities, a phenomenon widely reported in large-scale language models. To further improve model adaptation to cross-lingual tasks, another trend is to further fine-tune the language models with either full fine-tuning or parameter-efficient tuning. However, the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models has yet to be studied. Specifically, we lack an understanding of the role of linguistic distributions in multilingual models in the effectiveness of token-based prompt tuning. To address this question, we conduct experiments comparing prompt tuning and fine-tuning on the decoder-based multilingual model, XGLM, with four cross-lingual tasks (XNLI, PAWS-X, POS, NER). According to our study, prompt tuning achieves performance on par with or better than fine-tuning across all languages while updating at most 0.13% of the model parameters. Moreover, we empirically show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning. Our further analysis shows that the phenomenon is related to the tokenization scheme of the multilingual model.
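
The 0.13% figure corresponds to training only a short sequence of soft prompt embeddings while the backbone stays frozen; a minimal sketch, with toy dimensions rather than XGLM's, is shown below.

```python
# Hedged sketch of soft prompt tuning (toy dimensions, not XGLM's).
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int = 20, dim: int = 1024):
        super().__init__()
        # These embeddings are the ONLY trainable parameters.
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq, dim)
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

embeds = torch.randn(4, 16, 1024)  # embeddings from the frozen backbone
print(SoftPrompt()(embeds).shape)  # torch.Size([4, 36, 1024])
# During training, backbone.requires_grad_(False); only SoftPrompt.prompt updates.
```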