cs.CL - 2023-09-09

Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System

  • paper_url: http://arxiv.org/abs/2309.04858
  • repo_url: None
  • paper_authors: Daphne Ippolito, Nicholas Carlini, Katherine Lee, Milad Nasr, Yun William Yu
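  • for: This work reverse-engineers the decoding method behind a text generation system, which has implications for detecting generated text and for exposing biases introduced by the decoding settings.
  • methods: Given only black-box access, the authors infer which truncation strategy (top-$k$ or nucleus sampling) a system uses. The discovery process also reveals biases caused by settings that severely truncate the model's predicted distribution.
  • results: The attack is carried out on several families of open-source language models and on production systems such as ChatGPT.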
    Abstract Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).
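    Summary Neural language models are increasingly deployed in APIs and websites that let users submit a prompt and receive generated text. Many of these systems do not disclose their generation parameters. This paper presents methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Identifying the decoding strategy matters for detecting generated text. In addition, the discovery process can reveal biases caused by decoding settings that truncate the model's predicted distribution. The attack is applied to several families of open-source language models and to production systems such as ChatGPT.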
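
The two truncation schemes named in the abstract can be sketched in a few lines. This is a minimal illustration and not the authors' attack: the toy distribution `probs`, the helper names, and the use of a distinct-token count as a rough lower bound on $k$ are all assumptions made here for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def count_distinct_samples(dist, n_samples, seed=0):
    """Sample repeatedly from a truncated distribution and count distinct tokens."""
    rng = random.Random(seed)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return len(set(rng.choices(tokens, weights=weights, k=n_samples)))


# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]
top2 = top_k_filter(probs, k=2)
nucleus = nucleus_filter(probs, p=0.85)
observed = count_distinct_samples(top2, n_samples=2000)
```

Under top-$k$ sampling, the number of distinct tokens observed for a fixed prompt can never exceed $k$, so the count gives a lower bound on $k$ that tightens as more samples are drawn.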

Leveraging Large Language Models for Exploiting ASR Uncertainty

  • paper_url: http://arxiv.org/abs/2309.04842
  • repo_url: None
  • paper_authors: Pranay Dighe, Yi Su, Shangshang Zheng, Yunshu Liu, Vineet Garg, Xiaochuan Niu, Ahmed Tewfik
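  • for: Improve LLM performance on spoken language understanding tasks without resorting to complex or specialized architectures.
  • methods: The LLM is prompted with an n-best list of ASR hypotheses rather than only the 1-best hypothesis, and Low-Rank Adapters are finetuned on the downstream tasks.
  • results: Systems prompted with n-best lists outperform those using the 1-best ASR hypothesis on a device-directed speech detection task and a keyword spotting task.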
    Abstract While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing a high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We explore prompt-engineering to explain the concept of n-best lists to the LLM; followed by the finetuning of Low-Rank Adapters on the downstream tasks. Our approach using n-best lists proves to be effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using 1-best ASR hypothesis; thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
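    Summary Large language models excel at many NLP tasks, but for spoken language understanding (SLU) they must either rely on an off-the-shelf automatic speech recognition (ASR) system for transcription or be equipped with a built-in speech modality. This work addresses the former case, where the LLM's SLU accuracy is bounded by the accuracy of a fixed ASR system on the spoken input. Specifically, it tackles speech-intent classification, where a high word error rate can limit the LLM's ability to understand the spoken intent. Rather than pursuing accuracy through complex or specialized architectures, the authors ask how far one can go without substantially changing the underlying ASR and LLM, which can then be shared across multiple unrelated tasks. To this end, they prompt the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. They use prompt engineering to explain the concept of n-best lists to the LLM and then finetune Low-Rank Adapters on the downstream tasks. The approach proves effective on device-directed speech detection and keyword spotting, where systems using n-best list prompts outperform those using the 1-best hypothesis, offering an efficient way to exploit ASR uncertainty via LLMs in speech-based applications.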
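
As a rough sketch of what prompting with an n-best list could look like, the snippet below formats ranked ASR hypotheses into a single prompt string. The prompt wording, the `build_nbest_prompt` helper, and the example hypotheses and scores are hypothetical. The abstract does not give the authors' actual prompt template.

```python
def build_nbest_prompt(hypotheses, task_description):
    """Format (text, score) ASR hypotheses, best first, into one prompt string."""
    lines = [task_description, "ASR hypotheses, ranked from most to least likely:"]
    for rank, (text, score) in enumerate(hypotheses, start=1):
        lines.append(f"{rank}. {text} (score: {score:.2f})")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical 3-best list for one utterance.
hypotheses = [
    ("play some jazz", 0.61),
    ("play sam jazz", 0.27),
    ("lay some jazz", 0.12),
]
prompt = build_nbest_prompt(
    hypotheses,
    "Decide whether the utterance is directed at the device (yes/no).",
)
```

Passing all hypotheses lets the model see where the recognizer was uncertain (here, "some" versus "sam"), which the 1-best hypothesis alone would hide.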

Neurons in Large Language Models: Dead, N-gram, Positional

  • paper_url: http://arxiv.org/abs/2309.04827
  • repo_url: None
  • paper_authors: Elena Voita, Javier Ferrando, Christoforos Nalmpantis
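  • for: A lightweight analysis of a family of large language models that can be run on a single GPU.
  • methods: The authors study the OPT family, from 125m to 66b parameters, relying only on whether each FFN neuron is activated or not.
  • results: The early part of the network is sparse and represents many discrete features. In some layers of the 66b model, more than 70% of neurons are "dead", i.e., they never activate on a large collection of diverse data. Many of the alive neurons act as token and n-gram detectors. Their FFN updates not only promote next-token candidates but also explicitly remove information about the tokens that triggered them. To the authors' knowledge, this is the first example of a mechanism specialized in removing (rather than adding) information from the residual stream. With scale, models become sparser, with more dead neurons and token detectors. Finally, some neurons are positional: whether they activate depends largely on position rather than on textual content.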
    Abstract We analyze a family of large language models in a lightweight manner that can be done on a single GPU. Specifically, we focus on the OPT family of models ranging from 125m to 66b parameters and rely only on whether an FFN neuron is activated or not. First, we find that the early part of the network is sparse and represents many discrete features. Here, many neurons (more than 70% in some layers of the 66b model) are "dead", i.e. they never activate on a large collection of diverse data. At the same time, many of the alive neurons are reserved for discrete features and act as token and n-gram detectors. Interestingly, their corresponding FFN updates not only promote next token candidates as could be expected, but also explicitly focus on removing the information about the tokens that trigger them, i.e., the current input. To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream. With scale, models become more sparse in the sense that they have more dead neurons and token detectors. Finally, some neurons are positional: whether they are activated depends largely (or solely) on position and less so (or not at all) on textual data. We find that smaller models have sets of neurons acting as position range indicators while larger models operate in a less explicit manner.
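    Summary The authors analyze a family of large language models in a lightweight way that fits on a single GPU. They focus on the OPT family, with parameter counts from 125m to 66b, and rely only on whether each FFN neuron is active. The early part of the network is sparse and represents many discrete features. In some layers of the 66b model, more than 70% of neurons are "dead", i.e., they never activate on a large and diverse collection of data. Many of the alive neurons act as token and n-gram detectors, and their FFN updates not only promote next-token candidates but also explicitly remove information about the current input. The authors believe this is the first reported mechanism specialized in removing information from the residual stream rather than adding it. With scale, models become sparser, with more dead neurons and token detectors. Finally, some neurons are positional: their activation depends mainly on position rather than on textual data. Smaller models have sets of neurons that act as position range indicators, while larger models operate in a less explicit manner.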
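
The "dead neuron" criterion described above (a neuron that never activates on the data) can be sketched as follows. This is an illustrative sketch under stated assumptions: the toy `activations` matrix stands in for post-ReLU FFN activations collected over many tokens, and the helper name is hypothetical.

```python
def dead_neuron_fraction(activations):
    """Return (indices of neurons that never fire, fraction of such neurons).

    `activations` is a list of rows, one per token, each holding the
    post-ReLU activation of every FFN neuron in a single layer.
    """
    n_neurons = len(activations[0])
    ever_active = [False] * n_neurons
    for row in activations:
        for j, value in enumerate(row):
            if value > 0:
                ever_active[j] = True
    dead = [j for j, fired in enumerate(ever_active) if not fired]
    return dead, len(dead) / n_neurons


# Toy data: 3 tokens, 4 neurons. Neurons 2 and 3 never fire.
activations = [
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
]
dead, fraction = dead_neuron_fraction(activations)
```

Because the test only records whether each activation is positive, it needs one boolean per neuron rather than stored activation values, which is consistent with the analysis fitting on a single GPU.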

FaNS: a Facet-based Narrative Similarity Metric

  • paper_url: http://arxiv.org/abs/2309.04823
  • repo_url: None
  • paper_authors: Mousumi Akter, Shubhra Kanti Karmaker Santu
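  • for: Propose a new narrative similarity metric that compares narratives at a finer level of detail.
  • methods: State-of-the-art large language models (LLMs) extract the 5W1H facets (Who, What, When, Where, Why, and How). Narratives are matched on each of the six facets independently and the results are then combined.
  • results: FaNS shows a 37% higher correlation than traditional text similarity metrics that directly measure lexical/semantic match, indicating that it better captures the finer details of a pair of narratives.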
    Abstract Similar Narrative Retrieval is a crucial task since narratives are essential for explaining and understanding events, and multiple related narratives often help to create a holistic view of the event of interest. To accurately identify semantically similar narratives, this paper proposes a novel narrative similarity metric called Facet-based Narrative Similarity (FaNS), based on the classic 5W1H facets (Who, What, When, Where, Why, and How), which are extracted by leveraging the state-of-the-art Large Language Models (LLMs). Unlike existing similarity metrics that only focus on overall lexical/semantic match, FaNS provides a more granular matching along six different facets independently and then combines them. To evaluate FaNS, we created a comprehensive dataset by collecting narratives from AllSides, a third-party news portal. Experimental results demonstrate that the FaNS metric exhibits a higher correlation (37% higher) than traditional text similarity metrics that directly measure the lexical/semantic match between narratives, demonstrating its effectiveness in comparing the finer details between a pair of narratives.
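    Summary Similar narrative retrieval is an important task because narratives are essential for explaining and understanding events, and multiple related narratives help build a holistic view of an event. To identify semantically similar narratives accurately, this paper proposes a new narrative similarity metric, Facet-based Narrative Similarity (FaNS), based on the classic 5W1H facets (Who, What, When, Where, Why, and How). Unlike existing similarity metrics, FaNS provides a more granular match: it matches each of the six facets independently and then combines the results. To evaluate FaNS, the authors built a comprehensive dataset from narratives collected on AllSides, a third-party news portal. Experiments show that FaNS exhibits a higher correlation (37% higher) than traditional text similarity metrics, demonstrating its effectiveness in comparing the finer details of a pair of narratives.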
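
The "match each facet independently, then combine" idea can be illustrated with a small sketch. The per-facet Jaccard word overlap and the weighted average used here are assumptions chosen for simplicity. The abstract does not specify how FaNS scores or combines facets, and the facet values below are hypothetical rather than LLM-extracted.

```python
FACETS = ("who", "what", "when", "where", "why", "how")


def jaccard(a, b):
    """Word-level Jaccard overlap between two facet strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0  # a missing facet contributes no evidence of similarity
    return len(sa & sb) / len(sa | sb)


def facet_similarity(n1, n2, weights=None):
    """Score each facet independently, then take a weighted average."""
    weights = weights or {f: 1.0 for f in FACETS}
    total = sum(weights[f] for f in FACETS)
    score = sum(weights[f] * jaccard(n1.get(f, ""), n2.get(f, "")) for f in FACETS)
    return score / total


# Hypothetical facets for two narratives that differ only in "who".
a = {
    "who": "the mayor",
    "what": "opened the bridge",
    "when": "on monday",
    "where": "in springfield",
    "why": "to ease traffic",
    "how": "by cutting a ribbon",
}
b = dict(a, who="protesters")
score = facet_similarity(a, b)
```

Keeping the facets separate makes a disagreement on a single facet visible in the score, whereas a single whole-text similarity would mostly reflect the large shared remainder.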

MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images

  • paper_url: http://arxiv.org/abs/2309.04790
  • repo_url: None
  • paper_authors: Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao, Kang Liu
  • for: Address multimodal and heterogeneous question answering problems.
  • methods: The MMHQA-ICL framework, which includes a stronger heterogeneous data retriever, an image caption module, and a type-specific in-context learning strategy.
  • results: The framework achieves state-of-the-art results under the few-shot setting on the MultimodalQA dataset, outperforming all baselines and methods trained on the full dataset.
    Abstract In the real world, knowledge often exists in a multimodal and heterogeneous form. Question answering over hybrid data types, including text, tables, and images (MMHQA), is a challenging task. Recently, with the rise of large language models (LLMs), in-context learning (ICL) has become the most popular way to solve QA problems. We propose the MMHQA-ICL framework for addressing this problem, which includes a stronger heterogeneous data retriever and an image caption module. Most importantly, we propose a Type-specific In-context Learning Strategy for MMHQA, enabling LLMs to leverage their powerful performance in this task. We are the first to use an end-to-end LLM prompting method for this task. Experimental results demonstrate that our framework outperforms all baselines and methods trained on the full dataset, achieving state-of-the-art results under the few-shot setting on the MultimodalQA dataset.
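    Summary In the real world, knowledge often exists in multimodal and heterogeneous forms. Answering questions over hybrid data types, including text, tables, and images, is a challenging task (MMHQA). With the recent rise of large language models (LLMs), in-context learning (ICL) has become the most popular way to solve question answering problems. The authors propose the MMHQA-ICL framework, which includes a stronger heterogeneous data retriever and an image caption module. Most importantly, they propose a type-specific in-context learning strategy that lets LLMs bring their strong performance to this task. They are the first to use an end-to-end LLM prompting method for it. Experiments show that, under the few-shot setting, the framework outperforms all baselines and methods trained on the full dataset, achieving state-of-the-art results on the MultimodalQA dataset.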

Data Augmentation for Conversational AI

  • paper_url: http://arxiv.org/abs/2309.04739
  • repo_url: https://github.com/dataug-convai/dataug-convai.github.io
  • paper_authors: Heydar Soudani, Evangelos Kanoulas, Faegheh Hasibi
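  • for: A tutorial on data augmentation for conversational systems, motivated by the scarcity of dialogue training data in low-resource domains and languages.
  • methods: Surveys data augmentation (DA) approaches as a way to alleviate the data scarcity problem in conversational systems.
  • results: Gives a comprehensive, up-to-date overview covering conversation augmentation, open-domain and task-oriented conversation generation, and different paradigms for evaluating these models, along with current challenges and future directions.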
    Abstract Advancements in conversational systems have revolutionized information access, surpassing the limitations of single queries. However, developing dialogue systems requires a large amount of training data, which is a challenge in low-resource domains and languages. Traditional data collection methods like crowd-sourcing are labor-intensive and time-consuming, making them ineffective in this context. Data augmentation (DA) is an effective approach to alleviate the data scarcity problem in conversational systems. This tutorial provides a comprehensive and up-to-date overview of DA approaches in the context of conversational systems. It highlights recent advances in conversation augmentation, open domain and task-oriented conversation generation, and different paradigms of evaluating these models. We also discuss current challenges and future directions in order to help researchers and practitioners to further advance the field in this area.
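    Summary Advances in conversational systems have changed how information is accessed, moving beyond the limits of single queries. However, developing dialogue systems requires a large amount of training data, which is a challenge in low-resource domains and languages. Traditional data collection methods such as crowd-sourcing are labor-intensive and time-consuming, which makes them ineffective in this setting. Data augmentation (DA) is an effective way to address data scarcity in conversational systems. This tutorial gives a comprehensive, up-to-date overview of DA methods for conversational systems, including recent work on conversation augmentation, open-domain and task-oriented conversation generation, and different ways of evaluating these models. It also discusses current challenges and future directions to help researchers and practitioners advance the field.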

Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering

  • paper_url: http://arxiv.org/abs/2309.04734
  • repo_url: https://github.com/DeepLearnXMU/MM-MKP
  • paper_authors: Yifan Dong, Suhang Wu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jianxin Lin, Jinsong Su
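  • for: Propose a multi-modal keyphrase generation model that better captures the core points of an input text-image pair.
  • methods: The model both enriches its input with external knowledge and filters image noise. First, external visual entities of the image are added as supplementary input, which benefits cross-modal semantic alignment. Second, an image-text matching score and image region-text correlation scores are computed together to perform multi-granularity image noise filtering. In particular, correlation scores between image regions and ground-truth keyphrases are introduced to refine the calculation of the region-text correlation scores.
  • results: Several groups of experiments on the benchmark dataset show that the model achieves state-of-the-art performance. The code is available at https://github.com/DeepLearnXMU/MM-MKP.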
    Abstract Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair. In this regard, dominant methods mainly focus on multi-modal fusion for keyphrase generation. Nevertheless, there are still two main drawbacks: 1) only a limited number of sources, such as image captions, can be utilized to provide auxiliary information. However, they may not be sufficient for the subsequent keyphrase generation. 2) the input text and image are often not perfectly matched, and thus the image may introduce noise into the model. To address these limitations, in this paper, we propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise. First, we introduce external visual entities of the image as the supplementary input to the model, which benefits the cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. Particularly, we introduce the correlation scores between image regions and ground-truth keyphrases to refine the calculation of the previously-mentioned correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves the state-of-the-art performance. Our code is available on https://github.com/DeepLearnXMU/MM-MKP.
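    Summary Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of an input text-image pair. Dominant methods mainly focus on multi-modal fusion for keyphrase generation. However, two main drawbacks remain: 1) only a limited number of sources, such as image captions, can provide auxiliary information, and these may not be sufficient for the subsequent keyphrase generation. 2) the input text and image are often not well matched, so the image may introduce noise into the model. To address these limitations, the authors propose a new multi-modal keyphrase generation model that both enriches the model input and effectively filters image noise. First, external visual entities of the image are used as supplementary input, which promotes cross-modal semantic alignment. Second, an image-text matching score and image region-text correlation scores are computed together to perform multi-granularity image noise filtering. In particular, correlation scores between image regions and ground-truth keyphrases are introduced to refine the previously mentioned correlation scores. Several groups of experiments and in-depth analyses show that the model achieves state-of-the-art performance on the benchmark dataset. The code is available at https://github.com/DeepLearnXMU/MM-MKP.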

EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets

  • paper_url: http://arxiv.org/abs/2309.04725
  • repo_url: None
  • paper_authors: Hongyuan Lu, Wai Lam
  • for: This paper aims to improve the performance of large language models (LLMs) on various natural language processing (NLP) tasks by developing a novel method called Easy Prompt Augmentation (EPA).
  • methods: The proposed EPA method uses paraphrasing as an augmentation method to automatically generate multiple sources/targets for demonstrations, which are then used to improve the performance of LLMs on NLP tasks.
  • results: The proposed EPA method effectively improves the performance of LLMs on various NLP tasks, including natural language inference and machine translation, covering tens of languages.
    Abstract Large language models (LLMs) have shown promising performance on various NLP tasks via task prompting, and their performance can be further improved by appending task demonstrations to the head of the prompt. Usually, better performance can be achieved with more demonstrations. However, asking the users to write the demonstrations can be cumbersome. As a simple yet cost-effective workaround, this paper proposes a novel method called EPA (Easy Prompt Augmentation) that effectively minimizes user efforts in writing demonstrations while improving the model performance at the same time. (Footnote: While this paper considers augmenting prompts via demonstrations, we name it EPA as the name EDA is already taken by a well-known NLP method (Wei and Zou, 2019).) EPA achieves these goals by automatically augmenting the demonstrations with multiple sources/targets, where each of them paraphrases each other. This is well motivated as augmenting data via paraphrasing effectively improves neural language models. EPA thus employs paraphrasing as an augmentation method for in-context learning. Extensive experiments indicate that EPA effectively improves both NLU and NLG tasks, covering from natural language inference to machine translation in translating tens of languages. (Footnote: Code and data will be released upon publication.)
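    Summary Large language models (LLMs) have shown promising performance on a range of NLP tasks, and that performance can be improved further by placing task demonstrations at the head of the prompt. However, asking users to write demonstrations can be difficult and time-consuming. To address this, the paper proposes a simple yet cost-effective method called EPA (Easy Prompt Augmentation), so named because the name EDA is already taken. EPA automatically augments the demonstrations with multiple sources/targets, each paraphrasing the others, to improve model performance. This is motivated by the fact that augmenting data via paraphrasing improves neural language models. EPA therefore uses paraphrasing as an augmentation method for in-context learning. Experiments show that EPA improves a variety of NLP tasks, from natural language inference to machine translation across many languages. Code and data will be released upon publication.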
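
One way to picture "multiple sources/targets that paraphrase each other" is to expand a single demonstration into several. The sketch below pairs every source paraphrase with every target paraphrase. This cross-product pairing, the `Input:`/`Output:` template, and the example sentences are assumptions for illustration, since the abstract does not describe how EPA pairs or formats them. The paraphrases are supplied by hand here, while the paper generates them automatically.

```python
from itertools import product


def augment_demonstrations(source_paraphrases, target_paraphrases):
    """Turn one demonstration into many by pairing its paraphrases."""
    return [
        f"Input: {src}\nOutput: {tgt}"
        for src, tgt in product(source_paraphrases, target_paraphrases)
    ]


def build_prompt(demonstrations, query):
    """Place the demonstrations ahead of the query, as in few-shot prompting."""
    return "\n\n".join(demonstrations + [f"Input: {query}\nOutput:"])


# Hypothetical English-to-French translation demonstration with paraphrases.
sources = ["Where is the station?", "Could you tell me where the station is?"]
targets = ["Où est la gare ?", "Où se trouve la gare ?"]
demos = augment_demonstrations(sources, targets)
prompt = build_prompt(demos, "Where is the library?")
```

A single user-written pair becomes four demonstrations here, which is the sense in which the user's writing effort is reduced.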

Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

  • paper_url: http://arxiv.org/abs/2309.04679
  • repo_url: None
  • paper_authors: C. M. Downey, Terra Blevins, Nora Goldfine, Shane Steinert-Threlkeld
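  • for: Improve model performance in low-resource languages by specializing the vocabulary and embedding matrix of multilingual models.
  • methods: Several simple techniques replace the cross-lingual vocabulary with a compact, language-specific one, with a focus on strategies for re-initializing the token embedding matrix after vocabulary specialization.
  • results: Replacing cross-lingual vocabularies with smaller specialized ones is an efficient way to improve performance in low-resource languages, and simple re-initialization techniques based on script-wise sub-distributions rival the Focus method.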
    Abstract Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix comes at considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Namely, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, in addition to the recently-proposed Focus method. We demonstrate that: 1) Embedding-replacement techniques in the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones provides an efficient method to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival techniques such as Focus, which rely on similarity scores obtained from an auxiliary model.
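    Summary Pre-trained multilingual language models underpin a large share of modern NLP tools outside of English. A strong baseline for specializing these models to specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix adds considerable computational cost during adaptation. This study proposes several simple techniques to replace the cross-lingual vocabulary with a compact, language-specific one, addressing strategies for re-initializing the token embedding matrix after vocabulary specialization. The authors then systematically compare their techniques with the recently proposed Focus method. The results show that: 1) embedding-replacement techniques from the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones is an efficient way to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival Focus.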
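
A script-wise re-initialization can be sketched roughly as follows: new tokens receive embeddings sampled from a Gaussian fitted to the old embeddings of tokens in the same script. Every detail here is an assumption made for illustration, including the per-dimension Gaussian, copying embeddings for tokens that already exist, detecting the script from the Unicode character name, and the toy two-dimensional vectors. The abstract does not spell out the authors' procedure.

```python
import random
import statistics
import unicodedata


def script_of(token):
    """Guess a token's script from the Unicode name of its first letter."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"


def reinit_embeddings(old_emb, new_vocab, seed=0):
    """Build embeddings for new_vocab from script-wise statistics of old_emb."""
    rng = random.Random(seed)
    by_script = {}
    for tok, vec in old_emb.items():
        by_script.setdefault(script_of(tok), []).append(vec)
    all_vecs = list(old_emb.values())
    new_emb = {}
    for tok in new_vocab:
        if tok in old_emb:  # assumption: overlapping tokens keep their vector
            new_emb[tok] = list(old_emb[tok])
            continue
        # Fall back to all old vectors if the script was never seen.
        pool = by_script.get(script_of(tok), all_vecs)
        new_emb[tok] = [
            rng.gauss(statistics.fmean(dim), statistics.pstdev(dim))
            for dim in zip(*pool)
        ]
    return new_emb


# Toy 2-dimensional embeddings: two Latin tokens, one Cyrillic token.
old_emb = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "кот": [0.0, 5.0]}
new_emb = reinit_embeddings(old_emb, ["cat", "пёс", "bird"])
```

In this toy example each script's pool has zero variance, so the new Cyrillic token lands exactly on the Cyrillic mean and the new Latin token on the Latin mean. With real embeddings the samples would scatter around each script's mean.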

MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

  • paper_url: http://arxiv.org/abs/2309.04662
  • repo_url: None
  • paper_authors: Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat
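  • for: Introduce MADLAD-400, a new general-domain, 3T-token monolingual dataset based on CommonCrawl and spanning 419 languages.
  • methods: The authors self-audit the dataset to expose its limitations and discuss the role data auditing played in the dataset creation process.
  • results: Using publicly available data, the authors train and release a 10.7B-parameter multilingual machine translation model, and also train an 8B-parameter language model, evaluating across different domains. The translation model is competitive with significantly larger models, the language model is assessed on few-shot translation, and the baseline models are made available to the research community.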
    Abstract We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train an 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.
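    Summary The authors introduce MADLAD-400, a manually audited, general-domain, 3T-token monolingual dataset based on CommonCrawl and spanning 419 languages. They discuss the limitations revealed by self-auditing MADLAD-400 and the role data auditing played in creating the dataset. They then train and release a 10.7B-parameter multilingual machine translation model on publicly available data covering over 450 languages, and find it competitive with significantly larger models. They also train an 8B-parameter language model and assess it on few-shot translation. The baseline models are released to the research community.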

Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf

  • paper_url: http://arxiv.org/abs/2309.04658
  • repo_url: None
  • paper_authors: Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, Yang Liu
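  • for: Explore how to engage large language models (LLMs) in communication games, and propose a tuning-free framework for doing so.
  • methods: The LLMs are kept frozen. Improvement comes from retrieving and reflecting on past communications and experiences.
  • results: Experiments show that the framework can play the game Werewolf effectively without tuning the LLMs' parameters. Strategic behaviors begin to emerge, suggesting that engaging LLMs in communication games and related domains is a promising research direction.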
    Abstract Communication games, which we refer to as incomplete information games that heavily depend on natural language communication, hold significant research value in fields such as economics, social science, and artificial intelligence. In this work, we explore the problem of how to engage large language models (LLMs) in communication games, and in response, propose a tuning-free framework. Our approach keeps LLMs frozen, and relies on the retrieval and reflection on past communications and experiences for improvement. An empirical study on the representative and widely-studied communication game, "Werewolf", demonstrates that our framework can effectively play Werewolf game without tuning the parameters of the LLMs. More importantly, strategic behaviors begin to emerge in our experiments, suggesting that it will be a fruitful journey to engage LLMs in communication games and associated domains.
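    Summary Communication games, defined here as incomplete information games that depend heavily on natural language communication, hold significant research value in fields such as economics, social science, and artificial intelligence. This work studies how to engage large language models (LLMs) in communication games and proposes a tuning-free framework. The approach keeps the LLMs frozen and improves by drawing on past communications and experiences. In experiments on Werewolf, a representative and widely studied communication game, the framework plays the game effectively without any parameter tuning. Strategic behaviors also emerge in the experiments, suggesting that engaging LLMs in communication games and related domains is a promising line of research.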