2023-07-01

cs.CL

Improving Text Matching in E-Commerce Search with A Rationalizable, Intervenable and Fast Entity-Based Relevance Model

paper_url: http://arxiv.org/abs/2307.00370
repo_url: None
paper_authors: Jiong Cai, Yong Jiang, Yue Zhang, Chengyue Jiang, Ke Yu, Jianhui Ji, Rong Xiao, Haihong Tang, Tao Wang, Zhongqiang Huang, Pengjun Xie, Fei Huang, Kewei Tu
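for: The paper aims to improve search quality in e-commerce search systems, in particular by predicting whether an item is relevant to a user's query.
methods: The paper proposes the Entity-Based Relevance Model (EBRM), which decomposes the query-item (QI) relevance problem into multiple query-entity (QE) relevance problems and aggregates their results with a soft logic formulation, improving both accuracy and inference speed.
results: Experiments on labeled data from e-commerce websites show that EBRM achieves promising improvements while remaining computationally efficient.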

Abstract
Discovering the intended items of user queries from a massive repository of items is one of the main goals of an e-commerce search system. Relevance prediction is essential to the search system since it helps improve performance. When online serving a relevance model, the model is required to perform fast and accurate inference. Currently, the widely used models such as Bi-encoder and Cross-encoder have their limitations in accuracy or inference speed respectively. In this work, we propose a novel model called the Entity-Based Relevance Model (EBRM). We identify the entities contained in an item and decompose the QI (query-item) relevance problem into multiple QE (query-entity) relevance problems; we then aggregate their results to form the QI prediction using a soft logic formulation. The decomposition allows us to use a Cross-encoder QE relevance module for high accuracy as well as cache QE predictions for fast online inference. Utilizing soft logic makes the prediction procedure interpretable and intervenable. We also show that pretraining the QE module with auto-generated QE data from user logs can further improve the overall performance. The proposed method is evaluated on labeled data from e-commerce websites. Empirical results show that it achieves promising improvements with computation efficiency.

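Summary
Finding the items a user's query targets within a massive item repository is one of the main goals of an e-commerce search system. Accurate relevance prediction is a key component of the search system because it improves system performance. Widely used models such as the Bi-encoder and the Cross-encoder are limited in accuracy or inference speed, respectively. In this work, we propose a new model, the Entity-Based Relevance Model (EBRM). We identify the entities contained in an item and decompose the QI (query-item) relevance problem into multiple QE (query-entity) relevance problems; we then aggregate their results into a QI prediction using a soft logic formulation. This decomposition lets us use a Cross-encoder QE relevance module for high accuracy while caching QE predictions for fast online inference. Using soft logic makes the prediction procedure interpretable and intervenable. We also show that pretraining the QE module on QE data generated automatically from user logs further improves overall performance. The method is evaluated on labeled data from e-commerce websites, and the results show promising improvements together with computational efficiency.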
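
The decomposition described for EBRM can be illustrated with a small sketch. This is not the authors' implementation: the product-based soft conjunction, the class name `EntityBasedRelevance`, and the `overrides` argument are all assumptions made for illustration, and the QE scorer is a stand-in for the Cross-encoder module. It only shows how cached query-entity scores might be combined into a query-item score and how a single entity score could be overridden by hand.

```python
from typing import Callable, Dict, List, Optional, Tuple


def soft_and(scores: List[float]) -> float:
    """Soft conjunction as a product of probabilities (an assumed formulation)."""
    result = 1.0
    for s in scores:
        result *= s
    return result


class EntityBasedRelevance:
    """Hypothetical wrapper: scores (query, entity) pairs once and caches them."""

    def __init__(self, qe_scorer: Callable[[str, str], float]) -> None:
        self._qe_scorer = qe_scorer          # stand-in for a Cross-encoder QE module
        self._cache: Dict[Tuple[str, str], float] = {}
        self.scorer_calls = 0                # counts cache misses only

    def qe_score(self, query: str, entity: str) -> float:
        key = (query, entity)
        if key not in self._cache:
            self.scorer_calls += 1
            self._cache[key] = self._qe_scorer(query, entity)
        return self._cache[key]

    def qi_score(
        self,
        query: str,
        item_entities: List[str],
        overrides: Optional[Dict[str, float]] = None,
    ) -> float:
        """Aggregate QE scores into one QI score; overrides model an intervention."""
        overrides = overrides or {}
        scores = []
        for entity in item_entities:
            if entity in overrides:
                scores.append(overrides[entity])
            else:
                scores.append(self.qe_score(query, entity))
        return soft_and(scores)
```

Because each item is reduced to its entities, two items that share an entity reuse the same cached QE score, which is the source of the speed-up the abstract mentions. The per-entity scores also give a readable trace of why an item was judged relevant or not.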

BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer

paper_url: http://arxiv.org/abs/2307.00360
repo_url: None
paper_authors: Zuchao Li, Shitou Zhang, Hai Zhao, Yifei Yang, Dongjie Yang
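for: BatGPT is a large-scale language model that generates natural, fluent text in response to various types of input, including text prompts, images, and audio.
methods: The model uses a bidirectional autoregressive architecture that efficiently captures the complex dependencies of natural language, making it effective in language generation, dialog systems, and question answering. The bidirectional autoregressive modeling runs both left to right and right to left, which reduces fixed memory effects and alleviates model hallucinations.
results: A novel parameter expansion method leverages the pre-training of smaller models, and reinforcement learning from both AI and human feedback improves the model's alignment performance. Overall, these approaches improve BatGPT's effectiveness, and the model can be used in a wide range of natural language applications.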

Abstract
BatGPT is a large-scale language model designed and trained jointly by Wuhan University and Shanghai Jiao Tong University. It is capable of generating highly natural and fluent text in response to various types of input, including text prompts, images, and audio. In the modeling level, we employ a bidirectional autoregressive architecture that allows the model to efficiently capture the complex dependencies of natural language, making it highly effective in tasks such as language generation, dialog systems, and question answering. Moreover, the bidirectional autoregressive modeling not only operates from left to right but also from right to left, effectively reducing fixed memory effects and alleviating model hallucinations. In the training aspect, we propose a novel parameter expansion method for leveraging the pre-training of smaller models and employ reinforcement learning from both AI and human feedback, aimed at improving the model's alignment performance. Overall, these approaches significantly improve the effectiveness of BatGPT, and the model can be utilized for a wide range of natural language applications.

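Summary
BatGPT is a large-scale language model designed and trained jointly by Wuhan University and Shanghai Jiao Tong University. It generates highly natural and fluent text in response to various types of input, including text prompts, images, and audio. At the modeling level, we adopt a bidirectional autoregressive architecture that lets the model efficiently capture the complex dependencies of natural language, so it performs well in language generation, dialog systems, and question answering. The bidirectional autoregressive modeling runs not only left to right but also right to left, which reduces fixed memory effects and alleviates model hallucinations. On the training side, we propose a new parameter expansion method that leverages the pre-training of smaller models, and we use reinforcement learning from both AI and human feedback to improve the model's alignment. Overall, these approaches improve BatGPT's effectiveness, and the model can be applied to a wide range of natural language applications.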

Improving Multitask Retrieval by Promoting Task Specialization

paper_url: http://arxiv.org/abs/2307.00342
repo_url: https://github.com/wenzhengzhang/taco
paper_authors: Wenzheng Zhang, Chenyan Xiong, Karl Stratos, Arnold Overwijk
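for: The paper aims to improve multitask retrieval so that a single multitask retriever outperforms task-specific retrievers.
methods: The paper promotes task specialization through a pretrained model explicitly optimized for multitasking, compatible prompting, and a new adaptive learning method that encourages each parameter to specialize in a particular task.
results: On the KILT benchmark, the multitask retriever performs strongly at retrieving relevant contexts for multiple tasks. Analysis shows that the model indeed learns parameters that are more task-specialized than those of naive multitasking.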

Abstract
In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval in which a separate retriever is trained for each task. We show that it is possible to train a multitask retriever that outperforms task-specific retrievers by promoting task specialization. The main ingredients are: (1) a better choice of pretrained model (one that is explicitly optimized for multitasking) along with compatible prompting, and (2) a novel adaptive learning method that encourages each parameter to specialize in a particular task. The resulting multitask retriever is highly performant on the KILT benchmark. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.

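Summary
In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval, in which a separate retriever is trained for each task. We show that promoting task specialization makes it possible to train a multitask retriever that outperforms task-specific retrievers. The main ingredients are (1) a better pretrained model, one explicitly optimized for multitasking, together with compatible prompting, and (2) a new adaptive learning method that encourages each parameter to specialize in a particular task. The resulting multitask retriever performs strongly on the KILT benchmark. Analysis shows that the model indeed learns parameters that are more task-specialized than those of naive multitasking without prompting or adaptive learning.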

Single Sequence Prediction over Reasoning Graphs for Multi-hop QA

paper_url: http://arxiv.org/abs/2307.00335
repo_url: None
paper_authors: Gowtham Ramesh, Makesh Sreedhar, Junjie Hu
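for: The paper aims to improve the accuracy and interpretability of multi-hop question answering (QA) models.
methods: The paper proposes a single-sequence prediction method over a local reasoning graph (SeqGraph), which connects key entities in each context passage to relevant subsequent passages and fuses the graph representations into the model.
results: Experiments show improved answer exact-match/F1 scores and reasoning-path faithfulness on the HotpotQA dataset, and state-of-the-art numbers on the Musique dataset.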

Abstract
Recent generative approaches for multi-hop question answering (QA) utilize the fusion-in-decoder method (Izacard and Grave, 2021) to generate a single sequence output which includes both a final answer and a reasoning path taken to arrive at that answer, such as passage titles and key facts from those passages. While such models can lead to better interpretability and high quantitative scores, they often have difficulty accurately identifying the passages corresponding to key entities in the context, resulting in incorrect passage hops and a lack of faithfulness in the reasoning path. To address this, we propose a single-sequence prediction method over a local reasoning graph (SeqGraph) that integrates a graph structure connecting key entities in each context passage to relevant subsequent passages for each question. We use a graph neural network to encode this graph structure and fuse the resulting representations into the entity representations of the model. Our experiments show significant improvements in answer exact-match/F1 scores and faithfulness of grounding in the reasoning path on the HotpotQA dataset and achieve state-of-the-art numbers on the Musique dataset with only up to a 4% increase in model parameters. Code and models will be released at https://github.com/gowtham1997/SeqGraph.

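Summary
Recent generative approaches to multi-hop question answering (QA) use the fusion-in-decoder method (Izacard and Grave, 2021) to generate a single output sequence that contains both the final answer and the reasoning path taken to reach it, such as passage titles and key facts. Such models can offer better interpretability and high quantitative scores, but they often struggle to identify the passages that correspond to key entities in the context, which leads to incorrect passage hops and an unfaithful reasoning path. To address this, we propose a single-sequence prediction method over a local reasoning graph (SeqGraph) that uses a graph structure to connect key entities in each context passage to the relevant subsequent passages for each question. We encode this graph with a graph neural network and fuse the resulting representations into the model's entity representations. Experiments show improvements in answer exact-match/F1 scores and in the faithfulness of the reasoning path on the HotpotQA dataset, and state-of-the-art numbers on the Musique dataset with at most a 4% increase in model parameters.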

Let Me Teach You: Pedagogical Foundations of Feedback for Language Models

paper_url: http://arxiv.org/abs/2307.00279
repo_url: None
paper_authors: Beatriz Borges, Niket Tandon, Tanja Käser, Antoine Bosselut
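for: The work presents FELT, a feedback framework for aligning Large Language Models (LLMs) with human preferences.
methods: The work draws on established feedback models from pedagogy, applies them to Natural Language Feedback (NLF), and proposes a feedback content taxonomy.
results: The study finds that different types of feedback affect the revised generations of LLMs differently, and it points out new directions for NLF research.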

Abstract
Natural Language Feedback (NLF) is an increasingly popular avenue to align Large Language Models (LLMs) to human preferences. Despite the richness and diversity of the information it can convey, NLF is often hand-designed and arbitrary. In a different world, research in pedagogy has long established several effective feedback models. In this opinion piece, we compile ideas from pedagogy to introduce FELT, a feedback framework for LLMs that outlines the various characteristics of the feedback space, and a feedback content taxonomy based on these variables. Our taxonomy offers both a general mapping of the feedback space, as well as pedagogy-established discrete categories, allowing us to empirically demonstrate the impact of different feedback types on revised generations. In addition to streamlining existing NLF designs, FELT also brings out new, unexplored directions for research in NLF. We make our taxonomy available to the community, providing guides and examples for mapping our categorizations to future resources.

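Summary
Natural Language Feedback (NLF) is an increasingly popular way to align Large Language Models (LLMs) with human preferences. Although NLF can convey rich and diverse information, it is often hand-designed and arbitrary. Meanwhile, research in pedagogy has long established several effective feedback models. In this opinion piece, we draw on pedagogy to introduce FELT, a feedback framework for LLMs that outlines the characteristics of the feedback space, along with a feedback content taxonomy based on these variables. The taxonomy offers both a general mapping of the feedback space and pedagogy-established discrete categories, which lets us empirically demonstrate the impact of different feedback types on revised generations. Besides streamlining existing NLF designs, FELT brings out new, unexplored directions for NLF research. We make the taxonomy available to the community, with guides and examples for mapping our categorizations to future resources.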

Discovering Patterns of Definitions and Methods from Scientific Documents

paper_url: http://arxiv.org/abs/2307.01216
repo_url: None
paper_authors: Yutian Sun, Hai Zhuge
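for: The paper proposes a method for automatically extracting definitions and methods from scientific documents.
methods: The method discovers patterns of definitions and methods, with completeness ensured by semantic relations at the semantic level and by syntactic and lexical constraints.
results: Experiments show that the discovered patterns extract definitions and methods from scientific documents effectively and can be tailored or extended for other applications.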

Abstract
The difficulties of automatic extraction of definitions and methods from scientific documents lie in two aspects: (1) the complexity and diversity of natural language texts, which requests an analysis method to support the discovery of pattern; and, (2) a complete definition or method represented by a scientific paper is usually distributed within text, therefore an effective approach should not only extract single sentence definitions and methods but also integrate the sentences to obtain a complete definition or method. This paper proposes an analysis method for discovering patterns of definition and method and uses the method to discover patterns of definition and method. Completeness of the patterns at the semantic level is guaranteed by a complete set of semantic relations that identify definitions and methods respectively. The completeness of the patterns at the syntactic and lexical levels is guaranteed by syntactic and lexical constraints. Experiments on the self-built dataset and two public definition datasets show that the discovered patterns are effective. The patterns can be used to extract definitions and methods from scientific documents and can be tailored or extended to suit other applications.

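Summary
Automatically extracting definitions and methods from scientific documents is difficult in two respects: first, natural language text is complex and diverse, which calls for an analysis method that supports pattern discovery; second, a definition or method in a scientific paper is usually distributed across the text, so an effective approach must not only extract single-sentence definitions and methods but also integrate the sentences into a complete definition or method. This paper proposes an analysis method for discovering patterns of definitions and methods and applies it in experiments on a self-built dataset and two public definition datasets; the results show that the discovered patterns are effective. The patterns can be used to extract definitions and methods from scientific documents and can be tailored or extended as needed.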

How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain

paper_url: http://arxiv.org/abs/2307.00186
repo_url: https://github.com/toneli/rt-retrieving-and-thinking
paper_authors: Mingchen Li, Rui Zhang
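for: The paper provides a thorough study of language model (LM) performance in the medical domain and explores how LMs can improve NER performance.
methods: The paper runs extensive experiments on 16 NER models spanning 2018 to 2023 and introduces a simple and effective method called RT (Retrieving and Thinking) to improve NER performance.
results: Experiments show that LLMs outperform SLMs on few-shot medical NER tasks but still face challenges such as misidentification and wrong template prediction. The RT method significantly outperforms strong open baselines on two open medical benchmark datasets.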

Abstract
Recent advancements in language models (LMs) have led to the emergence of powerful models such as Small LMs (e.g., T5) and Large LMs (e.g., GPT-4). These models have demonstrated exceptional capabilities across a wide range of tasks, such as named entity recognition (NER) in the general domain. (We define SLMs as pre-trained models with fewer parameters compared to models like GPT-3/3.5/4, such as T5, BERT, and others.) Nevertheless, their efficacy in the medical domain remains uncertain, and medical NER always needs high accuracy because of the particularity of the field. This paper aims to provide a thorough investigation to compare the performance of LMs in medical few-shot NER and answer the question "How far are LMs from 100% few-shot NER in the medical domain?", and moreover to explore an effective entity recognizer to help improve the NER performance. Based on our extensive experiments conducted on 16 NER models spanning from 2018 to 2023, our findings clearly indicate that LLMs outperform SLMs in few-shot medical NER tasks, given the presence of suitable examples and appropriate logical frameworks. Despite the overall superiority of LLMs in few-shot medical NER tasks, it is important to note that they still encounter some challenges, such as misidentification and wrong template prediction. Building on previous findings, we introduce a simple and effective method called RT (Retrieving and Thinking), which serves as a retriever, finding relevant examples, and as a thinker, employing a step-by-step reasoning process. Experimental results show that our proposed RT framework significantly outperforms the strong open baselines on the two open medical benchmark datasets.

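Summary
Recent advances in language models (LMs) have produced powerful models such as small LMs (e.g., T5) and large LMs (e.g., GPT-4). These models perform strongly across many tasks, such as named entity recognition (NER) in the general domain. (We define SLMs as pre-trained models with fewer parameters than models such as GPT-3/3.5/4, for example T5 and BERT.) However, their effectiveness in the medical domain remains uncertain, and medical NER demands high accuracy because of the particular nature of the field. This paper thoroughly compares the performance of LMs on few-shot medical NER to determine how far LMs are from 100% few-shot NER in the medical domain, and explores an effective entity recognizer to improve NER performance. Based on extensive experiments on 16 NER models spanning 2018 to 2023, we find that LLMs outperform SLMs on few-shot medical NER tasks when given suitable examples and an appropriate logical framework. Despite this overall advantage, LLMs still face challenges such as misidentification and wrong template prediction. Building on these findings, we propose a simple and effective method called RT (Retrieving and Thinking), which acts as a retriever that finds relevant examples and as a thinker that follows a step-by-step reasoning process. Experimental results show that the proposed RT framework significantly outperforms strong open baselines on two open medical benchmark datasets.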

Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks

paper_url: http://arxiv.org/abs/2307.00175
repo_url: https://github.com/balevinstein/probes
paper_authors: B. A. Levinstein, Daniel A. Herrmann
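for: The paper examines whether large language models (LLMs) have beliefs and, if they do, how those beliefs might be measured.
methods: The paper evaluates two existing approaches, one proposed by Azaria and Mitchell (2023) and the other by Burns et al. (2022).
results: The paper provides empirical results showing that both methods fail to generalize in very basic ways. It then argues that, even if LLMs have beliefs, these methods are unlikely to succeed for conceptual reasons. There is therefore still no reliable lie detector for LLMs.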

Abstract
We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we evaluate two existing approaches, one due to Azaria and Mitchell (2023) and the other to Burns et al. (2022). We provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. After describing our empirical results we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. We conclude by suggesting some concrete paths for future work.

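Summary
We consider whether large language models (LLMs) have beliefs and, if they do, how we might measure them. First, we evaluate two existing approaches, one proposed by Azaria and Mitchell (2023) and the other by Burns et al. (2022). We provide empirical results showing that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to succeed for conceptual reasons. Thus, there is still no lie detector for LLMs. After describing our empirical results, we step back and consider whether we should expect LLMs to have something like beliefs in the first place. We examine some recent arguments that claim LLMs cannot have beliefs and show that these arguments are misguided. We offer a more productive framing of the questions surrounding the status of beliefs in LLMs, and we conclude by suggesting concrete paths for future work.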

What do self-supervised speech models know about words?

paper_url: http://arxiv.org/abs/2307.00162
repo_url: None
paper_authors: Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu
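for: The study investigates how different self-supervised speech models (S3Ms) encode linguistic information and whether these models capture word-level units.
methods: The study analyzes three S3Ms (wav2vec2, HuBERT, and WavLM) and uses canonical correlation analysis (CCA) to measure the similarity between layer-wise representations and word-level linguistic properties.
results: The study finds that the maximal word-level linguistic content tends to sit in intermediate model layers, while some lower-level information, such as pronunciation, is also retained in the higher layers of HuBERT and WavLM. Task performance shows similar layer-wise trends.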

Abstract
Many self-supervised speech models (S3Ms) have been introduced over the last few years, producing performance and data efficiency improvements for a variety of speech tasks. Evidence is emerging that different S3Ms encode linguistic information in different layers, and also that some S3Ms appear to learn phone-like sub-word units. However, the extent to which these models capture larger linguistic units, such as words, and where word-related information is encoded, remains unclear. In this study, we conduct several analyses of word segment representations extracted from different layers of three S3Ms: wav2vec2, HuBERT, and WavLM. We employ canonical correlation analysis (CCA), a lightweight analysis tool, to measure the similarity between these representations and word-level linguistic properties. We find that the maximal word-level linguistic content tends to be found in intermediate model layers, while some lower-level information like pronunciation is also retained in higher layers of HuBERT and WavLM. Syntactic and semantic word attributes have similar layer-wise behavior. We also find that, for all of the models tested, word identity information is concentrated near the center of each word segment. We then test the layer-wise performance of the same models, when used directly with no additional learned parameters, on several tasks: acoustic word discrimination, word segmentation, and semantic sentence similarity. We find similar layer-wise trends in performance, and furthermore, find that when using the best-performing layer of HuBERT or WavLM, it is possible to achieve performance on word segmentation and sentence similarity that rivals more complex existing approaches.

Summary

paper_url: http://arxiv.org/abs/2307.00135
repo_url: None
paper_authors: Vasilisa Bashlovkina, Riley Matthews, Zhaobin Kuang, Simon Baumgartner, Michael Bendersky
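for: The paper studies the ability of transformer-based language models (LMs) to understand social media language.
methods: The paper introduces a new benchmark, Social MedIa Language Evaluation (SMILE), to evaluate LM performance on social media language.
results: The study finds that learning a tokenizer and pretraining on a mix of social media and conventional language improves LM performance, beating the best similar-sized alternative by 4.2 points on the overall SMILE score.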

Abstract
We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.

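Summary
We study the ability of transformer-based language models (LMs) to understand social media language. Social media language differs from standard written language, yet existing benchmarks fall short of measuring LM performance in this important domain. We quantify the difference between social media language and conventional language and find it significant in both token distribution and rate of linguistic shift. We then introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four social media platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.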
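
One way to make "difference in token distribution" concrete is a divergence between unigram distributions. The abstract does not say which measure the authors use, so the Jensen-Shannon divergence below is an assumption chosen for illustration, as are the whitespace tokenizer and the function names.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def token_distribution(texts: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated, lowercased token."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint supports."""
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a: Dict[str, float]) -> float:
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0.0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
```

Computing `js_divergence` between a social media corpus and a conventional corpus gives a single number in [0, 1]. Comparing it with the divergence between two samples of the same corpus indicates whether the gap is larger than sampling noise.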

iMETRE: Incorporating Markers of Entity Types for Relation Extraction

paper_url: http://arxiv.org/abs/2307.00132
repo_url: None
paper_authors: N Harsha Vardhan, Manav Chaudhary
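for: The paper studies sentence-level relation extraction (RE) on the financial dataset REFinD.
methods: The paper uses typed entity marker representations and models fine-tuned on the dataset.
results: The approach reaches an F1 score of 69.65% on the validation set; the paper also discusses various approaches and possible limitations.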

Abstract
Sentence-level relation extraction (RE) aims to identify the relationship between 2 entities given a contextual sentence. While there have been many attempts to solve this problem, the current solutions have a lot of room to improve. In this paper, we approach the task of relationship extraction in the financial dataset REFinD. Our approach incorporates typed entity markers representations and various models finetuned on the dataset, which has allowed us to achieve an F1 score of 69.65% on the validation set. Through this paper, we discuss various approaches and possible limitations.

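Summary
Sentence-level relation extraction (RE) aims to identify the relationship between two entities given a contextual sentence. Although there have been many attempts to solve this problem, current solutions leave considerable room for improvement. In this paper, we address relation extraction on the financial dataset REFinD. Our approach uses typed entity marker representations and several models fine-tuned on the dataset, which allows us to reach an F1 score of 69.65% on the validation set. We also discuss various approaches and possible limitations.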
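
Typed entity markers wrap each entity mention with symbols that also spell out its type, so the encoder sees both the span and the type. The abstract does not give the exact marker format, so the punctuation scheme below (`@ * type *` for the subject, `# ^ type ^` for the object) is an assumption borrowed from common practice, and `add_typed_markers` is a hypothetical helper that assumes non-overlapping spans.

```python
from typing import List, Tuple

# (start_index, end_index_exclusive, entity_type)
Span = Tuple[int, int, str]


def add_typed_markers(tokens: List[str], subj: Span, obj: Span) -> List[str]:
    """Insert typed markers around the subject and object mentions.

    Assumes the two spans do not overlap; the marker symbols are illustrative.
    """
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i == subj[0]:
            out += ["@", "*", subj[2].lower(), "*"]   # open subject marker
        if i == obj[0]:
            out += ["#", "^", obj[2].lower(), "^"]    # open object marker
        out.append(tok)
        if i == subj[1] - 1:
            out.append("@")                           # close subject marker
        if i == obj[1] - 1:
            out.append("#")                           # close object marker
    return out
```

The marked sequence is then passed to a pretrained encoder for fine-tuning. Spelling the type out in words lets the model reuse what it already knows about terms such as "org" or "person".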

Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches

paper_url: http://arxiv.org/abs/2307.00130
repo_url: None
paper_authors: Shiyu Yuan, Carlo Lipizzi
for: The study investigates how document genre and length affect information extraction (IE) tasks in text processing.
methods: The study compares the two mainstream implementation approaches: heuristic-based searching and data-driven learning.
results: The study finds that extraction outcomes vary with document features and genre. It hypothesizes that short documents may yield better accuracy and that generic documents may yield better extraction outcomes because of limitations in training document genre. It also finds that different semantic roles reach different accuracy levels with the same method.

Abstract
Information extraction (IE) plays a very important role in natural language processing (NLP) and is fundamental to many NLP applications that extract structured information from unstructured text data. Heuristic-based searching and data-driven learning are the two mainstream implementation approaches. However, not much attention has been paid to the influence of document genre and length on IE tasks. To fill the gap, in this study, we investigated the accuracy and generalization abilities of heuristic-based searching and data-driven learning on two IE tasks, named entity recognition (NER) and semantic role labeling (SRL), on domain-specific and generic documents of different lengths. We posited two hypotheses: first, short documents may yield better accuracy results compared to long documents; second, generic documents may exhibit superior extraction outcomes relative to domain-dependent documents due to training document genre limitations. Our findings reveal that no single method demonstrated overwhelming performance in both tasks. For named entity extraction, data-driven approaches outperformed symbolic methods in terms of accuracy, particularly in short texts. In the case of semantic role extraction, we observed that the heuristic-based searching method and the data-driven model with syntax representation surpassed the performance of the pure data-driven approach, which considers only semantic information. Additionally, we discovered that different semantic roles exhibited varying accuracy levels with the same method. This study offers valuable insights for downstream text mining tasks, such as NER and SRL, when addressing various document features and genres.

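Summary
Information extraction (IE) plays a very important role in natural language processing (NLP) and is fundamental to many NLP applications that extract structured information from unstructured text. Heuristic-based searching and data-driven learning are the two mainstream implementation approaches. However, little attention has been paid to how document genre and length affect IE tasks. To fill this gap, we investigate the accuracy and generalization ability of heuristic-based searching and data-driven learning on two IE tasks, named entity recognition (NER) and semantic role labeling (SRL), over domain-specific and generic documents of different lengths. We posit two hypotheses: first, short documents may yield better accuracy than long documents; second, generic documents may yield better extraction outcomes than domain-specific documents because of limitations in the genre of the training documents. Our findings show that no single method performs overwhelmingly well on both tasks. For named entity extraction, data-driven approaches are more accurate than symbolic methods, particularly on short texts. For semantic role extraction, the heuristic-based searching method and the data-driven model with syntax representation outperform the purely data-driven approach, which considers only semantic information. We also find that different semantic roles reach different accuracy levels with the same method. This study offers useful insights for downstream text mining tasks such as NER and SRL when dealing with varied document features and genres.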

Meta-training with Demonstration Retrieval for Efficient Few-shot Learning

paper_url: http://arxiv.org/abs/2307.00119
repo_url: None
paper_authors: Aaron Mueller, Kanika Narang, Lambert Mathias, Qifan Wang, Hamed Firooz
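for: The paper aims to improve the generalization and efficiency of smaller models on few-shot tasks.
methods: The paper uses demonstration retrieval to supply semantically similar labeled demonstrations, strengthening the model's learning and generalization.
results: The approach outperforms a variety of parameter-efficient and retrieval-augmented few-shot methods on target tasks including SQuAD, QNLI, and TREC.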

Abstract
Large language models show impressive results on few-shot NLP tasks. However, these models are memory and computation-intensive. Meta-training allows one to leverage smaller models for few-shot generalization in a domain-general and task-agnostic manner; however, these methods alone result in models that may not have sufficient parameterization or knowledge to adapt quickly to a large variety of tasks. To overcome this issue, we propose meta-training with demonstration retrieval, where we use a dense passage retriever to retrieve semantically similar labeled demonstrations to each example for more varied supervision. By separating external knowledge from model parameters, we can use meta-training to train parameter-efficient models that generalize well on a larger variety of tasks. We construct a meta-training set from UnifiedQA and CrossFit, and propose a demonstration bank based on UnifiedQA tasks. To our knowledge, our work is the first to combine retrieval with meta-training, to use DPR models to retrieve demonstrations, and to leverage demonstrations from many tasks simultaneously, rather than randomly sampling demonstrations from the training set of the target task. Our approach outperforms a variety of targeted parameter-efficient and retrieval-augmented few-shot methods on QA, NLI, and text classification tasks (including SQuAD, QNLI, and TREC). Our approach can be meta-trained and fine-tuned quickly on a single GPU.

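Summary
Large language models show strong results on few-shot NLP tasks, but they are memory- and computation-intensive. Meta-training makes it possible to use smaller models for few-shot generalization, yet on its own it yields models that may lack the parameters or knowledge to adapt quickly to a wide variety of tasks. To address this, we propose meta-training with demonstration retrieval, using a dense passage retriever to retrieve labeled demonstrations that are semantically similar to each example, which provides more varied supervision. By separating external knowledge from model parameters, we can use meta-training to train parameter-efficient models that generalize well across a larger variety of tasks. We build a meta-training set from UnifiedQA and CrossFit and propose a demonstration bank based on UnifiedQA tasks. To our knowledge, this work is the first to combine retrieval with meta-training, to use DPR models to retrieve demonstrations, and to leverage demonstrations from many tasks simultaneously. Our approach performs strongly on QA, NLI, and text classification tasks (including SQuAD, QNLI, and TREC), and it can be meta-trained and fine-tuned quickly on a single GPU.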
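
The retrieval step can be sketched as nearest-neighbor search over a demonstration bank. The paper uses a dense passage retriever; the bag-of-words cosine similarity below is a deliberately simple stand-in so the example runs without any model, and the function names and the `input => label` prompt layout are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: token counts (a stand-in for a dense retriever encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_demonstrations(query: str, bank: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Return the k labeled demonstrations most similar to the query."""
    q = embed(query)
    ranked = sorted(bank, key=lambda d: cosine(q, embed(d["input"])), reverse=True)
    return ranked[:k]


def build_input(query: str, demos: List[Dict[str, str]]) -> str:
    """Prepend retrieved demonstrations to the query (layout is illustrative)."""
    lines = [f"{d['input']} => {d['label']}" for d in demos]
    lines.append(f"{query} =>")
    return "\n".join(lines)
```

Keeping the demonstrations in an external bank rather than in the model's weights matches the abstract's point about separating external knowledge from model parameters: the bank can grow or change without retraining the model.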

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

paper_url: http://arxiv.org/abs/2306.17842
repo_url: https://github.com/google-research/magvit
paper_authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
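for: The paper aims to enable frozen large language models (LLMs) to perform multimodal tasks, including image understanding and generation.
methods: The paper proposes the Semantic Pyramid AutoEncoder (SPAE), which converts images into interpretable lexical tokens (or words) that capture both semantic meaning and fine-grained details, so that an LLM can perform multimodal tasks.
results: In-context learning experiments with frozen PaLM 2 and GPT 3.5 validate SPAE: it enables a frozen LLM to generate image content while surpassing state-of-the-art performance on image understanding tasks, under the same setting, by over 25%.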

Abstract
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.

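Summary
In this work, we introduce the Semantic Pyramid AutoEncoder (SPAE), which enables frozen LLMs to perform understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) drawn from the LLM's vocabulary. These tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, translating visual content into a language the LLM can understand and enabling it to perform multimodal tasks. We validate the approach through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a variety of image understanding and generation tasks. Our method enables a frozen LLM to generate image content while surpassing state-of-the-art performance on image understanding tasks, under the same setting, by over 25%.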

Statler: State-Maintaining Language Models for Embodied Reasoning

paper_url: http://arxiv.org/abs/2306.17840
repo_url: https://github.com/statler-lm/statler-lm.github.io
paper_authors: Takuma Yoneda, Jiading Fang, Peng Li, Huanyu Zhang, Tianchong Jiang, Shengjie Lin, Ben Picker, David Yunis, Hongyuan Mei, Matthew R. Walter
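for: The paper aims to improve the ability of LLMs to reason over long time horizons so that robots can better complete complex tasks.
methods: The paper uses two LLM instances, a world-model reader and a world-model writer, to maintain an explicit representation of the world state.
results: On three simulated table-top manipulation domains and a real robot domain, the approach improves on the state of the art in LLM-based robot reasoning.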

Abstract
Large language models (LLMs) provide a promising tool that enables robots to perform complex robot reasoning tasks. However, the limited context window of contemporary LLMs makes reasoning over long time horizons difficult. Embodied tasks such as those that one might expect a household robot to perform typically require that the planner consider information acquired a long time ago (e.g., properties of the many objects that the robot previously encountered in the environment). Attempts to capture the world state using an LLM's implicit internal representation are complicated by the paucity of task- and environment-relevant information available in a robot's action history, while methods that rely on the ability to convey information via the prompt to the LLM are subject to its limited context window. In this paper, we propose Statler, a framework that endows LLMs with an explicit representation of the world state as a form of "memory" that is maintained over time. Integral to Statler is its use of two instances of general LLMs, a world-model reader and a world-model writer, that interface with and maintain the world state. By providing access to this world state "memory", Statler improves the ability of existing LLMs to reason over longer time horizons without the constraint of context length. We evaluate the effectiveness of our approach on three simulated table-top manipulation domains and a real robot domain, and show that it improves the state-of-the-art in LLM-based robot reasoning. Project website: https://statler-lm.github.io/

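Summary
Large language models (LLMs) are a promising tool for enabling robots to perform complex reasoning tasks. However, the limited context window of current LLMs makes reasoning over long time horizons difficult. Embodied tasks, such as those expected of a household robot, typically require the planner to consider information acquired long ago, for example the properties of the many objects the robot has encountered in the environment. Capturing the world state with an LLM's implicit internal representation is difficult because a robot's action history contains little task- and environment-relevant information, while methods that convey information to the LLM through the prompt are constrained by its context window. In this paper, we propose Statler, a framework that gives LLMs an explicit representation of the world state as a form of "memory" that is maintained over time. Statler uses two instances of general LLMs, a world-model reader and a world-model writer, that interface with and maintain the world state. With access to this world-state "memory", existing LLMs can reason over longer time horizons without being limited by context length. We evaluate the approach on three simulated table-top manipulation domains and a real robot domain and show that it improves on the state of the art in LLM-based robot reasoning. Project website: https://statler-lm.github.io/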
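
The reader/writer split can be sketched as a loop around an explicit state object. In Statler both roles are played by LLM instances; here they are replaced by tiny rule-based functions so the example runs on its own. The class name, the dictionary state format, and the `move X to Y` and `find X` command syntax are all hypothetical.

```python
from typing import Callable, Dict

State = Dict[str, str]  # object name -> location (an assumed state format)


class StateMaintainingAgent:
    """Keeps the world state outside the prompt history and updates it each step."""

    def __init__(
        self,
        reader: Callable[[State, str], str],
        writer: Callable[[State, str], State],
        initial_state: State,
    ) -> None:
        self.reader = reader    # stand-in for the world-model reader LLM
        self.writer = writer    # stand-in for the world-model writer LLM
        self.state = dict(initial_state)

    def step(self, instruction: str) -> str:
        action = self.reader(self.state, instruction)   # decide using current state
        self.state = self.writer(self.state, action)    # record the action's effect
        return action


def toy_reader(state: State, instruction: str) -> str:
    words = instruction.split()
    if words and words[0] == "move":
        return instruction                              # pass the command through
    if len(words) == 2 and words[0] == "find":
        return f"{words[1]} is at {state.get(words[1], 'unknown')}"
    return "noop"


def toy_writer(state: State, action: str) -> State:
    words = action.split()
    new_state = dict(state)
    if len(words) == 4 and words[0] == "move" and words[2] == "to":
        new_state[words[1]] = words[3]                  # object now at new location
    return new_state
```

Because only the current state is passed to the reader at each step, the input size stays constant regardless of how many steps came before. That is the property the abstract credits for reasoning over longer horizons.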

Meta-Reasoning: Semantics-Symbol Deconstruction For Large Language Models

paper_url: http://arxiv.org/abs/2306.17820
repo_url: None
paper_authors: Yiming Wang, Zhuosheng Zhang, Rui Wang
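for: The paper aims to improve the reasoning ability of large language models (LLMs).
methods: The paper uses the linguistic concept of symbols to simplify natural language, so that LLMs can learn general solutions to reasoning problems expressed with different semantics.
results: On symbolic reasoning tasks such as 7-step Tracking Shuffled Objects, GPT-3 (text-davinci-002) reaches over 99% accuracy with a single Meta-Reasoning demonstration, outperforming all current LLMs that use standard chain-of-thought prompting.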

Abstract
Symbolization methods in large language models (LLMs) have been shown effective to improve LLMs' reasoning ability. However, most of these approaches hinge on mapping natural languages to formal languages (e.g., Python, SQL) that are more syntactically complete and free of ambiguity. Although effective, they depart from the natural language itself and deviate from the habits of human thinking, and instead cater more to the execution mindset of computers. In contrast, we hope to simplify natural language by starting from the concept of symbols in linguistics itself, so that LLMs can learn the common formulation and general solution of reasoning problems wrapped in different natural semantics. From this consideration, we propose Meta-Reasoning, which allows LLMs to automatically accomplish semantic-symbol deconstruction, i.e., semantic resolution, to maximally reduce different questions of certain reasoning tasks to similar natural language representation, thus gaining the ability to learn by analogy and facilitating data-efficient in-context learning. Our experiments show that the Meta-Reasoning paradigm saliently enhances LLMs' reasoning performance with fewer demonstrations. They can learn not only reasoning chains but also general solutions to certain types of tasks. In particular, for symbolic reasoning tasks, such as 7-step Tracking Shuffled Objects, GPT-3 (text-davinci-002) achieves over 99% accuracy with only one Meta-Reasoning demonstration, outperforming all current LLMs with the standard chain-of-thought prompting.

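Summary
Symbolization methods for large language models (LLMs) have been shown to improve LLMs' reasoning ability. However, most of these approaches map natural language onto formal languages (e.g., Python, SQL) that are more syntactically complete and free of ambiguity. Although effective, they depart from natural language itself and cater more to the execution mindset of computers than to the habits of human thinking. In contrast, we hope to simplify natural language starting from the linguistic concept of symbols, so that LLMs can learn the common formulation and general solution of reasoning problems wrapped in different natural semantics. Based on this consideration, we propose Meta-Reasoning, which lets LLMs automatically perform semantic-symbol deconstruction (i.e., semantic resolution), reducing different questions of a given reasoning task to similar natural language representations as far as possible, thereby gaining the ability to learn by analogy and enabling data-efficient in-context learning. Our experiments show that Meta-Reasoning markedly improves LLMs' reasoning performance with fewer demonstrations. LLMs learn not only reasoning chains but also general solutions to certain types of tasks. For example, on symbolic reasoning tasks such as 7-step Tracking Shuffled Objects, GPT-3 (text-davinci-002) exceeds 99% accuracy with a single Meta-Reasoning demonstration, outperforming all current LLMs that use standard chain-of-thought prompting.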
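
To make the idea of reducing differently worded questions to one representation concrete, here is a small sketch. It is an illustration only: in the paper the deconstruction is performed by the LLM itself, whereas this sketch uses plain string replacement, and the `symbolize` function, the example questions, and the entity lists are all hypothetical.

```python
import string

def symbolize(text, entities):
    """Replace each listed entity with a capital-letter symbol (A, B, C, ...)."""
    mapping = {e: string.ascii_uppercase[i] for i, e in enumerate(entities)}
    # Replace longer names first so that a short name never clobbers a longer one.
    for entity in sorted(mapping, key=len, reverse=True):
        text = text.replace(entity, mapping[entity])
    return text, mapping

q1 = "Alice has the red ball and Bob has the blue ball. Alice and Bob swap."
q2 = "Claire has the green book and Dave has the black book. Claire and Dave swap."

s1, _ = symbolize(q1, ["Alice", "Bob", "the red ball", "the blue ball"])
s2, _ = symbolize(q2, ["Claire", "Dave", "the green book", "the black book"])
print(s1)
print(s2)
```

Both questions collapse to the same symbolic form, so a single demonstration of how to solve that form would apply to either wording.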

A Massive Scale Semantic Similarity Dataset of Historical English

paper_url: http://arxiv.org/abs/2306.17810
repo_url: None
paper_authors: Emily Silcock, Melissa Dell
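for: Building a massive-scale semantic similarity dataset from historical newspaper articles, spanning 70 years from 1920 to 1989 and containing nearly 400 million positive semantic similarity pairs.
methods: Exploits document layout and language understanding to associate articles with their headlines, then uses deep neural methods to detect which articles come from the same underlying source.
results: The resulting publicly available HEADLINES dataset covers a long time span and a large number of positive semantic similarity pairs, and can support many tasks, such as studying semantic change across space and time.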

Abstract
A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.

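Summary
A wide range of tasks use language models trained on semantic similarity data. Although many datasets capture semantic similarity, they are either built from modern web data or are relatively small datasets created by human annotators over the past decade. This study uses a novel source, newly digitized articles from off-copyright local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years, from 1920 to 1989, and containing nearly 400 million positive semantic similarity pairs. Historically, around half of the articles in U.S. local newspapers came from newswires such as the Associated Press. Local papers reproduced the newswire articles but wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles with their headlines by exploiting document layout and language understanding, then use deep neural methods to detect which articles come from the same underlying source despite substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer time span. It will facilitate applying contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.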
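
The last construction step, turning headlines of articles from the same source into positive pairs, can be sketched as follows. This assumes the clustering of articles by source has already been done; the cluster ids and headlines below are invented for illustration and are not from the dataset, and `positive_pairs` is a hypothetical helper name.

```python
from itertools import combinations

def positive_pairs(clusters):
    """Pair up the distinct headlines written for the same underlying article."""
    pairs = []
    for headlines in clusters.values():
        unique = sorted(set(headlines))  # identical headlines carry no signal
        pairs.extend(combinations(unique, 2))
    return pairs

# Hypothetical clusters: source-article id -> headlines local papers gave it.
clusters = {
    "wire-001": ["Storm Batters Coast", "Gale Hits Shoreline",
                 "Storm Batters Coast", "Coastal Towns Flooded"],
    "wire-002": ["Mayor Resigns"],
}
pairs = positive_pairs(clusters)
print(pairs)
```

A cluster with k distinct headlines yields k(k-1)/2 pairs, which is one reason a few hundred thousand widely reprinted articles could produce a very large number of pairs.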

Stay on topic with Classifier-Free Guidance

paper_url: http://arxiv.org/abs/2306.17806
repo_url: https://github.com/Vermeille/cfg-llm
paper_authors: Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, Stella Biderman
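for: Investigating whether Classifier-Free Guidance (CFG), a lightweight technique from text-to-image generation for improving prompt adherence, can be used in pure language modeling.
methods: Applies CFG as an inference-time technique across a range of tasks (Q&A, reasoning, code generation, and machine translation), achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B.
results: CFG improves the performance of Pythia, GPT-2 and LLaMA-family models by an amount equivalent to doubling the parameter count, and can be combined with other inference-time methods such as Chain-of-Thought and Self-Consistency for further gains on difficult tasks. CFG also increases the faithfulness and coherence of assistants on challenging prompts.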

Abstract
Classifier-Free Guidance (CFG) has recently emerged in text-to-image generation as a lightweight technique to encourage prompt-adherence in generations. In this work, we demonstrate that CFG can be used broadly as an inference-time technique in pure language modeling. We show that CFG (1) improves the performance of Pythia, GPT-2 and LLaMA-family models across an array of tasks: Q&A, reasoning, code generation, and machine translation, achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B; (2) brings improvements equivalent to a model with twice the parameter-count; (3) can stack alongside other inference-time methods like Chain-of-Thought and Self-Consistency, yielding further improvements in difficult tasks; (4) can be used to increase the faithfulness and coherence of assistants in challenging form-driven and content-driven prompts: in a human evaluation we show a 75% preference for GPT4All using CFG over baseline.

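Summary
Classifier-Free Guidance (CFG) recently emerged in text-to-image generation as a lightweight technique for encouraging prompt adherence. In this work, we show that CFG can be used broadly as an inference-time technique in pure language modeling. We find that CFG: (1) improves the performance of Pythia, GPT-2 and LLaMA-family models across a range of tasks (Q&A, reasoning, code generation, and machine translation), achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B; (2) brings improvements equivalent to a model with twice the parameter count; (3) can be stacked with other inference-time methods such as Chain-of-Thought and Self-Consistency, yielding further improvements on difficult tasks; (4) can increase the faithfulness and coherence of assistants on challenging form-driven and content-driven prompts: in a human evaluation, GPT4All with CFG is preferred over the baseline 75% of the time.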
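
As a rough illustration of how CFG acts on a language model at inference time, the sketch below combines next-token log-probabilities computed with and without the prompt. It assumes the commonly used form log p_cfg = log p_uncond + gamma * (log p_cond - log p_uncond), renormalized over the vocabulary; the three-token logits and the function names are made up, and this is not the code from the linked repository.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Guided next-token log-probabilities.

    gamma = 1 recovers the prompt-conditioned distribution, gamma = 0 the
    unconditional one, and gamma > 1 pushes further toward the prompt.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)

cond_logits = [2.0, 1.0, 0.0]    # hypothetical logits given the prompt
uncond_logits = [1.0, 1.0, 1.0]  # hypothetical logits without the prompt
guided = cfg_log_probs(cond_logits, uncond_logits, gamma=1.5)
print([round(math.exp(lp), 3) for lp in guided])
```

In this toy case the token the prompt favours gains probability mass as gamma rises above 1, at the cost of two forward passes per decoding step.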

Voting-based Multimodal Automatic Deception Detection

paper_url: http://arxiv.org/abs/2307.07516
repo_url: None
paper_authors: Lana Touma, Mohammad Al Horani, Manar Tailouni, Anas Dahabiah, Khloud Al Jallad
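for: Proposing a voting-based method for automatic deception detection that applies machine learning and deep learning to visual, audio and text features from videos.
methods: The voting scheme combines three models: a CNN that detects deception from images, an SVM that detects deception from audio, and a Word2Vec-SVM model that detects deception from text.
results: The proposed voting method outperforms the state of the art on two datasets; on the Real-Life Trial dataset, the best results on images, audio and text are 97%, 96% and 92%, respectively.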

Abstract
Automatic Deception Detection has been a hot research topic for a long time; using machine learning and deep learning to automatically detect deception brings new light to this old field. In this paper, we propose a voting-based method for automatic deception detection from videos using audio, visual and lexical features. Experiments were done on two datasets, the Real-life trial dataset by Michigan University and the Miami University deception detection dataset. Video samples were split into frames of images, audio, and manuscripts. Our Voting-based Multimodal proposed solution consists of three models. The first model is CNN for detecting deception from images, the second model is Support Vector Machine (SVM) on Mel spectrograms for detecting deception from audio and the third model is Word2Vec on Support Vector Machine (SVM) for detecting deception from manuscripts. Our proposed solution outperforms state of the art. Best results achieved on images, audio and text were 97%, 96%, 92% respectively on Real-Life Trial Dataset, and 97%, 82%, 73% on video, audio and text respectively on Miami University Deception Detection.

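Summary
Automatic deception detection has long been an active research topic, and using machine learning and deep learning to detect deception automatically brings new light to this old field. In this paper, we propose a voting-based multimodal method for automatic deception detection using visual, audio and text features. The proposed solution consists of three models: a Convolutional Neural Network (CNN) for detecting deception from images, a Support Vector Machine (SVM) on Mel spectrograms for detecting deception from audio, and Word2Vec with an SVM for detecting deception from text. The proposed solution outperforms the state of the art: the best results on images, audio and text were 97%, 96% and 92% respectively on the Real-Life Trial Dataset, and 97%, 82% and 73% on video, audio and text respectively on the Miami University Deception Detection dataset.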
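
The abstract does not spell out the voting rule, so the sketch below assumes the simplest option: a hard majority vote over the labels produced by the three per-modality models. The label strings and the `majority_vote` helper are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Hypothetical per-modality outputs for one video clip.
clip_predictions = {
    "image_cnn": "deceptive",
    "audio_svm": "truthful",
    "text_word2vec_svm": "deceptive",
}
decision = majority_vote(list(clip_predictions.values()))
print(decision)
```

With three voters and two classes a strict majority always exists, so no tie-breaking rule is needed in this setting.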

Towards Improving the Performance of Pre-Trained Speech Models for Low-Resource Languages Through Lateral Inhibition

paper_url: http://arxiv.org/abs/2306.17792
repo_url: None
paper_authors: Andrei-Marius Avram, Răzvan-Alexandru Smădu, Vasile Păiş, Dumitru-Clementin Cercel, Radu Ion, Dan Tufiş
for: Improving the performance of pre-trained speech models, particularly for low-resource languages.
methods: Replaces the fine-tuning dense layer with a lateral inhibition layer inspired by the biological process.
results: Experiments on Romanian show an average improvement of 12.5% in word error rate (WER). The approach also reaches state-of-the-art results on the Romanian Speech Corpus and the Robin Technical Acquisition Corpus, with WERs of 1.78% and 29.64%, respectively.

Abstract
With the rise of bidirectional encoder representations from Transformer models in natural language processing, the speech community has adopted some of their development methodologies. Therefore, the Wav2Vec models were introduced to reduce the data required to obtain state-of-the-art results. This work leverages this knowledge and improves the performance of the pre-trained speech models by simply replacing the fine-tuning dense layer with a lateral inhibition layer inspired by the biological process. Our experiments on Romanian, a low-resource language, show an average improvement of 12.5% word error rate (WER) using the lateral inhibition layer. In addition, we obtain state-of-the-art results on both the Romanian Speech Corpus and the Robin Technical Acquisition Corpus with 1.78% WER and 29.64% WER, respectively.

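Summary
With the rise of bidirectional encoder representations from Transformer models in natural language processing, the speech community has adopted some of their development methodologies. The Wav2Vec models were therefore introduced to reduce the amount of data needed to obtain state-of-the-art results. This work builds on that knowledge and improves the performance of pre-trained speech models by simply replacing the fine-tuning dense layer with a lateral inhibition layer inspired by the biological process. Our experiments on Romanian, a low-resource language, show that the lateral inhibition layer yields an average improvement of 12.5% in word error rate (WER). In addition, we obtain state-of-the-art results on both the Romanian Speech Corpus and the Robin Technical Acquisition Corpus, with 1.78% WER and 29.64% WER, respectively.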
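
All of the results above are reported as word error rate. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The example sentences are invented, and this is not the paper's evaluation code.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sat on mat")
print(f"{score:.3f}")
```

Lower is better, and because insertions count as errors the value can exceed 100% when the hypothesis is much longer than the reference.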

Should you marginalize over possible tokenizations?

paper_url: http://arxiv.org/abs/2306.17757
repo_url: https://github.com/naver/marginalization
paper_authors: Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman
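for: Examining whether the usual way of computing string probabilities with autoregressive language models (LMs), which ignores alternative tokenizations, is justified.
methods: The authors devise an importance-sampling-based algorithm that estimates the marginal probability over tokenizations, and compare it with the default procedure.
results: In most cases, ignoring the marginalization costs little (the log-likelihood gap is no larger than 0.5%), but the gap becomes more pronounced on data with long, complex words.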

Abstract
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.

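Summary
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of a character string (e.g., an English sentence) is to first convert it into a sequence of tokens, which the model then scores. However, exponentially many token sequences represent any given string, so truly computing the probability of a string requires marginalizing over all of its tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring this marginalization is justified. To this end, we devise an importance-sampling-based algorithm that estimates the marginal probabilities, and we compare those estimates with the default procedure across a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but becomes more pronounced for data with long, complex words.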
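
The difference between scoring one tokenization and marginalizing over all of them can be seen on a toy example. The sketch below makes several simplifying assumptions: a seven-token vocabulary with context-independent token probabilities stands in for an autoregressive LM, and the proposal picks uniformly among the tokens that match the remaining text, which is not the proposal used in the paper. All names and numbers are hypothetical.

```python
import math
import random

# Toy token probabilities (a stand-in for a real LM's scores).
VOCAB = {"u": 0.1, "n": 0.1, "d": 0.1, "o": 0.1, "un": 0.2, "do": 0.2, "undo": 0.2}

def tokenizations(s):
    """Enumerate every way to split `s` into vocabulary tokens."""
    if not s:
        return [[]]
    out = []
    for tok in VOCAB:
        if s.startswith(tok):
            out.extend([tok] + rest for rest in tokenizations(s[len(tok):]))
    return out

def seq_prob(tokens):
    """Probability the toy model assigns to one token sequence."""
    return math.prod(VOCAB[t] for t in tokens)

def exact_marginal(s):
    """Sum over all tokenizations (only feasible for tiny examples)."""
    return sum(seq_prob(t) for t in tokenizations(s))

def importance_estimate(s, n_samples, rng):
    """Estimate the marginal as the average of p(t) / q(t) for t drawn from q."""
    total = 0.0
    for _ in range(n_samples):
        rest, tokens, q = s, [], 1.0
        while rest:
            choices = [t for t in VOCAB if rest.startswith(t)]
            tok = rng.choice(choices)
            q /= len(choices)  # uniform proposal over matching tokens
            tokens.append(tok)
            rest = rest[len(tok):]
        total += seq_prob(tokens) / q
    return total / n_samples

canonical = seq_prob(["undo"])  # the single default tokenization
marginal = exact_marginal("undo")
estimate = importance_estimate("undo", 20000, random.Random(0))
print(canonical, marginal, estimate)
```

In this toy the default tokenization carries most but not all of the mass, and the sampled estimate approaches the exact sum without enumerating every split, which is what makes the approach usable when enumeration is intractable.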

2023-07-01

Improving Text Matching in E-Commerce Search with A Rationalizable, Intervenable and Fast Entity-Based Relevance Model

BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer

Improving Multitask Retrieval by Promoting Task Specialization

Single Sequence Prediction over Reasoning Graphs for Multi-hop QA

Let Me Teach You: Pedagogical Foundations of Feedback for Language Models

Discovering Patterns of Definitions and Methods from Scientific Documents

How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain

Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks

What do self-supervised speech models know about words?

SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding

iMETRE: Incorporating Markers of Entity Types for Relation Extraction

Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches

Meta-training with Demonstration Retrieval for Efficient Few-shot Learning

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Statler: State-Maintaining Language Models for Embodied Reasoning

Meta-Reasoning: Semantics-Symbol Deconstruction For Large Language Models

A Massive Scale Semantic Similarity Dataset of Historical English

Stay on topic with Classifier-Free Guidance

Voting-based Multimodal Automatic Deception Detection

Towards Improving the Performance of Pre-Trained Speech Models for Low-Resource Languages Through Lateral Inhibition

Should you marginalize over possible tokenizations?