results: achieves a harmonic mean topic purity of 0.74, higher than baseline methods, along with more readable topic descriptions and natural-language labels.
Abstract
Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require "reading the tea leaves" to interpret; additionally, they offer users minimal semantic control over topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics within a provided text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: for example, it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also more interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. TopicGPT can be further extended to hierarchical topical modeling, enabling users to explore topics at various levels of granularity. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling.
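The harmonic mean purity cited above (0.74 vs. 0.64) is not defined in this excerpt; a common formulation takes the harmonic mean of purity and inverse purity between the induced topic assignments and gold labels. A minimal sketch under that assumption, with assignments given as integer label lists:

```python
from collections import Counter

def purity(pred, gold):
    """Fraction of documents whose predicted cluster's majority gold label matches."""
    clusters = {}
    for p, g in zip(pred, gold):
        clusters.setdefault(p, []).append(g)
    return sum(Counter(m).most_common(1)[0][1] for m in clusters.values()) / len(gold)

def harmonic_mean_purity(pred, gold):
    """Harmonic mean of purity and inverse purity (one assumed formulation)."""
    p, ip = purity(pred, gold), purity(gold, pred)
    return 2 * p * ip / (p + ip)

# harmonic_mean_purity([0, 0, 1, 1], [0, 0, 1, 0]) -> 0.75
```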
Server-side Rescoring of Spoken Entity-centric Knowledge Queries for Virtual Assistants
results: Experiments show that integrating various server-side LMs yields WER improvements of 23%-35%, and that fusing multiple server-side LMs most effectively combines the complementary strengths of each model and the knowledge learned from domain-specific data.
Abstract
On-device Virtual Assistants (VAs) powered by Automatic Speech Recognition (ASR) require effective knowledge integration for the challenging entity-rich query recognition. In this paper, we conduct an empirical study of modeling strategies for server-side rescoring of spoken information domain queries using various categories of Language Models (LMs) (N-gram word LMs, sub-word neural LMs). We investigate the combination of on-device and server-side signals, and demonstrate significant WER improvements of 23%-35% on various entity-centric query subpopulations by integrating various server-side LMs compared to performing ASR on-device only. We also perform a comparison between LMs trained on domain data and a GPT-3 variant offered by OpenAI as a baseline. Furthermore, we also show that model fusion of multiple server-side LMs trained from scratch most effectively combines complementary strengths of each model and integrates knowledge learned from domain-specific data to a VA ASR system.
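The server-side rescoring described above amounts to re-ranking the on-device ASR n-best list with interpolated external LM scores. A minimal sketch, with hypothetical scoring callables standing in for the actual LMs:

```python
def rescore_nbest(nbest, lms, weights):
    """Re-rank ASR hypotheses by adding weighted server-side LM scores.

    nbest:   list of (hypothesis_text, on_device_score) pairs
    lms:     callables mapping text -> log-probability (hypothetical)
    weights: interpolation weight per LM, tuned on held-out data
    """
    def total(hyp, device_score):
        return device_score + sum(w * lm(hyp) for lm, w in zip(lms, weights))
    return max(nbest, key=lambda h: total(h[0], h[1]))[0]

# best = rescore_nbest(nbest, [ngram_lm, neural_lm], [0.3, 0.5])
```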
Can Language Models Be Tricked by Language Illusions? Easier with Syntax, Harder with Semantics
paper_authors: Yuhan Zhang, Edward Gibson, Forrest Davis
for: This study examines whether language models (LMs) can mimic human language-processing behavior.
methods: The study probes LMs with three language illusions: the comparative illusion (e.g., "More people have been to Russia than I have"), the depth-charge illusion (e.g., "No head injury is too trivial to be ignored"), and the negative polarity item (NPI) illusion (e.g., "The hunter who no villager believed to be trustworthy will ever shoot a bear").
results: LM probabilities were more likely to align with human judgments of being "tricked" by the NPI illusion than by the comparative and depth-charge illusions; none of the LMs or metrics yielded results entirely consistent with human behavior. These results indicate that LMs' language-processing abilities are limited and do not fully mimic human behavior.
Abstract
Language models (LMs) have been argued to overlap substantially with human beings in grammaticality judgment tasks. But when humans systematically make errors in language processing, should we expect LMs to behave like cognitive models of language and mimic human behavior? We answer this question by investigating LMs' more subtle judgments associated with "language illusions" -- sentences that are vague in meaning, implausible, or ungrammatical but receive unexpectedly high acceptability judgments by humans. We looked at three illusions: the comparative illusion (e.g. "More people have been to Russia than I have"), the depth-charge illusion (e.g. "No head injury is too trivial to be ignored"), and the negative polarity item (NPI) illusion (e.g. "The hunter who no villager believed to be trustworthy will ever shoot a bear"). We found that probabilities represented by LMs were more likely to align with human judgments of being "tricked" by the NPI illusion which examines a structural dependency, compared to the comparative and the depth-charge illusions which require sophisticated semantic understanding. No single LM or metric yielded results that are entirely consistent with human behavior. Ultimately, we show that LMs are limited both in their construal as cognitive models of human language processing and in their capacity to recognize nuanced but critical information in complicated language materials.
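The probability comparisons described above can be reproduced by scoring each illusion sentence and a matched control under a causal LM; a sketch using Hugging Face transformers (the model choice is illustrative, not the paper's):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.size(1) - 1)

illusion = "More people have been to Russia than I have."
control = "More people have been to Russia than I expected."
# The LM is "tricked" if it scores the illusion comparably to the control.
print(sentence_logprob(illusion), sentence_logprob(control))
```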
GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks
results: GPT-4V shows strong agreement with human evaluations across a variety of tasks, suggesting it can serve as an automatic evaluator for multimodal tasks. Despite limitations such as restricted visual-clarity grading and complex real-world reasoning, its ability to provide human-aligned scores with detailed explanations shows its promise for multimodal LLMs as evaluators.
Abstract
Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments due to limitations in accounting for fine-grained details. Although GPT-4V has shown promising results in various multi-modal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensively validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment. We employ two evaluation methods, single-answer grading and pairwise comparison, using GPT-4V. Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators. Despite limitations like restricted visual clarity grading and real-world complex reasoning, its ability to provide human-aligned scores enriched with detailed explanations is promising for universal automatic evaluator.
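A pairwise comparison of the kind described above can be issued as a single multimodal prompt; a sketch against the OpenAI chat API (the prompt wording and model name are illustrative assumptions, not the paper's setup):

```python
from openai import OpenAI

client = OpenAI()

def pairwise_judge(image_url: str, caption_a: str, caption_b: str) -> str:
    """Ask a GPT-4V-class model which caption better matches the image."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                 f"Which caption better describes the image? Answer 'A' or 'B', "
                 f"then explain.\nA: {caption_a}\nB: {caption_b}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content
```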
The Effect of Scaling, Retrieval Augmentation and Form on the Factual Consistency of Language Models
results: We find that different components of the Atlas model make different contributions to consistency, and that different language models exhibit different consistency issues across tasks. For all evaluated language models, syntactic form and other evaluation-task artifacts affect consistency.
Abstract
Large Language Models (LLMs) make natural interfaces to factual knowledge, but their usefulness is limited by their tendency to deliver inconsistent answers to semantically equivalent questions. For example, a model might predict both "Anne Redpath passed away in Edinburgh." and "Anne Redpath's life ended in London." In this work, we identify potential causes of inconsistency and evaluate the effectiveness of two mitigation strategies: up-scaling and augmenting the LM with a retrieval corpus. Our results on the LLaMA and Atlas models show that both strategies reduce inconsistency while retrieval augmentation is considerably more efficient. We further consider and disentangle the consistency contributions of different components of Atlas. For all LMs evaluated we find that syntactical form and other evaluation task artifacts impact consistency. Taken together, our results provide a better understanding of the factors affecting the factual consistency of language models.
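Consistency in this sense can be quantified as pairwise agreement between the answers a model gives to semantically equivalent questions; a minimal sketch (the paper's exact metric may differ):

```python
from itertools import combinations

def consistency(answers):
    """Fraction of answer pairs that agree across equivalent question phrasings."""
    pairs = list(combinations(answers, 2))
    agree = sum(a.strip().lower() == b.strip().lower() for a, b in pairs)
    return agree / len(pairs) if pairs else 1.0

# consistency(["Edinburgh", "edinburgh", "London"]) -> 1/3
```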
FlashDecoding++: Faster Large Language Model Inference on GPUs
results: Achieves speedups of up to 4.86x and 2.18x over Hugging Face implementations on NVIDIA and AMD GPUs, and an average 1.37x speedup over state-of-the-art LLM inference engines.
Abstract
Large Language Models (LLMs) are becoming increasingly important in various domains, yet the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ~20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes dataflow using different hardware resource considering input dynamics. Due to the versatility of optimizations in FlashDecoding++, FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs compared to Hugging Face implementations. FlashDecoding++ also achieves an average speedup of 1.37x compared to state-of-the-art LLM inference engines on mainstream LLMs.
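The asynchronized-softmax idea can be illustrated in NumPy: safe softmax normally subtracts the global max, a synchronization point across partial results, whereas a unified max value phi, chosen in advance as an upper bound on the logits, lets each partial block exponentiate independently. A sketch under that assumption:

```python
import numpy as np

def partial_softmax_unified_max(logit_blocks, phi):
    """Softmax over concatenated blocks without synchronizing on the global max.

    phi must upper-bound the logits (e.g., a statistical bound), so exp() cannot
    overflow; each block's exponentials and partial sum are computed independently.
    """
    exps = [np.exp(block - phi) for block in logit_blocks]  # no cross-block sync
    total = sum(e.sum() for e in exps)                      # single cheap reduction
    return np.concatenate([e / total for e in exps])

blocks = [np.array([1.0, 2.0]), np.array([3.0, 0.5])]
x = np.concatenate(blocks)
ref = np.exp(x - x.max()); ref /= ref.sum()                 # standard safe softmax
assert np.allclose(partial_softmax_unified_max(blocks, phi=4.0), ref)
```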
Finding Common Ground: Annotating and Predicting Common Ground in Spoken Conversations
paper_authors: Magdalena Markowska, Mohammad Taghizadeh, Adil Soubki, Seyed Abolghasem Mirroshandel, Owen Rambow
for: This paper is written for researchers and scientists in the field of cognitive science and natural language processing.
methods: The paper introduces a new annotation and corpus to capture common ground, and describes initial experiments extracting propositions from dialog and tracking their status in the common ground from the perspective of each speaker.
results: Initial experiments demonstrate that propositions can be extracted from dialog and their status in the common ground tracked from the perspective of each speaker.
Abstract
When we communicate with other humans, we do not simply generate a sequence of words. Rather, we use our cognitive state (beliefs, desires, intentions) and our model of the audience's cognitive state to create utterances that affect the audience's cognitive state in the intended manner. An important part of cognitive state is the common ground, which is the content the speaker believes, and the speaker believes the audience believes, and so on. While much attention has been paid to common ground in cognitive science, there has not been much work in natural language processing. In this paper, we introduce a new annotation and corpus to capture common ground. We then describe some initial experiments extracting propositions from dialog and tracking their status in the common ground from the perspective of each speaker.
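The nested structure of common ground ("the speaker believes, and the speaker believes the audience believes, and so on") suggests tracking a per-speaker belief status for each proposition. A minimal data-structure sketch; the status labels are illustrative, not the paper's annotation scheme:

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    text: str
    # Per-speaker status, e.g. "believes", "believes_other_believes", "unknown"
    status: dict = field(default_factory=dict)

common_ground = []
p = Proposition("Anne moved to Boston in 2019")
p.status["A"] = "believes"                 # A asserted the proposition
p.status["B"] = "believes_other_believes"  # B heard it but has not accepted it
common_ground.append(p)
```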
People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection
paper_authors: Indira Sen, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil van der Aalst, Claudia Wagner
for: The paper aims to improve the robustness of NLP models to spurious features by automating the generation of Counterfactually Augmented Data (CADs).
methods: The authors use three generative NLP models - Polyjuice, ChatGPT, and Flan-T5 - to automatically generate CADs and evaluate their effectiveness in improving model robustness compared to manually generated CADs.
results: Manually generated CADs remain the most effective, with CADs generated by ChatGPT a close second. However, the changes introduced by the automated methods are often insufficient to flip the original label, which limits their performance.
Abstract
NLP models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. Therefore, it is imperative that these models are robust to spurious features. Past work has attempted to tackle such spurious features using training data augmentation, including Counterfactually Augmented Data (CADs). CADs introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. However, manually generating CADs can be time-consuming and expensive. Hence in this work, we assess if this task can be automated using generative NLP models. We automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate their usefulness in improving model robustness compared to manually-generated CADs. By testing both model performance on multiple out-of-domain test sets and individual data point efficacy, our results show that while manual CADs are still the most effective, CADs generated by ChatGPT come a close second. One key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.
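A key failure mode identified above is that automated edits often fail to flip the label, so a practical pipeline pairs an LLM editor with a verifier classifier. A sketch with hypothetical helper functions:

```python
def generate_cads(texts, labels, edit_with_llm, classify):
    """Keep only LLM edits that actually flip the original label.

    edit_with_llm(text, target_label): hypothetical minimal-edit generator
    classify(text): hypothetical trained classifier returning a label
    """
    cads = []
    for text, label in zip(texts, labels):
        target = 1 - label  # binary task assumed
        edited = edit_with_llm(text, target)
        if classify(edited) == target:  # verify the flip before keeping the CAD
            cads.append((edited, target))
    return cads
```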
A Study of Continual Learning Under Language Shift
results: Forward transfer is largely positive and almost independent of language order, whereas backward transfer can be positive or negative depending on the order and characteristics of the new languages and on the learning rate scheduler.
Abstract
The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. In this paper, we study the benefits and downsides of updating a language model when new data comes from new languages - the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Norwegian and Icelandic to investigate how forward and backward transfer effects depend on the pre-training order and characteristics of languages, for different model sizes and learning rate schedulers. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be either positive or negative depending on the order and characteristics of new languages. To explain these patterns we explore several language similarity metrics and find that syntactic similarity appears to have the best correlation with our results.
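Forward and backward transfer are not defined in this excerpt; one common formalization, adapted here to language-level perplexity as an assumption, compares performance on a language across training stages:

```latex
% \theta_k: model after training stage k (\theta_0 initial, \theta_T final);
% \mathrm{PPL}_i(\theta): perplexity of \theta on language i (lower is better).
\mathrm{BWT}_i = \mathrm{PPL}_i(\theta_i) - \mathrm{PPL}_i(\theta_T)
  \quad (>0 \text{: later languages helped language } i)
\qquad
\mathrm{FWT}_j = \mathrm{PPL}_j(\theta_0) - \mathrm{PPL}_j(\theta_{j-1})
  \quad (>0 \text{: earlier languages help language } j)
```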
CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL
results: Using hallucination improves schema retrieval for Text-to-SQL generation, and the proposed method achieves higher recall than existing state-of-the-art retrieval-based augmentation methods.
Abstract
Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual elements. In response, we propose a two-stage process for effective coverage during retrieval. First, we instruct an LLM to hallucinate a minimal DB schema deemed adequate to answer the query. We use the hallucinated schema to retrieve a subset of the actual schema, by composing the results from multiple dense retrievals. Remarkably, hallucination -- generally considered a nuisance -- turns out to be actually useful as a bridging mechanism. Since no benchmarks exist for schema subsetting on large databases, we introduce three. Two semi-synthetic datasets are derived from the union of schemas in two well-known datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements respectively. A real-life benchmark called SocialDB is sourced from an actual large data warehouse comprising 17844 schema elements. We show that our method leads to significantly higher recall than SOTA retrieval-based augmentation methods.
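The two-stage process (hallucinate a minimal schema, then use its elements as retrieval probes) can be sketched as follows; the prompt wording, embedding function, and index API are assumptions:

```python
def retrieve_schema(question, llm, embed, index, k=5):
    """CRUSH4SQL-style sketch: hallucinated schema elements as retrieval probes.

    llm(prompt):          hypothetical LLM call returning text
    embed(text):          hypothetical text -> vector encoder
    index.search(vec, k): hypothetical dense index over real schema elements
    """
    prompt = ("Hallucinate a minimal database schema (table.column names) "
              f"adequate to answer: {question}")
    hallucinated = llm(prompt).split(",")  # e.g. "player.name, match.score"
    retrieved = set()
    for element in hallucinated:
        # Compose results from one dense retrieval per hallucinated element.
        retrieved.update(index.search(embed(element.strip()), k))
    return retrieved
```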
ACES: Translation Accuracy Challenge Sets at WMT 2023
results: The study finds that 1) there is no clear winner among the metrics submitted to WMT 2023, and 2) performance change between the 2023 and 2022 versions of the metrics is highly variable. Metric developers should focus on building ensembles of metrics from different design families, developing metrics that pay more attention to the source and rely less on surface-level overlap, and carefully determining the influence of multilingual embeddings on MT evaluation.
Abstract
We benchmark the performance of segmentlevel metrics submitted to WMT 2023 using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs. The phenomena range from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison. We also measure the incremental performance of the metrics submitted to both WMT 2023 and 2022. We find that 1) there is no clear winner among the metrics submitted to WMT 2023, and 2) performance change between the 2023 and 2022 versions of the metrics is highly variable. Our recommendations are similar to those from WMT 2022. Metric developers should focus on: building ensembles of metrics from different design families, developing metrics that pay more attention to the source and rely less on surface-level overlap, and carefully determining the influence of multilingual embeddings on MT evaluation.
Predicting Question-Answering Performance of Large Language Models through Semantic Consistency
results: Evaluated on five contemporary LLMs, the framework significantly outperforms baselines at predicting LLM question-answering performance, demonstrating encouraging results.
Abstract
Semantic consistency of a language model is broadly defined as the model's ability to produce semantically-equivalent outputs, given semantically-equivalent inputs. We address the task of assessing question-answering (QA) semantic consistency of contemporary large language models (LLMs) by manually creating a benchmark dataset with high-quality paraphrases for factual questions, and release the dataset to the community. We further combine the semantic consistency metric with additional measurements suggested in prior work as correlating with LLM QA accuracy, for building and evaluating a framework for factual QA reference-less performance prediction -- predicting the likelihood of a language model to accurately answer a question. Evaluating the framework on five contemporary LLMs, we demonstrate encouraging, significantly outperforming baselines, results.
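The reference-less predictor described above can be realized by feeding the consistency score, together with other measurements known to correlate with QA accuracy, into a simple classifier; a sketch using scikit-learn (the feature choice is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per question: [semantic_consistency, mean_answer_logprob]
X_train = np.array([[0.9, -1.2], [0.3, -4.5], [0.7, -2.0], [0.1, -5.1]])
y_train = np.array([1, 0, 1, 0])  # 1 = the LLM answered this question correctly

predictor = LogisticRegression().fit(X_train, y_train)

# Probability that the LLM answers a new question correctly, with no reference
p_correct = predictor.predict_proba(np.array([[0.8, -1.5]]))[0, 1]
```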
Chinesewebtext: Large-scale high-quality Chinese web text extracted with effective evaluation model
methods: We propose a new complete tool-chain, EvalWeb, for extracting high-quality Chinese texts from noisy web data. Hand-crafted rules first discard explicitly noisy text from the raw crawled web content; an effective evaluation model then scores the remaining relatively clean data, assigning each text a quality score.
results: Using the EvalWeb tool-chain, we extracted 1.42 TB of high-quality Chinese text from noisy web data, with a quality score attached to each text. We also release a much cleaner 600 GB subset of Chinese data whose quality exceeds 90%.
Abstract
During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate the research of LLMs, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpus focus mainly on English, and there is still lack of complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information of the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicit noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, we can easily utilize an appropriate threshold to select the high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text ChineseWebText, which consists of 1.42 TB and each text is associated with a quality score, facilitating the LLM researchers to choose the data according to the desired quality thresholds. We also release a much cleaner subset of 600 GB Chinese data with the quality exceeding 90%.
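The pipeline reduces to rule filtering followed by model scoring and thresholding; a minimal sketch where the rules and quality model are placeholders, not the released tool-chain:

```python
def evalweb_pipeline(raw_texts, quality_model, threshold=0.9):
    """Two-stage filtering sketch: hand-crafted rules, then a learned quality score."""
    def passes_rules(text):
        # Placeholder rules: drop very short texts and obvious HTML boilerplate.
        return len(text) > 50 and "<html" not in text.lower()

    scored = [(t, quality_model(t)) for t in raw_texts if passes_rules(t)]
    # Every surviving text keeps its score; a threshold selects the clean subset.
    return [(t, s) for t, s in scored if s >= threshold]
```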
Noise-Robust Fine-Tuning of Pretrained Language Models via External Guidance
methods: The paper proposes a novel PLM fine-tuning strategy that handles noisy labels with guidance from large language models (LLMs) such as ChatGPT; this guidance helps accurately distinguish clean from noisy samples and provides supplementary information beyond the noisy labels, boosting the learning process during PLM fine-tuning.
results: Experimental results show that the framework offers clear advantages over state-of-the-art baselines on both synthetic and real-world noisy-label datasets.
Abstract
Adopting a two-stage paradigm of pretraining followed by fine-tuning, Pretrained Language Models (PLMs) have achieved substantial advancements in the field of natural language processing. However, in real-world scenarios, data labels are often noisy due to the complex annotation process, making it essential to develop strategies for fine-tuning PLMs with such noisy labels. To this end, we introduce an innovative approach for fine-tuning PLMs using noisy labels, which incorporates the guidance of Large Language Models (LLMs) like ChatGPT. This guidance assists in accurately distinguishing between clean and noisy samples and provides supplementary information beyond the noisy labels, thereby boosting the learning process during fine-tuning PLMs. Extensive experiments on synthetic and real-world noisy datasets further demonstrate the superior advantages of our framework over the state-of-the-art baselines.
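One concrete way to use LLM guidance for noisy labels, sketched here as an assumption rather than the paper's exact recipe, is to down-weight the loss on samples whose dataset label the LLM disputes:

```python
import torch
import torch.nn.functional as F

def guided_loss(logits, labels, llm_labels, noisy_weight=0.1):
    """Per-sample CE, down-weighted where an external LLM disputes the label.

    llm_labels: labels predicted by e.g. ChatGPT for the same inputs (guidance).
    """
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    clean = (labels == llm_labels).float()          # LLM agrees -> likely clean
    weights = clean + (1.0 - clean) * noisy_weight  # disputed samples count less
    return (weights * per_sample).mean()
```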
DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts
results: More effective than standard fine-tuning or LoRA adapters, improving target-language ASR performance while introducing only a negligible parameter overhead during inference.
Abstract
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still under-performs on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
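The knowledge-distillation component can be sketched as the usual cross-entropy-plus-KL objective, with whisper-large-v2 as teacher; the temperature and weighting below are illustrative, not the paper's settings:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """CE on gold transcripts plus KL toward the larger teacher's distribution."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),  # teacher = whisper-large-v2
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd
```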
COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances
results: The study finds that even the current best open-source multilingual model struggles on COPAL-ID, reaching 65.47% accuracy, significantly lower than on XCOPA-ID (79.40%). This shows that these language models still have difficulty understanding Indonesian local nuances.
Abstract
We present publicly available COPAL-ID, a novel Indonesian language common sense reasoning dataset. Unlike the previous Indonesian COPA dataset (XCOPA-ID), COPAL-ID incorporates Indonesian local and cultural nuances, and therefore, provides a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere. Professionally written by natives from scratch, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID. In addition, we present COPAL-ID in both standard Indonesian and in Jakartan Indonesian--a dialect commonly used in daily conversation. COPAL-ID poses a greater challenge for existing open-sourced and closed state-of-the-art multilingual language models, yet is trivially easy for humans. Our findings suggest that even the current best open-source, multilingual model struggles to perform well, achieving 65.47% accuracy on COPAL-ID, significantly lower than on the culturally-devoid XCOPA-ID (79.40%). Despite GPT-4's impressive score, it suffers the same performance degradation compared to its XCOPA-ID score, and it still falls short of human performance. This shows that these language models are still way behind in comprehending the local nuances of Indonesian.
Blending Reward Functions via Few Expert Demonstrations for Faithful and Accurate Knowledge-Grounded Dialogue Generation
paper_authors: Wanyu Du, Yangfeng Ji
for: Building trustworthy conversational information-seeking systems requires dialogue models that generate faithful and accurate responses grounded in relevant knowledge texts.
methods: We use reinforcement learning with a novel reward function that enables the model to learn from unreliable knowledge texts and generate high-quality responses.
results: Experiments on two conversational information-seeking datasets show that our method is competitive with other strong supervised-learning baselines.
Abstract
The development of trustworthy conversational information-seeking systems relies on dialogue models that can generate faithful and accurate responses based on relevant knowledge texts. However, two main challenges hinder this task. Firstly, language models may generate hallucinations due to data biases present in their pretraining corpus. Secondly, knowledge texts often contain redundant and irrelevant information that distracts the model's attention from the relevant text span. Previous works use additional data annotations on the knowledge texts to learn a knowledge identification module in order to bypass irrelevant information, but collecting such high-quality span annotations can be costly. In this work, we leverage reinforcement learning algorithms to overcome the above challenges by introducing a novel reward function. Our reward function combines an accuracy metric and a faithfulness metric to provide a balanced quality judgment of generated responses, which can be used as a cost-effective approximation to a human preference reward model when only a few preference annotations are available. Empirical experiments on two conversational information-seeking datasets demonstrate that our method can compete with other strong supervised learning baselines.
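The reward function described above blends an accuracy metric with a faithfulness metric; a minimal sketch, with both component metrics left as hypothetical callables and the blend weight tunable on the few available preference annotations:

```python
def blended_reward(response, reference, knowledge, accuracy_fn, faithfulness_fn,
                   alpha=0.5):
    """Balanced quality judgment: weighted blend of accuracy and faithfulness.

    accuracy_fn(response, reference):     e.g. token-level F1 (hypothetical)
    faithfulness_fn(response, knowledge): e.g. NLI entailment score (hypothetical)
    """
    return alpha * accuracy_fn(response, reference) \
        + (1 - alpha) * faithfulness_fn(response, knowledge)
```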
E3 TTS: Easy End-to-End Diffusion-based Text to Speech
results: Experiments show that the model can generate high-fidelity audio, approaching the performance of state-of-the-art neural TTS systems. Audio samples are available at https://e3tts.github.io.
Abstract
We propose Easy End-to-End Diffusion-based Text to Speech, a simple and efficient end-to-end text-to-speech model based on diffusion. E3 TTS directly takes plain text as input and generates an audio waveform through an iterative refinement process. Unlike many prior work, E3 TTS does not rely on any intermediate representations like spectrogram features or alignment information. Instead, E3 TTS models the temporal structure of the waveform through the diffusion process. Without relying on additional conditioning information, E3 TTS could support flexible latent structure within the given audio. This enables E3 TTS to be easily adapted for zero-shot tasks such as editing without any additional training. Experiments show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. Audio samples are available at https://e3tts.github.io.
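The iterative refinement is diffusion sampling over raw waveform samples; a generic DDPM-style sketch (the noise schedule and network are placeholders, not E3 TTS's exact formulation):

```python
import torch

@torch.no_grad()
def sample_waveform(model, text_emb, n_samples, betas):
    """Iteratively denoise Gaussian noise into a waveform, conditioned on text."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, n_samples)                   # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, t, text_emb)                 # predict the injected noise
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject noise
    return x
```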
Task-Agnostic Low-Rank Adapters for Unseen English Dialects
results: HyperLoRA achieves the best or most competitive performance across 5 dialects in a zero-shot setting while scaling more efficiently in the number of parameters than traditional approaches.
Abstract
Large Language Models (LLMs) are trained on corpora disproportionally weighted in favor of Standard American English. As a result, speakers of other dialects experience significantly more failures when interacting with these technologies. In practice, these speakers often accommodate their speech to be better understood. Our work shares the belief that language technologies should be designed to accommodate the diversity in English dialects and not the other way around. However, prior works on dialect struggle with generalizing to evolving and emerging dialects in a scalable manner. To fill this gap, our method, HyperLoRA, leverages expert linguistic knowledge to enable resource-efficient adaptation via hypernetworks. By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion. Not only is HyperLoRA more scalable in the number of parameters, but it also achieves the best or most competitive performance across 5 dialects in a zero-shot setting. In this way, our approach facilitates access to language technology for billions of English dialect speakers who are traditionally underrepresented.
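A hypernetwork producing LoRA adapters can be sketched as a small network that maps a vector of expert linguistic (dialect) features to the low-rank A and B matrices; the dimensions and feature encoding are assumptions:

```python
import torch
import torch.nn as nn

class HyperLoRA(nn.Module):
    """Generate LoRA matrices for one linear layer from dialect features."""
    def __init__(self, feat_dim, d_model, rank=8):
        super().__init__()
        self.rank, self.d = rank, d_model
        self.net = nn.Linear(feat_dim, 2 * rank * d_model)  # emits A and B jointly

    def forward(self, dialect_feats):  # e.g. a typological feature vector
        params = self.net(dialect_feats)
        A, B = params.split(self.rank * self.d)
        return A.view(self.rank, self.d), B.view(self.d, self.rank)

# delta_W = B @ A is the dialect-specific low-rank update to a frozen weight,
# so unseen dialects only need a feature vector, not any gradient updates.
```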
Self-Influence Guided Data Reweighting for Language Model Pre-training
results: Through extensive analysis, the authors find that PRESENCE promotes novelty and stability in model pre-training and generalizes across model sizes, datasets, and tasks.
Abstract
Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks. Once the pre-training corpus has been assembled, all data samples in the corpus are treated with equal importance during LM pre-training. However, due to varying levels of relevance and quality of data, equal importance to all the data samples may not be the optimal choice. While data reweighting has been explored in the context of task-specific supervised learning and LM fine-tuning, model-driven reweighting for pre-training data has not been explored. We fill this important gap and propose PRESENCE, a method for jointly reweighting samples by leveraging self-influence (SI) scores as an indicator of sample importance and pre-training. PRESENCE promotes novelty and stability for model pre-training. Through extensive analysis spanning multiple model sizes, datasets, and tasks, we present PRESENCE as an important first step in the research direction of sample reweighting for pre-training language models.
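Self-influence is often approximated by how strongly a sample affects its own loss; a heavily simplified sketch of SI-based reweighting follows, using per-sample loss as a stand-in for the SI score. The paper's exact estimator, and whether high-SI samples are up- or down-weighted, may differ:

```python
import torch

def self_influence_weights(per_sample_losses, temperature=1.0):
    """Turn self-influence proxies into normalized sample weights.

    Here the per-sample loss stands in for a self-influence score; a softmax
    converts scores into weights whose mean is approximately 1.
    """
    si = per_sample_losses.detach()
    return torch.softmax(si / temperature, dim=0) * len(si)

def reweighted_loss(per_sample_losses):
    w = self_influence_weights(per_sample_losses)
    return (w * per_sample_losses).mean()
```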
Re-weighting Tokens: A Simple and Effective Active Learning Strategy for Named Entity Recognition
results: Experimental results show that pairing the re-weighting strategy with existing acquisition functions yields substantial performance improvements for NER models.
Abstract
Active learning, a widely adopted technique for enhancing machine learning models in text and image classification tasks with limited annotation resources, has received relatively little attention in the domain of Named Entity Recognition (NER). The challenge of data imbalance in NER has hindered the effectiveness of active learning, as sequence labellers lack sufficient learning signals. To address these challenges, this paper presents a novel reweighting-based active learning strategy that assigns dynamic smoothed weights to individual tokens. This adaptable strategy is compatible with various token-level acquisition functions and contributes to the development of robust active learners. Experimental results on multiple corpora demonstrate the substantial performance improvement achieved by incorporating our re-weighting strategy into existing acquisition functions, validating its practical efficacy.
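The acquisition step can be sketched as a token-level uncertainty sum with dynamic smoothed token weights; the weighting scheme below (inverse label frequency with additive smoothing) is an illustrative stand-in for the paper's:

```python
def acquisition_score(token_entropies, predicted_labels, label_counts, eps=1.0):
    """Score one unlabeled sentence for annotation, re-weighting rare-label tokens.

    token_entropies:  model uncertainty per token
    predicted_labels: current model's predicted label per token
    label_counts:     how often each label appears in the labeled pool
    """
    total = sum(label_counts.values()) + eps * len(label_counts)
    score = 0.0
    for h, lab in zip(token_entropies, predicted_labels):
        w = total / (label_counts.get(lab, 0) + eps)  # smoothed inverse frequency
        score += w * h  # rare (e.g. entity) labels count more than majority "O"
    return score / len(token_entropies)
```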