cs.CL - 2023-09-18

Few-Shot Adaptation for Parsing Contextual Utterances with LLMs

  • paper_url: http://arxiv.org/abs/2309.10168
  • repo_url: https://github.com/microsoft/few_shot_adaptation_for_parsing_contextual_utterances_with_llms
  • paper_authors: Kevin Lin, Patrick Xia, Hao Fang
  • for: This paper studies how semantic parsers based on large language models (LLMs) handle contextual utterances in real-world settings.
  • methods: The paper examines four major paradigms for handling contextual utterances in conversational semantic parsing: Parse-with-Utterance-History, Parse-with-Reference-Program, Parse-then-Resolve, and Rewrite-then-Parse.
  • results: Experiments with in-context learning and fine-tuning show that Rewrite-then-Parse is the most promising paradigm when holistically considering parsing accuracy, annotation cost, and error types.
    Abstract We evaluate the ability of semantic parsers based on large language models (LLMs) to handle contextual utterances. In real-world settings, there typically exists only a limited number of annotated contextual utterances due to annotation cost, resulting in an imbalance compared to non-contextual utterances. Therefore, parsers must adapt to contextual utterances with a few training examples. We examine four major paradigms for doing so in conversational semantic parsing, i.e., Parse-with-Utterance-History, Parse-with-Reference-Program, Parse-then-Resolve, and Rewrite-then-Parse. To facilitate such cross-paradigm comparisons, we construct SMCalFlow-EventQueries, a subset of contextual examples from SMCalFlow with additional annotations. Experiments with in-context learning and fine-tuning suggest that Rewrite-then-Parse is the most promising paradigm when holistically considering parsing accuracy, annotation cost, and error types.
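To make the winning paradigm concrete, here is a minimal Python sketch of Rewrite-then-Parse: a context-dependent utterance is first rewritten into a self-contained one, then handed to an ordinary non-contextual parser. The `llm` helper and prompt wording are hypothetical placeholders, not the paper's implementation.

```python
# A minimal sketch of the Rewrite-then-Parse paradigm. The `llm` helper and
# prompt wording are placeholders, not the paper's actual implementation.

def llm(prompt: str) -> str:
    """Stub for any instruction-following LLM call (hypothetical)."""
    raise NotImplementedError

def rewrite_then_parse(history: list[str], utterance: str) -> str:
    # Step 1: rewrite the contextual utterance into a standalone one,
    # so the parser only ever sees non-contextual inputs.
    context = "\n".join(history)
    rewritten = llm(
        f"Dialogue so far:\n{context}\n"
        f"Rewrite the next utterance so it is self-contained:\n{utterance}"
    )
    # Step 2: parse the rewritten utterance with the ordinary
    # (non-contextual) semantic parser.
    return llm(f"Parse into a program:\n{rewritten}")
```

A practical benefit of this decomposition is that only the rewriter needs contextual training examples; the parser itself can be trained entirely on the plentiful non-contextual annotations.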

Understanding Catastrophic Forgetting in Language Models via Implicit Inference

  • paper_url: http://arxiv.org/abs/2309.10105
  • repo_url: https://github.com/kothasuhas/understanding-forgetting
  • paper_authors: Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
  • for: This study investigates the effects of fine-tuning on language models' performance on tasks outside the fine-tuning distribution.
  • methods: The authors study fine-tuning methods such as instruction-tuning and reinforcement learning from human feedback, and propose Conjugate Prompting to probe whether pretrained capabilities can be recovered.
  • results: Improving performance on tasks within the fine-tuning distribution suppresses model capabilities on other tasks, especially those closest to the fine-tuning distribution; Conjugate Prompting systematically recovers some pretrained capabilities, including in-context learning lost via instruction tuning and, more concerningly, harmful content generation suppressed by safety fine-tuning.
    Abstract Fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback) is a crucial step in training language models to robustly carry out tasks of interest. However, we lack a systematic understanding of the effects of fine-tuning, particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of suppressing model capabilities on other tasks. This degradation is especially pronounced for tasks "closest" to the fine-tuning distribution. We hypothesize that language models implicitly infer the task that the prompt corresponds to, and the fine-tuning process predominantly skews this task inference towards tasks in the fine-tuning distribution. To test this hypothesis, we propose Conjugate Prompting to see if we can recover pretrained capabilities. Conjugate prompting artificially makes the task look farther from the fine-tuning distribution while requiring the same capability. We find that conjugate prompting systematically recovers some of the pretraining capabilities on our synthetic setup. We then apply conjugate prompting to real-world LLMs using the observation that fine-tuning distributions are typically heavily skewed towards English. We find that simply translating the prompts to different languages can cause the fine-tuned models to respond like their pretrained counterparts instead. This allows us to recover the in-context learning abilities lost via instruction tuning, and more concerningly, to recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
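As a rough illustration of the translation-based conjugate prompting the abstract describes, the sketch below moves a prompt out of the (English-skewed) fine-tuning distribution and back; the Helsinki-NLP translation models are just examples, and `ask_model` stands in for any fine-tuned LLM.

```python
# A minimal sketch of conjugate prompting via translation, under the paper's
# observation that fine-tuning data skews English: translate the prompt into
# another language so the task looks farther from the fine-tuning
# distribution, then query the model. Model names are examples only.
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def conjugate_prompt(ask_model, prompt: str) -> str:
    # Transform the prompt out of the fine-tuning distribution...
    prompt_fr = en_fr(prompt)[0]["translation_text"]
    answer_fr = ask_model(prompt_fr)  # tends to behave more like the pretrained model
    # ...and map the answer back for comparison with the English response.
    return fr_en(answer_fr)[0]["translation_text"]
```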

Hierarchy Builder: Organizing Textual Spans into a Hierarchy to Facilitate Navigation

  • paper_url: http://arxiv.org/abs/2309.10057
  • repo_url: None
  • paper_authors: Itay Yair, Hillel Taub-Tabib, Yoav Goldberg
  • for: This study presents a method for exploratory settings in which a user wants both a broad overview of the available information and the chance to dive deeper into specific aspects.
  • methods: The method combines grouping and hierarchy construction: similar items are grouped together, and the remaining items are arranged into a navigable hierarchical DAG structure.
  • results: Applied to medical information extraction, the method helps users quickly get an overview of the extracted information and drill down into specific aspects.
    Abstract Information extraction systems often produce hundreds to thousands of strings on a specific topic. We present a method that facilitates better consumption of these strings, in an exploratory setting in which a user wants to both get a broad overview of what's available, and a chance to dive deeper on some aspects. The system works by grouping similar items together and arranging the remaining items into a hierarchical navigable DAG structure. We apply the method to medical information extraction.

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

  • paper_url: http://arxiv.org/abs/2309.10020
  • repo_url: None
  • paper_authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
  • for: This paper surveys the taxonomy and evolution of multimodal foundation models, emphasizing the transition from specialist models to general-purpose assistants.
  • methods: The survey covers five core research topics in two classes: (i) well-established areas, comprising learning vision backbones for visual understanding and text-to-image generation; and (ii) exploratory, open areas, comprising unified vision models inspired by large language models, end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs.
  • results: The paper targets researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who want to learn the basics and recent advances of multimodal foundation models.
    Abstract This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2309.09958
  • repo_url: https://github.com/haotian-liu/LLaVA
  • paper_authors: Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
  • for: This paper presents an empirical study of visual instruction tuning for open-source large multimodal models (LLaVA and MiniGPT-4), aiming to establish stronger baselines for future research.
  • methods: The study scales LLaVA up to 33B and 65B/70B parameters and examines the impact of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA.
  • results: Scaling the LMM consistently improves performance and language capabilities; LoRA/QLoRA tuning performs comparably to full-model fine-tuning; and higher image resolutions and mixed multimodal-language data further improve LMM performance.
    Abstract Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters or smaller. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on the multi-modal and language capabilities when completing real-world tasks in the wild. We find that scaling LMM consistently enhances model performance and improves language capabilities, and performance of LoRA/QLoRA tuning of LMM are comparable to the performance of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing multimodal-language data to improve LMM performance, and visual instruction tuning can sometimes improve LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.

Speaker attribution in German parliamentary debates with QLoRA-adapted large language models

  • paper_url: http://arxiv.org/abs/2309.09902
  • repo_url: None
  • paper_authors: Tobias Bornheim, Niklas Grieger, Patrick Gustav Blaneck, Stephan Bialonski
  • for: This paper aims to automate speaker attribution in German parliamentary debates to support computational text analysis.
  • methods: The authors fine-tune models from the Llama 2 family of large language models using QLoRA, an efficient training strategy.
  • results: Llama 2 achieves competitive speaker-attribution performance, pointing to a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.
    Abstract The growing body of political texts opens up new opportunities for rich insights into political dynamics and ideologies but also increases the workload for manual analysis. Automated speaker attribution, which detects who said what to whom in a speech event and is closely related to semantic role labeling, is an important processing step for computational text analysis. We study the potential of the large language model family Llama 2 to automate speaker attribution in German parliamentary debates from 2017-2021. We fine-tune Llama 2 with QLoRA, an efficient training strategy, and observe our approach to achieve competitive performance in the GermEval 2023 Shared Task On Speaker Attribution in German News Articles and Parliamentary Debates. Our results shed light on the capabilities of large language models in automating speaker attribution, revealing a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.
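For readers unfamiliar with QLoRA, here is a minimal sketch of the usual setup with Hugging Face `transformers` and `peft`: the base model is loaded in 4-bit NF4 quantization and only low-rank adapter weights are trained. The checkpoint name and hyperparameters are illustrative, not the paper's exact configuration.

```python
# A minimal QLoRA setup sketch: 4-bit quantized base model plus trainable
# low-rank adapters. Hyperparameters are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter weights are trainable
```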

Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10707
  • repo_url: None
  • paper_authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Raviteja Vemulapalli, Jen-Hao Rick Chang, Karren Yang, Gautam Varma Mantena, Oncel Tuzel
  • for: This paper proposes a new strategy for adapting automatic speech recognition (ASR) models to new target domains without any text or speech from those domains.
  • methods: A novel data synthesis pipeline uses a large language model (LLM) to generate a target-domain text corpus and a state-of-the-art controllable speech synthesis model to generate the corresponding speech; a simple yet effective in-context instruction fine-tuning strategy further improves the LLM's effectiveness on new domains.
  • results: Experiments on the SLURP dataset show an average relative word error rate improvement of 28% on unseen target domains, with no performance drop in source domains.
    Abstract While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily available in many scenarios. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. To accomplish this, we propose a novel data synthesis pipeline that uses a Large Language Model (LLM) to generate a target domain text corpus, and a state-of-the-art controllable speech synthesis model to generate the corresponding speech. We propose a simple yet effective in-context instruction finetuning strategy to increase the effectiveness of LLM in generating text corpora for new domains. Experiments on the SLURP dataset show that the proposed method achieves an average relative word error rate improvement of $28\%$ on unseen target domains without any performance drop in source domains.
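A minimal sketch of the two-stage synthesis pipeline the abstract describes is shown below; `llm` and `tts` are hypothetical stubs standing in for the paper's LLM and controllable speech synthesis model.

```python
# A minimal sketch of the data synthesis pipeline: an LLM drafts target-domain
# sentences, a controllable TTS model voices them, and the resulting pairs
# become ASR fine-tuning data. `llm` and `tts` are hypothetical stubs.

def llm(prompt: str) -> list[str]:
    """Stub: returns generated target-domain sentences (hypothetical)."""
    raise NotImplementedError

def tts(text: str) -> bytes:
    """Stub: returns synthesized audio for `text` (hypothetical)."""
    raise NotImplementedError

def synthesize_corpus(domain: str, n: int) -> list[tuple[bytes, str]]:
    sentences = llm(
        f"Write {n} realistic user utterances for the '{domain}' domain, one per line."
    )
    # Each (audio, transcript) pair can then fine-tune the ASR model.
    return [(tts(s), s) for s in sentences]
```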

Not Enough Labeled Data? Just Add Semantics: A Data-Efficient Method for Inferring Online Health Texts

  • paper_url: http://arxiv.org/abs/2309.09877
  • repo_url: None
  • paper_authors: Joseph Gatto, Sarah M. Preum
  • for: This work proposes a data-efficient, low-resource NLP approach based on Abstract Meaning Representation (AMR) graphs for health NLP tasks sourced from online health resources and communities, where texts are long and semantically challenging.
  • methods: AMR graphs model multi-sentence inputs, abstract away from complex terminology, and capture long-distance relationships between co-referring tokens, improving the ability of pre-trained language models to reason about high-complexity texts.
  • results: Augmenting text embeddings with semantic graph embeddings improves performance on six low-resource health NLP tasks; the approach is task-agnostic and easy to merge into any standard text classification pipeline.
    Abstract User-generated texts available on the web and social platforms are often long and semantically challenging, making them difficult to annotate. Obtaining human annotation becomes increasingly difficult as problem domains become more specialized. For example, many health NLP problems require domain experts to be a part of the annotation pipeline. Thus, it is crucial that we develop low-resource NLP solutions able to work with this set of limited-data problems. In this study, we employ Abstract Meaning Representation (AMR) graphs as a means to model low-resource Health NLP tasks sourced from various online health resources and communities. AMRs are well suited to model online health texts as they can represent multi-sentence inputs, abstract away from complex terminology, and model long-distance relationships between co-referring tokens. AMRs thus improve the ability of pre-trained language models to reason about high-complexity texts. Our experiments show that we can improve performance on 6 low-resource health NLP tasks by augmenting text embeddings with semantic graph embeddings. Our approach is task agnostic and easy to merge into any standard text classification pipeline. We experimentally validate that AMRs are useful in the modeling of complex texts by analyzing performance through the lens of two textual complexity measures: the Flesch Kincaid Reading Level and Syntactic Complexity. Our error analysis shows that AMR-infused language models perform better on complex texts and generally show less predictive variance in the presence of changing complexity.
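As a sketch of how the graph and text signals can be combined, the snippet below concatenates the two embeddings before a classifier head; the encoders and the concatenation fusion are assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch of fusing text and AMR-graph embeddings before a
# classifier head. The choice of encoders and concatenation fusion are
# assumptions, not the paper's specification.
import torch
import torch.nn as nn

class FusedClassifier(nn.Module):
    def __init__(self, text_dim: int, graph_dim: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(text_dim + graph_dim, n_classes)

    def forward(self, text_emb: torch.Tensor, graph_emb: torch.Tensor):
        # Concatenate the two views of the input, then classify.
        return self.head(torch.cat([text_emb, graph_emb], dim=-1))

# Usage: logits = FusedClassifier(768, 256, 6)(text_emb, graph_emb)
```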

Instruction-Following Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.09843
  • repo_url: https://github.com/abusufyanvu/6S191_MIT_DeepLearning
  • paper_authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang
  • for: This study explores the speech understanding and "reasoning" capabilities of large language models in speech processing.
  • methods: The authors train a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
  • results: Without requiring LLMs or pre-trained speech modules, the model interprets and executes simple instructions, including selectively transcribing portions of the speech, which provides an additional layer of privacy and safety.
    Abstract Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.

HypR: A comprehensive study for ASR hypothesis revising with a reference corpus

  • paper_url: http://arxiv.org/abs/2309.09838
  • repo_url: None
  • paper_authors: Yi-Wei Wang, Ke-Han Lu, Kuan-Yu Chen
  • for: Revising recognition results is a lightweight yet efficient way to further improve automatic speech recognition (ASR) performance.
  • methods: The study covers N-best reranking methods, which select the hypothesis with the lowest error rate from a set of ASR candidates, and error correction models, which detect and fix recognition errors in a given hypothesis.
  • results: The authors release the ASR hypothesis revising (HypR) dataset, covering commonly used corpora (AISHELL-1, TED-LIUM 2, and LibriSpeech) and providing 50 recognition hypotheses per speech utterance, and implement several classic and representative methods to showcase recent research progress.
    Abstract With the development of deep learning, automatic speech recognition (ASR) has made significant progress. To further enhance the performance, revising recognition results is one of the lightweight but efficient manners. Various methods can be roughly classified into N-best reranking methods and error correction models. The former aims to select the hypothesis with the lowest error rate from a set of candidates generated by ASR for a given input speech. The latter focuses on detecting recognition errors in a given hypothesis and correcting these errors to obtain an enhanced result. However, we observe that these studies are hardly comparable to each other as they are usually evaluated on different corpora, paired with different ASR models, and even use different datasets to train the models. Accordingly, we first concentrate on releasing an ASR hypothesis revising (HypR) dataset in this study. HypR contains several commonly used corpora (AISHELL-1, TED-LIUM 2, and LibriSpeech) and provides 50 recognition hypotheses for each speech utterance. The checkpoint models of the ASR are also published. In addition, we implement and compare several classic and representative methods, showing the recent research progress in revising speech recognition results. We hope the publicly available HypR dataset can become a reference benchmark for subsequent research and promote the school of research to an advanced level.
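To illustrate the N-best reranking family of methods, here is a minimal sketch that rescores hypotheses with a causal language model and keeps the most probable one; GPT-2 is just an example scorer, not one of the systems benchmarked on HypR.

```python
# A minimal sketch of N-best reranking: score each ASR hypothesis with a
# causal LM and keep the one the LM finds most probable.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def rerank(hypotheses: list[str]) -> str:
    def nll(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        # The LM loss is the mean per-token negative log-likelihood.
        return lm(ids, labels=ids).loss.item()
    return min(hypotheses, key=nll)
```

In practice the LM score is usually interpolated with the ASR model's own acoustic score rather than used alone, but the ranking skeleton is the same.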

AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification

  • paper_url: http://arxiv.org/abs/2309.09800
  • repo_url: https://github.com/update-for-integrated-business-ai/amurd
  • paper_authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt
  • for: This work builds a new multilingual dataset for the problem of receipt information extraction.
  • methods: The study uses the InstructLLaMA approach to address key challenges in information extraction and item classification.
  • results: InstructLLaMA achieves an F1 score of 0.76 and an accuracy of 0.68 for key information extraction and item classification.
    Abstract Key information extraction involves recognizing and extracting text from scanned receipts, enabling retrieval of essential content, and organizing it into structured documents. This paper presents a novel multilingual dataset for receipt extraction, addressing key challenges in information extraction and item classification. The dataset comprises $47,720$ samples, including annotations for item names, attributes like (price, brand, etc.), and classification into $44$ product categories. We introduce the InstructLLaMA approach, achieving an F1 score of $0.76$ and an accuracy of $0.68$ for key information extraction and item classification. We provide code, datasets, and checkpoints at https://github.com/Update-For-Integrated-Business-AI/AMuRD.

Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement

  • paper_url: http://arxiv.org/abs/2309.09799
  • repo_url: None
  • paper_authors: Shanglin Lei, Xiaoping Wang, Guanting Dong, Jiang Li, Yingjian Liu
  • for: This work aims to improve emotion recognition in conversation and its generalization to diverse scenarios.
  • methods: The authors propose a Hybrid Continuous Attributive Network (HCAN) with a hybrid recurrent and attention-based module to model global emotion continuity, and a novel Emotional Attribution Encoding (EAE) to model intra- and inter-emotional attribution for each utterance.
  • results: The model achieves state-of-the-art performance on three datasets, and extensive comparative experiments and ablation studies on three benchmarks support the efficacy of each module.
    Abstract Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods face challenges in achieving generalization to diverse scenarios due to insufficient modeling of context, ambiguous capture of dialogue relationships and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues in the perspective of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. Then a novel Emotional Attribution Encoding (EAE) is proposed to model intra- and inter-emotional attribution for each utterance. Moreover, aiming to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, A comprehensive loss function emotional cognitive loss $\mathcal{L}_{\rm EC}$ is proposed to alleviate emotional drift and overcome the overfitting of the model to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Another extensive comparative experiments and ablation studies on three benchmarks are conducted to provided evidence to support the efficacy of each module. Further exploration of generalization ability experiments shows the plug-and-play nature of the EAE module in our method.

The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings

  • paper_url: http://arxiv.org/abs/2309.09783
  • repo_url: None
  • paper_authors: Michal Mochtak, Peter Rupnik, Nikola Ljubešić
  • for: This paper studies the role sentiment plays in political decision-making and how it can be studied and measured systematically.
  • methods: The paper presents a new dataset of sentiment-annotated sentences, used in a series of experiments to train a robust sentiment classifier for parliamentary proceedings; it also introduces the first domain-specific LLM for political science applications, additionally pre-trained on 1.72 billion domain-specific words from the proceedings of 27 European parliaments.
  • results: Additional pre-training on parliamentary data significantly improves the model's downstream performance on domain-specific tasks; multilingual models perform well on unseen languages, and additional data from other languages significantly improves results on the target parliament.
    Abstract Sentiments inherently drive politics. How we receive and process information plays an essential role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses. If sentiment plays such an important role in politics, how can we study and measure it systematically? The paper presents a new dataset of sentiment-annotated sentences, which are used in a series of experiments focused on training a robust sentiment classifier for parliamentary proceedings. The paper also introduces the first domain-specific LLM for political science applications additionally pre-trained on 1.72 billion domain-specific words from proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training of LLM on parliamentary data can significantly improve the model downstream performance on the domain-specific tasks, in our case, sentiment detection in parliamentary proceedings. We further show that multilingual models perform very well on unseen languages and that additional data from other languages significantly improves the target parliament's results. The paper makes an important contribution to multiple domains of social sciences and bridges them with computer science and computational linguistics. Lastly, it sets up a more robust approach to sentiment analysis of political texts in general, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.

Facilitating NSFW Text Detection in Open-Domain Dialogue Systems via Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2309.09749
  • repo_url: https://github.com/qiuhuachuan/CensorChat
  • paper_authors: Huachuan Qiu, Shuai Zhang, Hongliang He, Anqi Li, Zhenzhong Lan
  • for: This work aims to improve the detection of NSFW language in open-domain dialogue systems to safeguard user safety and well-being in digital conversations.
  • methods: The study builds CensorChat, an NSFW dialogue detection dataset, via knowledge distillation involving GPT-4 and ChatGPT: ChatGPT annotates unlabeled data as a training set, while rationale validation and test sets are constructed with a self-criticism strategy to resolve labeling discrepancies.
  • results: A BERT model fine-tuned as a text classifier on the pseudo-labeled data accurately detects NSFW language in dialogue while respecting users' freedom of expression.
    Abstract NSFW (Not Safe for Work) content, in the context of a dialogue, can have severe side effects on users in open-domain dialogue systems. However, research on detecting NSFW language, especially sexually explicit content, within a dialogue context has significantly lagged behind. To address this issue, we introduce CensorChat, a dialogue monitoring dataset aimed at NSFW dialogue detection. Leveraging knowledge distillation techniques involving GPT-4 and ChatGPT, this dataset offers a cost-effective means of constructing NSFW content detectors. The process entails collecting real-life human-machine interaction data and breaking it down into single utterances and single-turn dialogues, with the chatbot delivering the final utterance. ChatGPT is employed to annotate unlabeled data, serving as a training set. Rationale validation and test sets are constructed using ChatGPT and GPT-4 as annotators, with a self-criticism strategy for resolving discrepancies in labeling. A BERT model is fine-tuned as a text classifier on pseudo-labeled data, and its performance is assessed. The study emphasizes the importance of AI systems prioritizing user safety and well-being in digital conversations while respecting freedom of expression. The proposed approach not only advances NSFW content detection but also aligns with evolving user protection needs in AI-driven dialogues.
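A minimal sketch of the final distillation step, fine-tuning a BERT classifier on LLM-pseudo-labeled dialogue turns, is given below; the base checkpoint, label set, and training loop are assumptions, not CensorChat's exact setup.

```python
# A minimal sketch of the distillation step: fine-tune a BERT classifier on
# dialogue turns pseudo-labeled by a stronger LLM. Checkpoint and label set
# are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = safe, 1 = NSFW (assumed labels)
)
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts: list[str], pseudo_labels: list[int]) -> float:
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor(pseudo_labels)).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
    return loss.item()
```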

When Large Language Models Meet Citation: A Survey

  • paper_url: http://arxiv.org/abs/2309.09727
  • repo_url: None
  • paper_authors: Yang Zhang, Yufei Wang, Kai Wang, Quan Z. Sheng, Lina Yao, Adnan Mahmood, Wei Emma Zhang, Rongying Zhao
  • for: This paper reviews the use of large language models (LLMs) for in-text citation analysis tasks, and how citation linkage knowledge can be used to improve the text representations of LLMs.
  • methods: The paper discusses the application of LLMs for citation classification, citation-based summarization, and citation recommendation, as well as the use of citation prediction, network structure information, and inter-document relationships to improve text representations.
  • results: The paper provides a preliminary review of the mutually beneficial relationship between LLMs and citation analysis, and highlights promising avenues for further investigation.
    Abstract Citations in scholarly work serve the essential purpose of acknowledging and crediting the original sources of knowledge that have been incorporated or referenced. Depending on their surrounding textual context, these citations are used for different motivations and purposes. Large Language Models (LLMs) could be helpful in capturing these fine-grained citation information via the corresponding textual context, thereby enabling a better understanding towards the literature. Furthermore, these citations also establish connections among scientific papers, providing high-quality inter-document relationships and human-constructed knowledge. Such information could be incorporated into LLMs pre-training and improve the text representation in LLMs. Therefore, in this paper, we offer a preliminary review of the mutually beneficial relationship between LLMs and citation analysis. Specifically, we review the application of LLMs for in-text citation analysis tasks, including citation classification, citation-based summarization, and citation recommendation. We then summarize the research pertinent to leveraging citation linkage knowledge to improve text representations of LLMs via citation prediction, network structure information, and inter-document relationship. We finally provide an overview of these contemporary methods and put forth potential promising avenues in combining LLMs and citation analysis for further investigation.

Dealing with negative samples with multi-task learning on span-based joint entity-relation extraction

  • paper_url: http://arxiv.org/abs/2309.09713
  • repo_url: None
  • paper_authors: Chenguang Xue, Jiamin Lu
  • for: span-based joint entity-relation extraction models
  • methods: multitask learning, the intersection over union (IoU) concept, and entity logits
  • results: mitigates the adverse effects of excessive negative samples, achieving F1 scores of 73.61%, 53.72%, and 83.72% on three widely used public datasets (CoNLL04, SciERC, and ADE)
    Abstract Recent span-based joint extraction models have demonstrated significant advantages in both entity recognition and relation extraction. These models treat text spans as candidate entities, and span pairs as candidate relationship tuples, achieving state-of-the-art results on datasets like ADE. However, these models encounter a significant number of non-entity spans or irrelevant span pairs during the tasks, impairing model performance significantly. To address this issue, this paper introduces a span-based multitask entity-relation joint extraction model. This approach employs the multitask learning to alleviate the impact of negative samples on entity and relation classifiers. Additionally, we leverage the Intersection over Union(IoU) concept to introduce the positional information into the entity classifier, achieving a span boundary detection. Furthermore, by incorporating the entity Logits predicted by the entity classifier into the embedded representation of entity pairs, the semantic input for the relation classifier is enriched. Experimental results demonstrate that our proposed SpERT.MT model can effectively mitigate the adverse effects of excessive negative samples on the model performance. Furthermore, the model demonstrated commendable F1 scores of 73.61\%, 53.72\%, and 83.72\% on three widely employed public datasets, namely CoNLL04, SciERC, and ADE, respectively.
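Since the abstract leans on the IoU concept for span boundary detection, here is a minimal sketch of IoU between two token spans; the half-open (start, end) offset convention is an assumption.

```python
# A minimal sketch of span IoU, the overlap measure used to inject positional
# information into the entity classifier: intersection length over union
# length of a candidate span and a gold span (half-open token offsets assumed).
def span_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. span_iou((2, 6), (4, 8)) == 2 / 6  # two shared tokens out of six
```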

Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels

  • paper_url: http://arxiv.org/abs/2309.09697
  • repo_url: https://github.com/panatchakorn-a/bias-eval-nli-considering-all-labels
  • paper_authors: Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
  • for: This work evaluates discriminatory social biases, including gender bias, in pre-trained language models (PLMs) on natural language inference (NLI).
  • methods: The authors propose an NLI bias evaluation method that considers all output labels: evaluation data are created and grouped by their expected biased output labels, and a bias measure is defined on the corresponding label outputs of each group.
  • results: A meta-evaluation shows the proposed measure evaluates bias more accurately than the baseline; the method applies across languages, and the authors evaluate PLMs in English, Japanese, and Chinese to confirm their bias tendencies.
    Abstract Discriminatory social biases, including gender biases, have been found in Pre-trained Language Models (PLMs). In Natural Language Inference (NLI), recent bias evaluation methods have observed biased inferences from the outputs of a particular label such as neutral or entailment. However, since different biased inferences can be associated with different output labels, it is inaccurate for a method to rely on one label. In this work, we propose an evaluation method that considers all labels in the NLI task. We create evaluation data and assign them into groups based on their expected biased output labels. Then, we define a bias measure based on the corresponding label output of each data group. In the experiment, we propose a meta-evaluation method for NLI bias measures, and then use it to confirm that our measure can evaluate bias more accurately than the baseline. Moreover, we show that our evaluation method is applicable to multiple languages by conducting the meta-evaluation on PLMs in three different languages: English, Japanese, and Chinese. Finally, we evaluate PLMs of each language to confirm their bias tendency. To our knowledge, we are the first to build evaluation datasets and measure the bias of PLMs from the NLI task in Japanese and Chinese.
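One plausible formalization of an all-label bias measure, not necessarily the paper's exact definition, is sketched below: group examples by their expected biased label and aggregate how often the model actually emits it.

```python
# A minimal sketch of an all-label bias measure: for each group of examples
# sharing an expected biased label, measure how often the model emits that
# label. The aggregation is one plausible formalization, an assumption
# rather than the paper's exact definition.
from collections import Counter

def group_bias(predictions: list[str], expected_biased_label: str) -> float:
    # Fraction of predictions that fall on the group's expected biased label.
    counts = Counter(predictions)
    return counts[expected_biased_label] / max(len(predictions), 1)

def bias_score(groups: dict[str, list[str]]) -> float:
    # `groups` maps each expected biased label (entailment / neutral /
    # contradiction) to the model's predictions on that group's examples.
    return sum(group_bias(p, lbl) for lbl, p in groups.items()) / len(groups)
```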

Do learned speech symbols follow Zipf’s law?

  • paper_url: http://arxiv.org/abs/2309.09690
  • repo_url: None
  • paper_authors: Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari
  • for: This study investigates whether speech symbols learned through deep learning follow Zipf's law, as natural language symbols do.
  • methods: The study analyzes data-driven symbolizations of speech content learned by deep models to determine whether they follow Zipf's law.
  • results: Data-driven speech symbols follow Zipf's law, just as natural language symbols do, opening new ways to apply statistical analysis to spoken language processing.
    Abstract In this study, we investigate whether speech symbols, learned through deep learning, follow Zipf's law, akin to natural language symbols. Zipf's law is an empirical law that delineates the frequency distribution of words, forming fundamentals for statistical analysis in natural language processing. Natural language symbols, which are invented by humans to symbolize speech content, are recognized to comply with this law. On the other hand, recent breakthroughs in spoken language processing have given rise to the development of learned speech symbols; these are data-driven symbolizations of speech content. Our objective is to ascertain whether these data-driven speech symbols follow Zipf's law, as the same as natural language symbols. Through our investigation, we aim to forge new ways for the statistical analysis of spoken language processing.
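A minimal sketch of the kind of check the study performs: count learned symbols, sort by frequency, and fit the rank-frequency relation on log-log axes (Zipf's law predicts a slope near -1).

```python
# A minimal Zipf's-law check: fit log-frequency against log-rank for any
# sequence of discrete learned speech symbols.
from collections import Counter
import numpy as np

def zipf_slope(symbols: list[int]) -> float:
    freqs = np.array(sorted(Counter(symbols).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    # Least-squares fit of log-frequency on log-rank.
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    return slope  # a value close to -1.0 suggests Zipfian behavior
```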

Multi-turn Dialogue Comprehension from a Topic-aware Perspective

  • paper_url: http://arxiv.org/abs/2309.09666
  • repo_url: None
  • paper_authors: Xinbei Ma, Yi Xu, Hai Zhao, Zhuosheng Zhang
  • for: This paper targets multi-turn dialogue comprehension, which requires language models to decouple and model dialogue passages whose topics may shift as the conversation develops, making topic-aware modeling non-trivial.
  • methods: The paper proposes an unsupervised dialogue segmentation algorithm that splits a passage into topic-concentrated fragments used as topic-aware processing units, a clustering system with a self-training auto-encoder for in-domain topic detection (evaluated on two constructed datasets), and the Topic-Aware Dual-Attention Matching (TADAM) network for multi-turn response selection.
  • results: Experiments on three public benchmarks show substantial improvements over baselines; the work extends prior studies of document topics and brings dialogue modeling to a topic-aware perspective with extensive experiments and analyses.
    Abstract Dialogue related Machine Reading Comprehension requires language models to effectively decouple and model multi-turn dialogue passages. As a dialogue development goes after the intentions of participants, its topic may not keep constant through the whole passage. Hence, it is non-trivial to detect and leverage the topic shift in dialogue modeling. Topic modeling, although has been widely studied in plain text, deserves far more utilization in dialogue reading comprehension. This paper proposes to model multi-turn dialogues from a topic-aware perspective. We start with a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way. Then we use these fragments as topic-aware language processing units in further dialogue comprehension. On one hand, the split segments indict specific topics rather than mixed intentions, thus showing convenient on in-domain topic detection and location. For this task, we design a clustering system with a self-training auto-encoder, and we build two constructed datasets for evaluation. On the other hand, the split segments are an appropriate element of multi-turn dialogue response selection. For this purpose, we further present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements and matches response candidates with a dual cross-attention. Empirical studies on three public benchmarks show great improvements over baselines. Our work continues the previous studies on document topic, and brings the dialogue modeling to a novel topic-aware perspective with exhaustive experiments and analyses.
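As a rough sketch of the dual cross-attention matching idea behind TADAM, the module below lets topic segments and a response candidate attend to each other and scores the pair; the dimensions and mean-pooling are assumptions, not the paper's exact architecture.

```python
# A minimal sketch of dual cross-attention matching between topic segments
# and a response candidate, in the spirit of TADAM. Dimensions and pooling
# are illustrative assumptions.
import torch
import torch.nn as nn

class DualAttentionMatcher(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.seg_to_resp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.resp_to_seg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, segments: torch.Tensor, response: torch.Tensor):
        # segments: (B, S, dim) topic-segment encodings; response: (B, R, dim).
        a, _ = self.seg_to_resp(segments, response, response)  # segments attend to response
        b, _ = self.resp_to_seg(response, segments, segments)  # response attends to segments
        pooled = torch.cat([a.mean(dim=1), b.mean(dim=1)], dim=-1)
        return self.score(pooled).squeeze(-1)  # matching logit per candidate
```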

A Novel Method of Fuzzy Topic Modeling based on Transformer Processing

  • paper_url: http://arxiv.org/abs/2309.09658
  • repo_url: None
  • paper_authors: Ching-Hsun Tseng, Shin-Jye Lee, Po-Wei Cheng, Chien Lee, Chih-Chieh Hung
  • for: This study proposes a fuzzy topic modeling method based on soft clustering and document embedding, aimed at monitoring market trends.
  • methods: The method uses a state-of-the-art transformer-based model for embeddings together with soft clustering, and is applied to press release monitoring.
  • results: In the practical press-release application, fuzzy topic modeling produces more natural results than the traditional LDA output.
    Abstract Topic modeling is admittedly a convenient way to monitor market trends. Conventionally, Latent Dirichlet Allocation (LDA) is considered a must-do model to gain this type of information. Given LDA's merit of deducing keywords from token conditional probabilities, we can identify the most probable or essential topics. However, the results are not intuitive because the discovered topics cannot wholly fit human knowledge. LDA offers the first possible relevant keywords, which raises the question of whether the connection is reliable, as it rests only on statistical probability. It is also hard to decide the number of topics manually in advance. Following the booming trend of using fuzzy membership for clustering and transformers for embedding words, this work presents fuzzy topic modeling based on soft clustering and document embeddings from a state-of-the-art transformer-based model. In our practical application to press release monitoring, fuzzy topic modeling gives more natural results than the traditional output from LDA.
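To make the soft-clustering step concrete, here is a minimal sketch of the standard fuzzy c-means membership update over document embeddings; the paper's exact clustering procedure may differ.

```python
# A minimal sketch of fuzzy (soft) clustering over document embeddings: the
# standard fuzzy c-means membership update, where each document gets a degree
# of membership in every topic instead of one hard label. Embeddings are
# assumed to come from any transformer encoder.
import numpy as np

def fuzzy_memberships(X: np.ndarray, centers: np.ndarray, m: float = 2.0) -> np.ndarray:
    # X: (n_docs, dim); centers: (n_topics, dim). Returns (n_docs, n_topics)
    # memberships u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); rows sum to 1.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=-1)
```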

Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer

  • paper_url: http://arxiv.org/abs/2309.09652
  • repo_url: None
  • paper_authors: Peter Ochieng
  • for: To speed up diffusion-based vocoders while preserving speech quality.
  • methods: The layers of a neural network are trained to progressively denoise the input, with targets taken from the outputs of the forward-process time steps; a skip parameter τ > 1 lets each layer cumulatively remove the noise injected over τ forward steps, reducing the number of data distribution recovery steps.
  • results: The proposed diffusion vocoder generates high-fidelity speech in competitive time, outperforms current state-of-the-art tools, and generalizes well to unseen speech.
    Abstract Diffusion based vocoders have been criticised for being slow due to the many steps required during sampling. Moreover, the model's loss function that is popularly implemented is designed such that the target is the original input $x_0$ or error $\epsilon_0$. For early time steps of the reverse process, this results in large prediction errors, which can lead to speech distortions and increase the learning time. We propose a setup where the targets are the different outputs of forward process time steps with a goal to reduce the magnitude of prediction errors and reduce the training time. We use the different layers of a neural network (NN) to perform denoising by training them to learn to generate representations similar to the noised outputs in the forward process of the diffusion. The NN layers learn to progressively denoise the input in the reverse process until finally the final layer estimates the clean speech. To avoid 1:1 mapping between layers of the neural network and the forward process steps, we define a skip parameter $\tau>1$ such that an NN layer is trained to cumulatively remove the noise injected in the $\tau$ steps in the forward process. This significantly reduces the number of data distribution recovery steps and, consequently, the time to generate speech. We show through extensive evaluation that the proposed technique generates high-fidelity speech in competitive time that outperforms current state-of-the-art tools. The proposed technique is also able to generalize well to unseen speech.
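A minimal sketch of the τ-spaced targets the abstract describes: each layer of an L-layer denoiser is trained toward the forward-process output τ steps cleaner than its predecessor's. The closed-form DDPM noising is standard; the layer-to-step mapping here is an illustration of the idea, not the paper's exact recipe.

```python
# A sketch of building tau-spaced training targets from the diffusion forward
# process. The layer-to-step mapping below is an illustrative assumption.
import torch

def noised_target(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # Closed-form q(x_t | x_0): sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    eps = torch.randn_like(x0)
    return alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * eps

def layer_targets(x0: torch.Tensor, n_layers: int, tau: int, alphas_cumprod):
    # Deeper layers target cleaner signals; the final layer targets x_0 itself,
    # so each layer cumulatively removes tau forward steps of noise.
    return [noised_target(x0, (n_layers - 1 - i) * tau, alphas_cumprod)
            for i in range(n_layers)]
```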

Summarization is (Almost) Dead

  • paper_url: http://arxiv.org/abs/2309.09558
  • repo_url: None
  • paper_authors: Xiao Pu, Mingqi Gao, Xiaojun Wan
  • for: This paper evaluates how well large language models (LLMs) generate summaries.
  • methods: The authors develop new datasets and conduct human evaluation experiments to assess the zero-shot generation capability of LLMs across five distinct summarization tasks.
  • results: Human evaluators prefer LLM-generated summaries over human-written summaries and those from fine-tuned models, citing better factual consistency and fewer extrinsic hallucinations; the authors argue that most conventional text summarization work is no longer necessary in the era of LLMs, though higher-quality datasets and more reliable evaluation methods remain worth exploring.
    Abstract How well can large language models (LLMs) generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.

Training dynamic models using early exits for automatic speech recognition on resource-constrained devices

  • paper_url: http://arxiv.org/abs/2309.09546
  • repo_url: None
  • paper_authors: George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Alessio Brutti
  • for: This paper enables dynamically adjusting the computational load of neural models at inference time, which is crucial for on-device processing where computational power is limited and time-varying.
  • methods: Early-exit architectures with intermediate exit branches are applied to large-vocabulary speech recognition; besides using pre-trained backbones, the authors also train models from scratch with an early-exit architecture.
  • results: Early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also improve task accuracy compared to single-exit or pre-trained models; an exit selection strategy based on posterior probabilities is also investigated as an alternative to frame-based entropy.
    Abstract The possibility of dynamically modifying the computational load of neural models at inference time is crucial for on-device processing, where computational power is limited and time-varying. Established approaches for neural model compression exist, but they provide architecturally static models. In this paper, we investigate the use of early-exit architectures, that rely on intermediate exit branches, applied to large-vocabulary speech recognition. This allows for the development of dynamic models that adjust their computational cost to the available resources and recognition performance. Unlike previous works, besides using pre-trained backbones we also train the model from scratch with an early-exit architecture. Experiments on public datasets show that early-exit architectures from scratch not only preserve performance levels when using fewer encoder layers, but also improve task accuracy as compared to using single-exit models or using pre-trained models. Additionally, we investigate an exit selection strategy based on posterior probabilities as an alternative to frame-based entropy.
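A minimal sketch of posterior-probability exit selection: run encoder blocks in order and stop at the first exit whose average top-1 posterior clears a threshold. The layer/exit interfaces and the confidence statistic are assumptions for illustration.

```python
# A minimal sketch of confidence-based early exiting: stop at the first exit
# whose posterior confidence clears a threshold. Interfaces are illustrative.
import torch

def early_exit_decode(x, layers, exit_heads, threshold: float = 0.9):
    # layers[i] is an encoder block; exit_heads[i] maps its output to
    # per-frame posteriors over the output vocabulary.
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        posteriors = torch.softmax(head(x), dim=-1)
        confidence = posteriors.max(dim=-1).values.mean()  # avg top-1 posterior
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return posteriors
```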

Adapting Large Language Models via Reading Comprehension

  • paper_url: http://arxiv.org/abs/2309.09530
  • repo_url: https://github.com/microsoft/lmops
  • paper_authors: Daixuan Cheng, Shaohan Huang, Furu Wei
  • for: This study explores how continued pre-training on domain-specific corpora influences large language models, finding that training on raw corpora endows domain knowledge but drastically hurts the model's prompting ability for question answering.
  • methods: The authors propose a simple method for transforming raw corpora into reading comprehension texts, enriching each raw text with a series of tasks related to its content; the method is highly scalable and applicable to any pre-training corpus.
  • results: The resulting 7B language model achieves performance competitive with much larger domain-specific models such as BloombergGPT-50B across biomedicine, finance, and law; domain-specific reading comprehension texts also improve performance on general benchmarks, suggesting the potential for a general model across even more domains.
    Abstract We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taken inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
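To illustrate the transformation, here is a minimal sketch that turns a raw domain text into a reading-comprehension example by appending content-grounded tasks; the task template (and where the QA pairs come from) is a hypothetical stand-in for the paper's pipeline.

```python
# A minimal sketch of turning a raw domain document into a reading
# comprehension example by appending content-grounded tasks. The template
# and task-mining step are hypothetical stand-ins for the paper's pipeline.
def to_reading_comprehension(raw_text: str, qa_pairs: list[tuple[str, str]]) -> str:
    # qa_pairs would come from mining the text itself (e.g., cloze or
    # summarization tasks); here they are supplied for illustration.
    tasks = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in qa_pairs)
    return f"{raw_text}\n\n{tasks}"

example = to_reading_comprehension(
    "Aspirin inhibits the enzyme cyclooxygenase.",
    [("Which enzyme does aspirin inhibit?", "Cyclooxygenase.")],
)
```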

Improved Factorized Neural Transducer Model For text-only Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.09524
  • repo_url: https://github.com/cppan-packages/43c6cb2c61134ec7e23098e41ca6ee7bfe3573342f9f6f196bc095247e062001
  • paper_authors: Junzhe Liu, Jianwei Yu, Xie Chen
  • for: End-to-end models such as the neural Transducer integrate acoustic and linguistic information to achieve excellent recognition performance, but adapting them with text-only data is challenging.
  • methods: The factorized neural Transducer (FNT) introduces a separate vocabulary decoder to predict the vocabulary, enabling traditional text-data adaptation; the proposed improved factorized neural Transducer (IFNT) is designed to comprehensively integrate acoustic and language information while enabling effective text adaptation.
  • results: After text-only adaptation, IFNT yields 7.9% to 28.5% relative WER improvements over the standard neural Transducer with shallow fusion, and 1.6% to 8.2% relative WER reductions over the FNT model on the EuroParl, TED-LIUM, and Medical test sets (with in-domain experiments on GigaSpeech).
    Abstract End-to-end models, such as the neural Transducer, have been successful in integrating acoustic and linguistic information jointly to achieve excellent recognition performance. However, adapting these models with text-only data is challenging. Factorized neural Transducer (FNT) aims to address this issue by introducing a separate vocabulary decoder to predict the vocabulary, which can effectively perform traditional text data adaptation. Nonetheless, this approach has limitations in fusing acoustic and language information seamlessly. Moreover, a degradation in word error rate (WER) on the general test sets was also observed, leading to doubts about its overall performance. In response to this challenge, we present an improved factorized neural Transducer (IFNT) model structure designed to comprehensively integrate acoustic and language information while enabling effective text adaptation. We evaluate the performance of our proposed methods through in-domain experiments on GigaSpeech and out-of-domain experiments adapting to EuroParl, TED-LIUM, and Medical datasets. After text-only adaptation, IFNT yields 7.9% to 28.5% relative WER improvements over the standard neural Transducer with shallow fusion, and relative WER reductions ranging from 1.6% to 8.2% on the three test sets compared to the FNT model.

Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

  • paper_url: http://arxiv.org/abs/2309.09508
  • repo_url: None
  • paper_authors: Jinsheng Pan, Zichen Wang, Weihong Qi, Hanjia Lyu, Jiebo Luo
  • for: This paper aims to address the gap in our understanding of the disparities in framing political issues between news media and social media outlets.
  • methods: The authors conduct a comprehensive investigation, focusing on the nuanced distinctions in the framing of social media and traditional media outlets concerning a series of American Supreme Court rulings on affirmative action, student loans, and abortion rights. They use both qualitative and quantitative methods to compare the framing of these issues across different media platforms.
  • results: The authors find that while there is some overlap in framing between social media and traditional media outlets, there are substantial differences in the way these issues are framed, particularly in terms of polarization. They observe that social media platforms tend to present more polarized stances across all framing categories, while traditional news media tend to exhibit more consensus on the topic of student loans, but more polarization on affirmative action and abortion rights. These findings have significant implications for the formation of public opinion, policy decision-making, and the broader political landscape.
    Abstract Understanding the framing of political issues is of paramount importance as it significantly shapes how individuals perceive, interpret, and engage with these matters. While prior research has independently explored framing within news media and by social media users, there remains a notable gap in our comprehension of the disparities in framing political issues between these two distinct groups. To address this gap, we conduct a comprehensive investigation, focusing on the nuanced distinctions both qualitatively and quantitatively in the framing of social media and traditional media outlets concerning a series of American Supreme Court rulings on affirmative action, student loans, and abortion rights. Our findings reveal that, while some overlap in framing exists between social media and traditional media outlets, substantial differences emerge both across various topics and within specific framing categories. Compared to traditional news media, social media platforms tend to present more polarized stances across all framing categories. Further, we observe significant polarization in the news media's treatment (i.e., Left vs. Right leaning media) of affirmative action and abortion rights, whereas the topic of student loans tends to exhibit a greater degree of consensus. The disparities in framing between traditional and social media platforms carry significant implications for the formation of public opinion, policy decision-making, and the broader political landscape.

LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09506
  • repo_url: https://github.com/projectnuwa/layoutnuwa
  • paper_authors: Zecheng Tang, Chenfei Wu, Juntao Li, Nan Duan
  • for: This paper addresses graphic layout generation, which plays a significant role in user engagement and information perception.
  • methods: It proposes LayoutNUWA, a model that treats layout generation as a code generation task to enhance semantic information and harness the hidden layout expertise of large language models (LLMs). The Code Instruct Tuning (CIT) approach comprises three interconnected modules: 1) a Code Initialization (CI) module that quantifies the numerical conditions and initializes them as HTML code with strategically placed masks; 2) a Code Completion (CC) module that uses the formatting knowledge of LLMs to fill in the masked portions; and 3) a Code Rendering (CR) module that transforms the completed code into the final layout output, ensuring a highly interpretable and transparent generation procedure.
  • results: LayoutNUWA attains significant state-of-the-art performance (over 50% improvements) on multiple datasets, showcasing its strong capabilities.
    Abstract Graphic layout generation, a growing research field, plays a significant role in user engagement and information perception. Existing methods primarily treat layout generation as a numerical optimization task, focusing on quantitative aspects while overlooking the semantic information of layout, such as the relationship between each layout element. In this paper, we propose LayoutNUWA, the first model that treats layout generation as a code generation task to enhance semantic information and harness the hidden layout expertise of large language models (LLMs). More concretely, we develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules: 1) the Code Initialization (CI) module quantifies the numerical conditions and initializes them as HTML code with strategically placed masks; 2) the Code Completion (CC) module employs the formatting knowledge of LLMs to fill in the masked portions within the HTML code; 3) the Code Rendering (CR) module transforms the completed code into the final layout output, ensuring a highly interpretable and transparent layout generation procedure that directly maps code to a visualized layout. We attain significant state-of-the-art performance (even over 50% improvements) on multiple datasets, showcasing the strong capabilities of LayoutNUWA. Our code is available at https://github.com/ProjectNUWA/LayoutNUWA.
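The three CIT modules can be illustrated with a toy pipeline: numeric layout conditions become masked HTML (CI), a language model fills the masks (CC), and the completed HTML is rendered in a browser (CR). The `<M>` mask token, the `div`-based template, and the prompt wording are assumptions for illustration, not the released format.

```python
MASK = "<M>"  # assumed mask token the LLM is asked to fill

def code_initialization(elements):
    """CI: turn (possibly partial) numeric layout conditions into masked HTML."""
    rows = []
    for e in elements:  # each element: {'type': ..., optional geometry in px}
        style = "; ".join(
            f"{attr}:{e.get(attr, MASK)}px" for attr in ("left", "top", "width", "height")
        )
        rows.append(f'<div class="{e["type"]}" style="{style}"></div>')
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"

def code_completion(masked_html, llm):
    """CC: ask an LLM to fill the masked attributes; `llm` is any callable
    mapping a prompt string to completed HTML."""
    prompt = ("Fill in every masked value (<M>) in this HTML so the elements "
              "form a plausible layout:\n" + masked_html)
    return llm(prompt)

# CR: open the completed HTML in a browser, mapping code directly to a layout.

masked = code_initialization([
    {"type": "title", "left": 40, "top": 20},  # geometry partly specified
    {"type": "image"},                         # fully masked element
])
print(masked)
```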

Search and Learning for Unsupervised Text Generation

  • paper_url: http://arxiv.org/abs/2309.09497
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Lili Mou
  • for: This paper discusses text generation with deep learning in the artificial intelligence (AI) community, motivated by its wide applications and its role as an essential component of AI.
  • methods: It introduces search and learning approaches to unsupervised text generation: a heuristic objective function estimates the quality of a candidate sentence, discrete search algorithms generate a sentence by maximizing the search objective, and a machine learning model learns from the search results to smooth out noise and improve efficiency.
  • results: The search-and-learning approach produces high-quality text without supervision; it matters in practice for building minimal viable products for new tasks and has high social impact by saving human annotation labor and supporting low-resource languages.
    Abstract With the advances of deep learning techniques, text generation is attracting increasing interest in the artificial intelligence (AI) community, because of its wide applications and because it is an essential component of AI. Traditional text generation systems are trained in a supervised way, requiring massive labeled parallel corpora. In this paper, I will introduce our recent work on search and learning approaches to unsupervised text generation, where a heuristic objective function estimates the quality of a candidate sentence, and discrete search algorithms generate a sentence by maximizing the search objective. A machine learning model further learns from the search results to smooth out noise and improve efficiency. Our approach is important to the industry for building minimal viable products for a new task; it also has high social impacts for saving human annotation labor and for processing low-resource languages.
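The search component is easy to demonstrate: hill-climbing over word-level edits, scored by a heuristic objective. The scorer below is a deliberately toy stand-in; objectives in this line of work typically combine language-model fluency, semantic similarity to the input, and task-specific terms.

```python
import random

def heuristic_objective(sentence: str) -> float:
    """Toy stand-in for the search objective (real ones combine LM fluency,
    semantic similarity, and task-specific scores)."""
    words = sentence.split()
    repeats = sum(1 for w in set(words) if words.count(w) > 1)
    return -abs(len(words) - 5) - 2 * repeats  # prefer ~5 words, no repeats

def local_edits(sentence: str, vocab: list[str]) -> list[str]:
    """Neighborhood: word deletion, replacement, and insertion."""
    words = sentence.split()
    out = []
    for i in range(len(words)):
        out.append(" ".join(words[:i] + words[i + 1:]))                           # delete
        out.append(" ".join(words[:i] + [random.choice(vocab)] + words[i + 1:]))  # replace
        out.append(" ".join(words[:i] + [random.choice(vocab)] + words[i:]))      # insert
    return [s for s in out if s]

def hill_climb(sentence: str, vocab: list[str], steps: int = 200) -> str:
    """Greedy discrete search that maximizes the heuristic objective."""
    best, best_score = sentence, heuristic_objective(sentence)
    for _ in range(steps):
        cand = random.choice(local_edits(best, vocab))
        if (score := heuristic_objective(cand)) > best_score:
            best, best_score = cand, score
    return best

vocab = ["the", "model", "generates", "fluent", "text", "by", "search"]
print(hill_climb("the the model model text text", vocab))
```

In the full search-and-learning loop, the search outputs then serve as pseudo-targets for training a sequence-to-sequence model, which smooths out search noise and replaces the slow search at inference time.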

Investigating Zero- and Few-shot Generalization in Fact Verification

  • paper_url: http://arxiv.org/abs/2309.09444
  • repo_url: https://github.com/teacherpeterpan/fact-checking-generalization
  • paper_authors: Liangming Pan, Yunxiang Zhang, Min-Yen Kan
  • for: This work investigates zero- and few-shot generalization for fact verification (FV), aiming to extend FV models trained on well-resourced domains (e.g., Wikipedia) to low-resource domains that lack human annotations.
  • methods: The authors first construct a benchmark collection of 11 FV datasets representing 6 domains, then conduct an empirical analysis showing that current models generalize poorly across them; the analysis identifies several factors that affect generalization, including dataset size, evidence length, and claim type.
  • results: Finally, they show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation.
    Abstract In this paper, we explore zero- and few-shot generalization for fact verification (FV), which aims to generalize the FV model trained on well-resourced domains (e.g., Wikipedia) to low-resourced domains that lack human annotations. To this end, we first construct a benchmark dataset collection which contains 11 FV datasets representing 6 domains. We conduct an empirical analysis of generalization across these FV datasets, finding that current models generalize poorly. Our analysis reveals that several factors affect generalization, including dataset size, length of evidence, and the type of claims. Finally, we show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation.
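The claim-generation direction can be sketched as prompting a generator to produce both supported and refuted claims from in-domain evidence, yielding labeled training triples. The prompt wording and FEVER-style labels below are assumptions for illustration, not the paper's exact templates.

```python
def claim_prompt(evidence: str, label: str) -> str:
    """Build a generation prompt for one (evidence, label) pair."""
    instruction = {
        "SUPPORTS": "Write a short factual claim that is fully supported by the evidence.",
        "REFUTES": "Write a short factual claim that the evidence contradicts.",
    }[label]
    return f"Evidence: {evidence}\n{instruction}\nClaim:"

def generate_fv_training_data(evidence_corpus, generator):
    """Create (claim, evidence, label) triples for a low-resource domain.
    `generator` is any text-in/text-out callable, e.g. an LLM API wrapper."""
    examples = []
    for evidence in evidence_corpus:
        for label in ("SUPPORTS", "REFUTES"):
            claim = generator(claim_prompt(evidence, label)).strip()
            examples.append({"claim": claim, "evidence": evidence, "label": label})
    return examples
```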

Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter

  • paper_url: http://arxiv.org/abs/2309.09443
  • repo_url: None
  • paper_authors: Song Li, Yongbin You, Xuezhi Wang, Ke Ding, Guanglu Wan
  • for: Improve multilingual speech recognition, a crucial component of speech interaction, to broaden the applications of multilingual AI assistants and facilitate international communication.
  • methods: Proposes two simple and parameter-efficient methods, language prompt tuning and a frame-level language adapter, to enhance language-configurable and language-agnostic multilingual speech recognition respectively, and explores integrating the two via parameter-efficient fine-tuning.
  • results: Experiments demonstrate significant performance improvements across seven languages using the proposed methods.
    Abstract Multilingual intelligent assistants, such as ChatGPT, have recently gained popularity. To further expand the applications of multilingual artificial intelligence assistants and facilitate international communication, it is essential to enhance the performance of multilingual speech recognition, which is a crucial component of speech interaction. In this paper, we propose two simple and parameter-efficient methods: language prompt tuning and frame-level language adapter, to respectively enhance language-configurable and language-agnostic multilingual speech recognition. Additionally, we explore the feasibility of integrating these two approaches using parameter-efficient fine-tuning methods. Our experiments demonstrate significant performance improvements across seven languages using our proposed methods.
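Both components are lightweight enough to sketch directly: language prompt tuning prepends learnable per-language embeddings to the encoder input, and the frame-level language adapter applies a small per-language bottleneck transform to every frame. The dimensions, the single utterance-level language ID, and the residual adapter form below are assumptions (the paper's adapter works with frame-level language information).

```python
import torch
import torch.nn as nn

class PromptAndAdapterSketch(nn.Module):
    """Per-language soft prompts plus a frame-level bottleneck adapter."""
    def __init__(self, num_langs=7, dim=256, prompt_len=4, bottleneck=64):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_langs, prompt_len, dim))
        self.down = nn.ModuleList([nn.Linear(dim, bottleneck) for _ in range(num_langs)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, dim) for _ in range(num_langs)])

    def forward(self, frames: torch.Tensor, lang_id: int) -> torch.Tensor:
        # frames: (B, T, dim) acoustic features; lang_id selects the language.
        prompt = self.prompts[lang_id].unsqueeze(0).expand(frames.size(0), -1, -1)
        # Residual frame-level adapter applied to every frame.
        adapted = frames + self.up[lang_id](torch.relu(self.down[lang_id](frames)))
        return torch.cat([prompt, adapted], dim=1)  # prompt-prefixed sequence

x = torch.randn(2, 100, 256)
print(PromptAndAdapterSketch()(x, lang_id=3).shape)  # torch.Size([2, 104, 256])
```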