cs.CL - 2023-09-15

An Empirical Study on Instance Selection Strategies in Self-training for Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2309.08777
  • repo_url: None
  • paper_authors: Haochen Liu, Sai Krishna Rallabandi, Yijing Wu, Parag Pravin Dakle, Preethi Raghavan
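  • for: This study investigates how instance selection strategies and hyper-parameters affect the performance of self-training in various few-shot settings for sentiment analysis.
  • methods: The study applies self-training and runs an empirical comparison of different instance selection strategies and hyper-parameters.
  • results: The study finds that the choice of instance selection strategy and hyper-parameters strongly affects self-training performance, and that the best-performing strategy and hyper-parameters differ across few-shot settings.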
    Abstract Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a larger amount of unlabeled data. However, the performance of a self-training procedure heavily relies on the choice of the instance selection strategy, which has not been studied thoroughly. This paper presents an empirical study on various instance selection strategies for self-training on two public sentiment datasets, and investigates the influence of the strategy and hyper-parameters on the performance of self-training in various few-shot settings.
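    Summary Sentiment analysis is an important task in natural language processing that involves extracting subjective sentiment from text. Self-training is an economical and efficient technique for building sentiment analysis models from a small amount of labeled data and a larger amount of unlabeled data. However, the choice of instance selection strategy strongly affects model performance. Through an empirical study of instance selection strategies on two public sentiment datasets, this paper examines how the strategy affects self-training performance in various few-shot settings.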
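
One common way to select instances in self-training is to keep only the unlabeled examples whose predicted label is confident enough. The sketch below illustrates that idea; it is a minimal illustration under stated assumptions, not the paper's procedure. The function name `select_by_confidence`, the `threshold` and `top_k` parameters, and their default values are hypothetical, and this listing does not say which strategies the authors actually compare.

```python
from typing import List, Optional, Sequence, Tuple


def select_by_confidence(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,          # hypothetical default, not from the paper
    top_k: Optional[int] = None,     # hypothetical cap on selected instances
) -> List[Tuple[int, int]]:
    """Return (example_index, pseudo_label) pairs for confident predictions.

    `probs[i]` is the model's class-probability vector for unlabeled example i.
    An example is selected when its highest class probability reaches
    `threshold`; the result is ordered from most to least confident.
    """
    scored = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)  # argmax class
        if p[label] >= threshold:
            scored.append((p[label], i, label))
    # Most confident first; ties broken by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    if top_k is not None:
        scored = scored[:top_k]
    return [(i, label) for _, i, label in scored]
```

In a self-training loop, the selected pairs would be added to the labeled set before the model is retrained. Raising `threshold` trades fewer pseudo-labels for higher expected label quality.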

Generating Semantic Graph Corpora with Graph Expansion Grammar

  • paper_url: http://arxiv.org/abs/2309.08714
  • repo_url: None
  • paper_authors: Eric Andersson, Johanna Björklund, Frank Drewes, Anna Jonsson
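  • for: Creating corpora of semantic graphs.
  • methods: Uses graph expansion grammar as the representational language, so users control the generated set of graphs by writing a grammar.
  • results: The tool generates sets of graphs that are well-formed according to the grammar, for augmenting existing corpora and for teaching formal language theory.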
    Abstract We introduce Lovelace, a tool for creating corpora of semantic graphs. The system uses graph expansion grammar as a representational language, thus allowing users to craft a grammar that describes a corpus with desired properties. When given such grammar as input, the system generates a set of output graphs that are well-formed according to the grammar, i.e., a graph bank. The generation process can be controlled via a number of configurable parameters that allow the user to, for example, specify a range of desired output graph sizes. Central use cases are the creation of synthetic data to augment existing corpora, and as a pedagogical tool for teaching formal language theory.
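    Summary We introduce Lovelace, a tool for building corpora of semantic graphs. The system uses graph expansion grammar as its representational language, so users can write a grammar that describes the corpus they want. Given such a grammar as input, the system generates a set of graphs that conform to it, i.e., a graph bank. Generation can be controlled through configurable parameters, for example by specifying a range of output graph sizes. The main use cases are creating synthetic data to augment existing corpora and serving as a teaching tool for formal language theory.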

Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning

  • paper_url: http://arxiv.org/abs/2309.08708
  • repo_url: https://github.com/mlsw/dynamic-embedding-pruning
  • paper_authors: Miles Williams, Nikolaos Aletras
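  • for: Reducing the memory footprint of pre-trained language models (PLMs) so they can be deployed in memory-constrained settings such as cloud environments or on-device.
  • methods: PLMs represent extensive vocabularies with embedding matrices, which make up a large share of the model parameters. Prior work has pruned parameters within the transformer layers, but pruning the embedding matrix had not been explored.
  • results: The authors first show that much of the vocabulary goes unused in these scenarios. They then propose a simple yet effective method that uses this finding to prune part of the embedding matrix. The method gives substantial memory reductions across a range of models and tasks, maintains downstream task performance, and makes more efficient use of compute resources.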
    Abstract The extensive memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings, such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, forming a large proportion of the model parameters. While previous work towards parameter-efficient PLM development has considered pruning parameters within the transformer layers, pruning the embedding matrix as part of fine-tuning or inference has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused in these scenarios. We then propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix. We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach maintains equivalent downstream task performance while allowing a more efficient use of compute resources.
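    Summary The large memory footprint of PLMs can hinder deployment in memory-constrained settings such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, and these matrices account for a large share of the model parameters. Earlier work on parameter-efficient PLMs has pruned parameters within the transformer layers, but pruning the embedding matrix during fine-tuning or inference has not been explored. We first show that much of the vocabulary remains unused in these scenarios. We then propose a simple yet effective method that uses this finding to reduce the memory footprint of the embedding matrix. The method gives substantial memory savings across a range of models and tasks while keeping downstream task performance equivalent and using compute resources more efficiently.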
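
The idea of dropping unused vocabulary rows can be sketched as follows: collect the token ids that actually occur in the data, keep only those rows of the embedding matrix, and remap the ids to the smaller matrix. This is an illustrative sketch only, not the authors' implementation (see the linked repository for that). The function name `prune_embeddings` and the plain list-of-lists representation are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def prune_embeddings(
    embedding: Sequence[Sequence[float]],
    token_id_batches: Sequence[Sequence[int]],
) -> Tuple[List[List[float]], Dict[int, int], List[List[int]]]:
    """Keep only embedding rows for token ids that occur in the data.

    Returns the pruned matrix, the old-id -> new-id mapping, and the
    input batches rewritten with the new ids.
    """
    # Token ids that appear at least once, in ascending order.
    used = sorted({t for batch in token_id_batches for t in batch})
    old_to_new = {old: new for new, old in enumerate(used)}
    # Copy only the rows that are actually needed.
    pruned = [list(embedding[old]) for old in used]
    # Rewrite inputs so lookups index into the smaller matrix.
    remapped = [[old_to_new[t] for t in batch] for batch in token_id_batches]
    return pruned, old_to_new, remapped
```

Every lookup into the pruned matrix with a remapped id returns the same vector as the original lookup, so the model's outputs are unchanged while the unused rows no longer occupy memory.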

Sparse Autoencoders Find Highly Interpretable Features in Language Models

  • paper_url: http://arxiv.org/abs/2309.08600
  • repo_url: https://github.com/hoagyc/sparse_coding
  • paper_authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
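  • for: This study addresses polysemanticity inside neural networks in order to better understand how they work internally.
  • methods: The study uses sparse autoencoders to reconstruct the internal activations of a language model and extract sets of features that are more interpretable and monosemantic.
  • results: The study finds that sparse autoencoders can resolve superposition in language models, which improves interpretability and steerability and enables precise model editing.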
    Abstract One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Ablating these features enables precise model editing, for example, by removing capabilities such as pronoun prediction, while disrupting model behaviour less than prior techniques. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
    Summary To address polysemanticity, we use sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by other approaches. By ablating these features, we can precisely edit the model, for example by removing capabilities such as pronoun prediction, while disrupting the model's behavior less than prior techniques. This work demonstrates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our approach may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
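
A sparse autoencoder of the general kind described above can be sketched in a few lines: an encoder maps an activation vector to an overcomplete set of non-negative feature activations, a decoder reconstructs the activation from them, and the loss adds a sparsity penalty to the reconstruction error. The ReLU encoder, the L1 penalty, the mean-squared reconstruction term, and all names below (`sae_forward`, `sae_loss`, `l1_coeff`) are assumptions chosen for illustration; the abstract does not specify the authors' exact architecture or loss.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def sae_forward(x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix) -> Tuple[List[float], List[float]]:
    """Encode activation `x` into feature activations and reconstruct it.

    `w_enc` and `w_dec` both have one row per feature; each row of `w_dec`
    is that feature's direction in activation space.
    """
    # Encoder: affine map followed by ReLU, so feature activations are >= 0.
    codes = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]
    # Decoder: reconstruction is a weighted sum of feature directions.
    x_hat = [
        sum(c * w_dec[k][j] for k, c in enumerate(codes))
        for j in range(len(x))
    ]
    return codes, x_hat


def sae_loss(x: Vector, codes: Vector, x_hat: Vector, l1_coeff: float) -> float:
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(c) for c in codes)
    return mse + l1_coeff * l1
```

With more features than activation dimensions, the sparsity penalty pushes each input to be explained by only a few active features, which is what makes individual features candidates for human interpretation.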

“Merge Conflicts!” Exploring the Impacts of External Distractors to Parametric Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2309.08594
  • repo_url: https://github.com/qiancheng0/ekd_impacts_pkg
  • paper_authors: Cheng Qian, Xinran Zhao, Sherry Tongshuang Wu
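  • for: This paper examines how large language models (LLMs) handle external knowledge during interactions with users.
  • methods: The authors propose a framework for systematically exploring the interaction between an LLM's parametric knowledge and external knowledge. They construct a parametric knowledge graph to reveal the different knowledge structures of LLMs, and introduce external knowledge through distractors of varying methods, positions, and formats.
  • results: Experiments show that LLMs tend to deviate from the answers given by their parametric knowledge when they encounter direct conflicts or confounding changes of information. Although LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings point to a risk when current LLMs integrate external knowledge during interactions. All data and results are publicly available.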
    Abstract Large language models (LLMs) acquire extensive knowledge during pre-training, known as their parametric knowledge. However, in order to remain up-to-date and align with human instructions, LLMs inevitably require external knowledge during their interactions with users. This raises a crucial question: How will LLMs respond when external knowledge interferes with their parametric knowledge? To investigate this question, we propose a framework that systematically elicits LLM parametric knowledge and introduces external knowledge. Specifically, we uncover the impacts by constructing a parametric knowledge graph to reveal the different knowledge structures of LLMs, and introduce external knowledge through distractors of varying degrees, methods, positions, and formats. Our experiments on both black-box and open-source models demonstrate that LLMs tend to produce responses that deviate from their parametric knowledge, particularly when they encounter direct conflicts or confounding changes of information within detailed contexts. We also find that while LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings highlight the risk of hallucination when integrating external knowledge, even indirectly, during interactions with current LLMs. All the data and results are publicly available.
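    Summary To study how LLMs respond when external knowledge interferes with their parametric knowledge, we construct a parametric knowledge graph that reveals the different knowledge structures of LLMs, and introduce external knowledge through distractors of varying degrees to see how the models respond in different situations. Our experiments show that LLMs tend to produce responses that deviate from their parametric knowledge when they encounter direct conflicts or confounding changes of information. We also find that, although LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These results point to a risk of hallucination when current LLMs integrate external knowledge. All data and results are publicly available.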

Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings

  • paper_url: http://arxiv.org/abs/2309.08591
  • repo_url: https://github.com/UKPLab/maps
  • paper_authors: Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, Iryna Gurevych
  • for: This paper investigates whether multilingual large language models (mLLMs) can reason with proverbs and sayings in a conversational context, and how well they understand these cultural references.
  • methods: The authors use a variety of state-of-the-art mLLMs to test the models’ ability to reason with proverbs and sayings, and they create a new evaluation dataset called MAPS (MulticultrAl Proverbs and Sayings) for six different languages.
  • results: The authors find that mLLMs have limited knowledge of proverbs and struggle to reason with figurative proverbs and sayings, and there is a “culture gap” in mLLMs when reasoning about proverbs and sayings translated from other languages.
    Abstract Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in situational context, human expectations vary depending on the relevant cultural common ground. As human languages are associated with diverse cultures, LLMs should also be culturally-diverse reasoners. In this paper, we study the ability of a wide range of state-of-the-art multilingual LLMs (mLLMs) to reason with proverbs and sayings in a conversational context. Our experiments reveal that: (1) mLLMs 'knows' limited proverbs and memorizing proverbs does not mean understanding them within a conversational context; (2) mLLMs struggle to reason with figurative proverbs and sayings, and when asked to select the wrong answer (instead of asking it to select the correct answer); and (3) there is a "culture gap" in mLLMs when reasoning about proverbs and sayings translated from other languages. We construct and release our evaluation dataset MAPS (MulticultrAl Proverbs and Sayings) for proverb understanding with conversational context for six different languages.
    Summary
  1. mLLMs have limited knowledge of proverbs, and memorizing proverbs does not mean understanding them in a conversational context.
  2. mLLMs struggle to reason with figurative proverbs and sayings, and also struggle when asked to select the wrong answer instead of the correct one.
  3. There is a "culture gap" in mLLMs when reasoning about proverbs and sayings translated from other languages.
  The authors construct and release MAPS (MulticultrAl Proverbs and Sayings), an evaluation dataset for proverb understanding with conversational context in six different languages. The dataset is intended to help researchers better understand the limitations and potential of mLLMs when reasoning across cultures.

Neural Machine Translation Models Can Learn to be Few-shot Learners

  • paper_url: http://arxiv.org/abs/2309.08590
  • repo_url: None
  • paper_authors: Raphael Reinauer, Patrick Simianer, Kaden Uhlig, Johannes E. M. Mosig, Joern Wuebker
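  • for: This paper examines in-context learning, the ability of large language models to learn new domains and tasks from a few examples, and shows how a much smaller model can acquire it through a specialized training objective.
  • methods: The study fine-tunes a model towards a carefully designed training objective so that it can perform domain adaptation from a few examples.
  • results: The approach achieves high translation quality and a high immediate adaptation rate, and allows efficient batch inference on a mix of domains.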
    Abstract The emergent ability of Large Language Models to use a small number of examples to learn to perform in novel domains and tasks is called in-context learning (ICL). In this work, we show that a much smaller model can be trained to perform ICL by fine-tuning towards a specialized training objective, exemplified on the task of domain adaptation for neural machine translation. With this capacity for ICL, the model can take advantage of relevant few-shot examples to adapt its output towards the domain. We compare the quality of this domain adaptation to traditional supervised techniques and ICL with a 40B-parameter Large Language Model. Our approach allows efficient batch inference on a mix of domains and outperforms state-of-the-art baselines in terms of both translation quality and immediate adaptation rate, i.e. the ability to reproduce a specific term after being shown a single example.
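    Summary The emergent ability of large language models to learn new domains and tasks from a small number of examples is called in-context learning (ICL). In this work, we show that a much smaller model can be trained to perform ICL by fine-tuning towards a specialized training objective, demonstrated on domain adaptation for neural machine translation. With this capacity, the model can use relevant few-shot examples to adapt its output to the domain. We compare against traditional supervised techniques and against ICL with a 40B-parameter large language model. Our approach allows efficient batch inference on a mix of domains and outperforms current baselines in both translation quality and immediate adaptation rate, i.e., the ability to reproduce a specific term after seeing a single example.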

ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer

  • paper_url: http://arxiv.org/abs/2309.08583
  • repo_url: https://github.com/asaakyan/explain-st
  • paper_authors: Arkadiy Saakyan, Smaranda Muresan
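  • for: This paper proposes a framework that augments and improves a formality style transfer dataset with explanations via model distillation from ChatGPT, and refines the generated explanations with expert human feedback.
  • methods: The paper uses model distillation from ChatGPT and incorporates expert feedback through ICLEF (In-Context Learning from Expert Feedback).
  • results: Current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on this task, while fine-tuning on the authors' high-quality dataset gives significant improvements. Human evaluation shows that models much smaller than ChatGPT, once fine-tuned on the dataset, align better with expert preferences. The paper also discusses two applications of models fine-tuned for explainable style transfer: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.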
    Abstract While state-of-the-art language models excel at the style transfer task, current work does not address explainability of style transfer systems. Explanations could be generated using large language models such as GPT-3.5 and GPT-4, but the use of such complex systems is inefficient when smaller, widely distributed, and transparent alternatives are available. We propose a framework to augment and improve a formality style transfer dataset with explanations via model distillation from ChatGPT. To further refine the generated explanations, we propose a novel way to incorporate scarce expert human feedback using in-context learning (ICLEF: In-Context Learning from Expert Feedback) by prompting ChatGPT to act as a critic to its own outputs. We use the resulting dataset of 9,960 explainable formality style transfer instances (e-GYAFC) to show that current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, and that fine-tuning on our high-quality dataset leads to significant improvements as shown by automatic evaluation. In human evaluation, we show that models much smaller than ChatGPT fine-tuned on our data align better with expert preferences. Finally, we discuss two potential applications of models fine-tuned on the explainable style transfer task: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.
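    Summary State-of-the-art language models perform well on style transfer, but existing work does not address the explainability of style transfer systems. We propose a framework that uses model distillation from ChatGPT to augment a formality style transfer dataset with explanations. To further refine the generated explanations, we propose a new way to incorporate expert feedback through in-context learning (ICLEF: In-Context Learning from Expert Feedback), improving ChatGPT's outputs. We use the resulting explainable style transfer instances (e-GYAFC) to show that current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, and that fine-tuning on our high-quality dataset leads to significant improvements in automatic evaluation. In human evaluation, models fine-tuned on our data align better with expert preferences than ChatGPT. Finally, we discuss two potential applications of models fine-tuned on this task: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.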

Casteist but Not Racist? Quantifying Disparities in Large Language Model Bias between India and the West

  • paper_url: http://arxiv.org/abs/2309.08573
  • repo_url: None
  • paper_authors: Khyati Khandelwal, Manuel Tonneau, Andrew M. Bean, Hannah Rose Kirk, Scott A. Hale
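  • for: This study quantifies bias in large language models (LLMs) and how that bias appears in the Indian context.
  • methods: The study builds a new dataset, Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and anti-stereotypical examples for caste and religion in Indian society. Testing several popular LLMs on it shows that most are strongly biased toward stereotypes in the Indian context.
  • results: In the Indian context, LLM bias appears mainly for caste and religion and is stronger than in the Western context. The authors also find that Instruction Prompting, a simple intervention, significantly reduces both stereotypical and anti-stereotypical biases in most cases for GPT-3.5.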
    Abstract Large Language Models (LLMs), now used daily by millions of users, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an Indian-centric frame and compare bias levels between the Indian and Western contexts. To do this, we develop a novel dataset which we call Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and anti-stereotypical examples for caste and religion contexts. We find that the majority of LLMs tested are strongly biased towards stereotypes in the Indian context, especially as compared to the Western context. We finally investigate Instruction Prompting as a simple intervention to mitigate such bias and find that it significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for GPT-3.5. The findings of this work highlight the need for including more diverse voices when evaluating LLMs.
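    Summary Large language models (LLMs), used daily by millions of users, can encode societal biases and expose their users to representational harms. A large body of research on LLM bias exists, but it mostly adopts a Western-centric frame and pays comparatively little attention to bias levels and potential harms in the Global South. Using a new dataset, Indian-BhED (Indian Bias Evaluation Dataset), this paper measures the bias levels of LLMs in the Indian and Western contexts. We find that most of the LLMs tested are strongly biased toward stereotypes in the Indian context, especially compared with the Western context. Finally, we investigate Instruction Prompting as a simple mitigation and find that it reduces both stereotypical and anti-stereotypical biases in most cases. These findings highlight the importance of including more diverse voices when evaluating LLMs.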

Augmenting conformers with structured state space models for online speech recognition

  • paper_url: http://arxiv.org/abs/2309.08551
  • repo_url: None
  • paper_authors: Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof Choromanski, Tara Sainath
  • for: This study examines neural encoders for online speech recognition, where the model only accesses context to the left.
  • methods: The paper augments neural encoders with structured state-space sequence models (S4) to improve online ASR. The authors perform systematic ablation studies to compare S4 variants and propose two novel approaches that combine S4 with convolutions.
  • results: The most effective design stacks a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. The authors' best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
    Abstract Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), which are a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
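    Summary Online speech recognition, where the model can only access left context, is an important and challenging use case for ASR systems. In this work, we study augmenting neural encoders for online ASR with structured state-space sequence models (S4), which provide an efficient way of accessing left context. We run systematic ablation studies comparing S4 variants and propose two new approaches that combine S4 with convolutions. We find that the most effective design stacks a small S4 using real-valued recurrent weights with a local convolution so that the two work complementarily. Our best model achieves WERs of 4.01%/8.53% on Librispeech test sets, outperforming Conformers with extensively tuned convolution.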
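
The WER figures quoted above are word error rates. As commonly defined, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the system output, divided by the number of reference words. The sketch below computes that standard definition; the function name `wer` is hypothetical, and real scoring tools usually also normalize case and punctuation, which is not done here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% for very poor hypotheses.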

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

  • paper_url: http://arxiv.org/abs/2309.08531
  • repo_url: https://github.com/ms-dot-k/Image-to-Speech-Captioning
  • paper_authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
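  • for: This paper proposes a powerful and efficient Image-to-Speech captioning (Im2Sp) model.
  • methods: The paper imports knowledge of image comprehension and language modeling from a large-scale pre-trained vision-language model, and sets the output of Im2Sp to discretized speech units so that the language modeling capability can be incorporated.
  • results: With the vision-language pre-training strategy, the paper sets new state-of-the-art Im2Sp results on two widely used benchmark databases, COCO and Flickr8k. The paper also proposes a way to improve the efficiency of the Im2Sp model.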
    Abstract In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning.
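    Summary In this paper, we propose methods for building a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we import rich knowledge of image comprehension and language modeling from a large-scale pre-trained vision-language model. We set the output of Im2Sp to discretized speech units, i.e., the quantized speech features of a self-supervised speech model. These units mainly carry linguistic information while suppressing other characteristics of speech, which lets us bring the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With this vision-language pre-training strategy, we set new state-of-the-art Im2Sp results on two widely used benchmarks, COCO and Flickr8k. We then further improve the efficiency of the model. As with speech units, we convert the original image into image units derived through vector quantization of the raw image. With these image units, the storage required for image data drops to just 0.8% of the original in terms of bits, a 99.2% reduction. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning.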
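
The abstract describes both speech units and image units as the result of quantizing continuous features. A generic vector quantization step can be sketched as follows: each feature vector is replaced by the index of its nearest codebook entry, and storage then depends only on the number of units and the bits needed per index. The codebook, the feature vectors, and the names `quantize` and `bits_for_units` are hypothetical; the paper's actual quantizer and codebook size are not given in this listing.

```python
from typing import List, Sequence


def quantize(vectors: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in vectors
    ]


def bits_for_units(num_units: int, codebook_size: int) -> int:
    """Bits needed to store `num_units` indices into a codebook of the given size."""
    # (codebook_size - 1).bit_length() is the fixed width of one index in bits.
    bits_per_unit = (codebook_size - 1).bit_length()
    return num_units * bits_per_unit
```

Storing a short sequence of small integer indices instead of raw pixel or waveform values is what makes large storage reductions of the kind reported in the abstract plausible.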

SilverRetriever: Advancing Neural Passage Retrieval for Polish Question Answering

  • paper_url: http://arxiv.org/abs/2309.08469
  • repo_url: None
  • paper_authors: Piotr Rybak, Maciej Ogrodniczuk
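  • for: This paper develops a neural passage retriever for Polish question answering, aiming to improve the accuracy and efficiency of question answering systems.
  • methods: The paper uses a neural network for the retrieval component of a question answering system and trains it on a collection of manually or weakly labeled datasets.
  • results: According to the paper, SilverRetriever performs better than other Polish models and is competitive with larger multilingual models. The authors also open-source five new retrieval datasets.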
    Abstract Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present SilverRetriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. SilverRetriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.
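    Summary Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the information needed to answer a question. Neural retrievers have recently become widely used for popular languages such as English or Chinese, but for other languages, such as Polish, few models are available. In this work, we present SilverRetriever, a neural retriever for Polish trained on manually or weakly labeled datasets. SilverRetriever achieves better results than other Polish models and is competitive with larger multilingual models. We also open-source five new passage retrieval datasets.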

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

  • paper_url: http://arxiv.org/abs/2309.08454
  • repo_url: None
  • paper_authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
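  • for: This study aims to improve automatic speech recognition (ASR) systems, particularly in scenarios with overlapped speech.
  • methods: The study uses a mixture encoder that leverages the original overlapped speech to mitigate the effect of artifacts introduced by speech separation.
  • results: Experiments show that the mixture encoder achieves state-of-the-art performance on the LibriCSS dataset and that the TF-GridNet model separates speech strongly, largely closing the gap between previous methods and oracle separation.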
    Abstract Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation of TF-GridNet which largely closes the gap between previous methods and oracle separation.
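    Summary Many real-life applications of automatic speech recognition (ASR) require processing overlapped speech. A common method first separates the speech into overlap-free streams and then runs ASR on the resulting signals. Recently, adding a mixture encoder to the ASR model has been proposed. The mixture encoder uses the original overlapped speech to mitigate the effect of artifacts introduced by speech separation. Previously, however, the method only handled two-speaker scenarios. In this work, we extend the approach to more natural meeting scenarios with an arbitrary number of speakers and dynamic overlaps. We evaluate with different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. They also demonstrate the strong separation of TF-GridNet, which largely closes the gap between previous methods and oracle separation.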

Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite

  • paper_url: http://arxiv.org/abs/2309.08448
  • repo_url: https://github.com/mtkresearch/mr-models
  • paper_authors: Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-shan Shiu
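  • for: Evaluating the capabilities of large language models is a key task in language understanding and generation.
  • methods: The authors propose a new evaluation framework that uses existing English datasets to create a set of benchmarks tailored to evaluating language models across a range of capabilities in Traditional Chinese.
  • results: The authors evaluate GPT-3.5, Taiwan-LLaMa-v1.0, and their own model, Model 7-C, on these benchmarks. The results show that their model is comparable to GPT-3.5 on part of the evaluated capabilities.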
    Abstract The evaluation of large language models is an essential task in the field of language understanding and generation. As language models continue to advance, the need for effective benchmarks to assess their performance has become imperative. In the context of Traditional Chinese, there is a scarcity of comprehensive and diverse benchmarks to evaluate the capabilities of language models, despite the existence of certain benchmarks such as DRCD, TTQA, CMDQA, and FGC dataset. To address this gap, we propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese. These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding. The proposed benchmarks offer a comprehensive evaluation framework, enabling the assessment of language models' capabilities across different tasks. In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks. The evaluation results highlight that our model, Model 7-C, achieves performance comparable to GPT-3.5 with respect to a part of the evaluated capabilities. In an effort to advance the evaluation of language models in Traditional Chinese and stimulate further research in this field, we have open-sourced our benchmark and opened the model for trial.
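    Summary Evaluating large language models is an essential task in language understanding and generation. As language models continue to advance, the need for effective benchmarks of their performance keeps growing. For Traditional Chinese, only a limited set of benchmarks exists, such as DRCD, TTQA, CMDQA, and the FGC dataset. To fill this gap, we propose a new set of benchmarks that build on existing English datasets and are tailored to evaluating a range of language model capabilities in Traditional Chinese. The benchmarks cover tasks including contextual question answering, summarization, classification, and table understanding, and together provide a comprehensive evaluation framework for assessing language models across tasks. In this paper, we evaluate GPT-3.5, Taiwan-LLaMa-v1.0, and our proprietary model, Model 7-C. The results show that Model 7-C is comparable to GPT-3.5 on part of the evaluated capabilities. To advance the evaluation of Traditional Chinese language models and stimulate further research, we open-source the benchmark and open the model for trial.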

Unleashing Potential of Evidence in Knowledge-Intensive Dialogue Generation

  • paper_url: http://arxiv.org/abs/2309.08380
  • repo_url: None
  • paper_authors: Xianjie Wu, Jian Yang, Tongliang Li, Di Liang, Shiwei Zhang, Yiyang Du, Zhoujun Li
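  • for: Improving the correctness of dialogue responses by better incorporating knowledge into dialogue generation systems.
  • methods: Uses large language models to mine reliable evidence veracity labels, then uses these labels during dialogue generation to identify reliable evidence and focus attention on it.
  • results: Experiments on MultiDoc2Dial show that evidence label augmentation and the refined attention mechanism improve model performance. Further analysis confirms that the method outperforms baselines by 3 to 5 points on coherence and factual consistency.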
    Abstract Incorporating external knowledge into dialogue generation (KIDG) is crucial for improving the correctness of response, where evidence fragments serve as knowledgeable snippets supporting the factual dialogue replies. However, introducing irrelevant content often adversely impacts reply quality and easily leads to hallucinated responses. Prior work on evidence retrieval and integration in dialogue systems falls short of fully leveraging existing evidence since the model fails to locate useful fragments accurately and overlooks hidden evidence labels within the KIDG dataset. To fully Unleash the potential of evidence, we propose a framework to effectively incorporate Evidence in knowledge-Intensive Dialogue Generation (u-EIDG). Specifically, we introduce an automatic evidence generation framework that harnesses the power of Large Language Models (LLMs) to mine reliable evidence veracity labels from unlabeled data. By utilizing these evidence labels, we train a reliable evidence indicator to effectively identify relevant evidence from retrieved passages. Furthermore, we propose an evidence-augmented generator with an evidence-focused attention mechanism, which allows the model to concentrate on evidenced segments. Experimental results on MultiDoc2Dial demonstrate the efficacy of evidential label augmentation and refined attention mechanisms in improving model performance. Further analysis confirms that the proposed method outperforms other baselines (+3~+5 points) regarding coherence and factual consistency.
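    Summary Incorporating external knowledge into dialogue generation (KIDG) can improve the correctness of responses. However, introducing irrelevant content can degrade reply quality and lead to hallucinated responses. Existing evidence retrieval and integration methods in dialogue systems do not fully use the available evidence, because the model fails to locate useful fragments accurately and overlooks evidence labels hidden in the KIDG dataset. To fully unleash the potential of evidence, we propose a framework called u-EIDG for knowledge-intensive dialogue generation. Specifically, we propose an automatic evidence generation framework that uses large language models (LLMs) to mine reliable evidence veracity labels. With these labels, we train a reliable evidence indicator to identify useful evidence in retrieved passages. In addition, we propose an evidence-augmented generator with an evidence-focused attention mechanism that lets the model concentrate on evidenced segments. Experiments show that evidence label augmentation and the refined attention mechanism improve model performance. Further analysis shows that our method outperforms other baselines (+3~+5 points) on coherence and factual consistency.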

PatFig: Generating Short and Long Captions for Patent Figures

  • paper_url: http://arxiv.org/abs/2309.08379
  • repo_url: None
  • paper_authors: Dana Aubakirova, Kim Gerdes, Lufei Liu
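  • for: This paper introduces a new large-scale patent figure dataset comprising more than 30,000 patent figures from over 11,000 European patent applications.
  • methods: For each figure, the dataset provides short and long captions, reference numerals, their corresponding terms, and the minimal claim set that describes the interactions between the components of the image.
  • results: The authors fine-tune an LVLM model on Qatent PatFig to generate short and long descriptions, and investigate the effects of incorporating various text-based cues in the patent figure captioning process.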
    Abstract This paper introduces Qatent PatFig, a novel large-scale patent figure dataset comprising 30,000+ patent figures from over 11,000 European patent applications. For each figure, this dataset provides short and long captions, reference numerals, their corresponding terms, and the minimal claim set that describes the interactions between the components of the image. To assess the usability of the dataset, we finetune an LVLM model on Qatent PatFig to generate short and long descriptions, and we investigate the effects of incorporating various text-based cues at the prediction stage of the patent figure captioning process.

DiaCorrect: Error Correction Back-end For Speaker Diarization

  • paper_url: http://arxiv.org/abs/2309.08377
  • repo_url: None
  • paper_authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky
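  • for: This paper proposes an error correction framework designed to improve the accuracy of a diarization system's output.
  • methods: The method builds on error correction techniques from automatic speech recognition. It uses two parallel convolutional encoders and a transform-based decoder, and exploits the interactions between the input recording and the initial system's outputs to automatically correct the initial speaker activities and minimize diarization errors.
  • results: Experiments on 2-speaker telephony data show that the proposed DiaCorrect effectively improves the initial model's results.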
    Abstract In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize the diarization errors. Experiments on 2-speaker telephony data show that the proposed DiaCorrect can effectively improve the initial model's results. Our source code is publicly available at https://github.com/BUTSpeechFIT/diacorrect.
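    Summary In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system. The method draws inspiration from error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect automatically corrects the initial speaker activities to minimize diarization errors. Experiments show that the proposed DiaCorrect effectively improves the initial model's results. Our source code is available at https://github.com/BUTSpeechFIT/diacorrect.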

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

  • paper_url: http://arxiv.org/abs/2309.08351
  • repo_url: https://github.com/NathanGodey/headless-lm
  • paper_authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot
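  • for: This study proposes a new self-supervised pre-training method for language models that, instead of predicting probability distributions over tokens, reconstructs input embeddings in a contrastive fashion.
  • methods: The authors propose Contrastive Weight Tying (CWT) and use it to pretrain Headless Language Models in both monolingual and multilingual contexts.
  • results: Compared with classical language models within similar compute budgets, the method yields a +1.6 GLUE score increase and a +2.7 LAMBADA accuracy improvement, with better downstream performance and data efficiency.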
    Abstract Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
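    Summary Self-supervised pre-training of language models usually consists of predicting probability distributions over token vocabularies. In this study, we propose an innovative method that instead reconstructs input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages: it reduces training compute requirements while improving downstream performance and data efficiency. Within the same compute budget, it improves the GLUE score by +1.6 and LAMBADA accuracy by +2.7.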
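
Illustration (not from the paper): the abstract says the model reconstructs input embeddings contrastively but does not give the loss. The sketch below assumes an InfoNCE-style objective with in-batch negatives, where each output vector should score highest against the input embedding of its own target token. The function name and the toy embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrastive_weight_tying_loss(
    outputs: List[Vector],
    target_ids: List[int],
    input_embeddings: Dict[int, Vector],
) -> float:
    """Mean InfoNCE-style loss with in-batch negatives (assumed form)."""
    # Targets come from the *input* embedding table, which is the weight tying.
    targets = [input_embeddings[t] for t in target_ids]
    total = 0.0
    for i, out in enumerate(outputs):
        logits = [dot(out, e) for e in targets]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(outputs)


# Toy two-token vocabulary with 2-dimensional embeddings.
emb = {0: [4.0, 0.0], 1: [0.0, 4.0]}
aligned = contrastive_weight_tying_loss([[4.0, 0.0], [0.0, 4.0]], [0, 1], emb)
swapped = contrastive_weight_tying_loss([[0.0, 4.0], [4.0, 0.0]], [0, 1], emb)
```

Outputs that point at their own target embedding give a loss near zero, while swapped outputs are penalized. Note that no projection onto the full vocabulary is computed. A batch that repeats a token would treat the repeat as a negative, which this sketch does not handle.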

Reward Engineering for Generating Semi-structured Explanation

  • paper_url: http://arxiv.org/abs/2309.08347
  • repo_url: https://github.com/jiuzhouh/reward-engineering-for-generating-seg
  • paper_authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
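  • for: This study addresses the challenge of generating structured explanations with language models, which is particularly pronounced for not-so-large language models (LMs).
  • methods: The study applies reinforcement learning (RL) with a carefully crafted reward engineering method and investigates multiple reward aggregation methods.
  • results: RL addresses structured explanation generation better than supervised fine-tuning, and the proposed reward achieves new state-of-the-art results on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE).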
    Abstract Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify model's true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs, as the reasoner is expected to couple a sequential answer with a structured explanation which embodies both the correct presentation and the correct reasoning process. In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed reward on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.
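    Summary Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. It highlights how the information available in a specific query is supplemented with information the reasoner produces from its internal weights to generate an answer. Despite recent improvements in the generative capabilities of language models, producing structured explanations to verify a model's true reasoning capabilities remains a challenge. The issue is particularly pronounced for smaller LMs, because the reasoner must couple a sequential answer with a structured explanation. In this work, we first underscore the limitations of supervised fine-tuning (SFT), and then introduce a reward engineering method in reinforcement learning (RL) to better address the problem. We investigate multiple reward aggregation methods and provide a detailed discussion that sheds light on the potential of RL for future research. Our proposed reward achieves new state-of-the-art results on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE).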

Distributional Inclusion Hypothesis and Quantifications: Probing Hypernymy in Functional Distributional Semantics

  • paper_url: http://arxiv.org/abs/2309.08325
  • repo_url: None
  • paper_authors: Chun Hei Lo, Guy Emerson
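  • for: This paper examines how Functional Distributional Semantics (FDS) models word meaning and whether such models learn hypernymy.
  • methods: FDS models are trained on corpora to test whether they learn hypernymy.
  • results: Experiments show that when a corpus strictly follows the Distributional Inclusion Hypothesis, FDS models learn hypernymy, and that they can handle simple universal quantifications.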
    Abstract Functional Distributional Semantics (FDS) models the meaning of words by truth-conditional functions. This provides a natural representation for hypernymy, but no guarantee that it is learnt when FDS models are trained on a corpus. We demonstrate that FDS models learn hypernymy when a corpus strictly follows the Distributional Inclusion Hypothesis. We further introduce a training objective that allows FDS to handle simple universal quantifications, thus enabling hypernymy learning under the reverse of DIH. Experimental results on both synthetic and real data sets confirm our hypotheses and the effectiveness of our proposed objective.
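    Summary Functional Distributional Semantics (FDS) models the meaning of words by truth-conditional functions. This provides a natural representation for hypernymy, but no guarantee that hypernymy is learnt when FDS models are trained on a corpus. We show that FDS models learn hypernymy on a corpus that strictly follows the Distributional Inclusion Hypothesis (DIH). We also introduce a training objective that allows FDS to handle simple universal quantifications, thereby enabling hypernymy learning under the reverse of the DIH. Experimental results confirm our hypotheses and the effectiveness of the proposed objective.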
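
Illustration (not from the paper): the Distributional Inclusion Hypothesis states that the contexts in which a hyponym occurs are included in those in which its hypernym occurs. The sketch below checks this on a toy corpus of (word, context) pairs using plain sets. It does not model FDS or its truth-conditional functions, and the corpus and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def contexts_by_word(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect the set of observed contexts for every word."""
    contexts: Dict[str, Set[str]] = defaultdict(set)
    for word, context in pairs:
        contexts[word].add(context)
    return dict(contexts)


def follows_dih(contexts: Dict[str, Set[str]], hyponym: str, hypernym: str) -> bool:
    """True if every context of the hyponym is also a context of the hypernym."""
    return contexts.get(hyponym, set()) <= contexts.get(hypernym, set())


# Toy corpus in which "dog" strictly follows the DIH with respect to "animal".
corpus = [
    ("dog", "barks"), ("dog", "runs"),
    ("animal", "barks"), ("animal", "runs"), ("animal", "swims"),
]
ctx = contexts_by_word(corpus)
```

Here `follows_dih(ctx, "dog", "animal")` holds, while the reversed call does not, because "animal" occurs with "swims" and "dog" never does.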

Bridging Topic, Domain, and Language Shifts: An Evaluation of Comprehensive Out-of-Distribution Scenarios

  • paper_url: http://arxiv.org/abs/2309.08316
  • repo_url: None
  • paper_authors: Andreas Waldis, Iryna Gurevych
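  • for: This study evaluates the generalization abilities of language models (LMs) in out-of-distribution scenarios covering topic, domain, and language shifts.
  • methods: Distribution shifts are simulated by deliberately withholding specific instances for testing. The authors define three metrics, propose eleven classification tasks, and compare vanilla fine-tuning, prompt-based fine-tuning, and in-context learning.
  • results: Prompt-based fine-tuning performs best overall, notably when train and test splits primarily differ semantically. In-context learning is more effective than prompt-based or vanilla fine-tuning when the label distribution of the training data differs heavily from that of the test data, which suggests that gradient-based learning biases LMs regarding such structural obstacles.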
    Abstract Language models (LMs) excel in in-distribution (ID) scenarios where train and test data are independent and identically distributed. However, their performance often degrades in real-world applications like argument mining. Such degradation happens when new topics emerge, or other text domains and languages become relevant. To assess LMs' generalization abilities in such out-of-distribution (OOD) scenarios, we simulate such distribution shifts by deliberately withholding specific instances for testing, as from the social media domain or the topic Solar Energy. Unlike prior studies focusing on specific shifts and metrics in isolation, we comprehensively analyze OOD generalization. We define three metrics to pinpoint generalization flaws and propose eleven classification tasks covering topic, domain, and language shifts. Overall, we find superior performance of prompt-based fine-tuning, notably when train and test splits primarily differ semantically. Simultaneously, in-context learning is more effective than prompt-based or vanilla fine-tuning for tasks when training data embodies heavy discrepancies in label distribution compared to testing data. This reveals a crucial drawback of gradient-based learning: it biases LMs regarding such structural obstacles.
    Summary Unlike previous studies that focused on specific shifts and metrics in isolation, we comprehensively analyze OOD generalization by defining three metrics to identify generalization flaws. We also propose eleven classification tasks covering topic, domain, and language shifts. Our results show that prompt-based fine-tuning performs better than other methods, especially when the train and test splits differ semantically. Additionally, in-context learning is more effective than prompt-based or vanilla fine-tuning for tasks with significant differences in label distribution between training and testing data. This highlights a limitation of gradient-based learning, as it can bias LMs towards such structural obstacles.

Self-Consistent Narrative Prompts on Abductive Natural Language Inference

  • paper_url: http://arxiv.org/abs/2309.08303
  • repo_url: https://github.com/hkust-knowcomp/alpha-pace
  • paper_authors: Chunkit Chan, Xin Liu, Tsz Ho Chan, Jiayang Cheng, Yangqiu Song, Ginny Wong, Simon See
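  • for: This study aims to improve self-consistency and inter-sentential coherence in the αNLI (abductive natural language inference) task.
  • methods: The study proposes a prompt tuning model, α-PACE, that takes self-consistency and inter-sentential coherence into consideration. It also proposes a general self-consistent framework that guides the pre-trained language model in understanding the narrative context of the input.
  • results: Extensive experiments and thorough ablation studies demonstrate the effectiveness of α-PACE, which improves significantly over competitive baselines.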
    Abstract Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The abductive natural language inference ($\alpha$NLI) task has been proposed, and this narrative text-based task aims to infer the most plausible hypothesis from the candidates given two observations. However, the inter-sentential coherence and the model consistency have not been well exploited in the previous works on this task. In this work, we propose a prompt tuning model $\alpha$-PACE, which takes self-consistency and inter-sentential coherence into consideration. Besides, we propose a general self-consistent framework that considers various narrative sequences (e.g., linear narrative and reverse chronology) for guiding the pre-trained language model in understanding the narrative context of input. We conduct extensive experiments and thorough ablation studies to illustrate the necessity and effectiveness of $\alpha$-PACE. The performance of our method shows significant improvement against extensive competitive baselines.
    Summary Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The $\alpha$NLI task has been proposed as a narrative text-based task that aims to select the most plausible hypothesis from the candidates. However, previous work has not fully exploited inter-sentential coherence and model consistency. In this work, we propose a prompt tuning model, $\alpha$-PACE, which takes self-consistency and inter-sentential coherence into consideration. We also propose a general self-consistent framework that considers various narrative sequences (e.g., linear narrative and reverse chronology) to help the pre-trained language model understand the narrative context of the input. We conduct extensive experiments and thorough ablation studies to demonstrate the necessity and effectiveness of $\alpha$-PACE. Our method shows significant improvement over a range of competitive baselines.

Structural Self-Supervised Objectives for Transformers

  • paper_url: http://arxiv.org/abs/2309.08272
  • repo_url: https://github.com/lucadiliello/transformers-framework
  • paper_authors: Luca Di Liello
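  • for: This thesis aims to improve the pre-training of natural language models, making them more efficient and better aligned with downstream applications.
  • methods: The first part introduces three alternatives to BERT's Masked Language Modeling (MLM) objective: Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives replace tokens instead of masking them: RTS and C-RTS predict token originality, and SLM predicts the original token values. RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM, and SLM outperforms MLM on certain tasks with the same computational budget.
  • results: The second part proposes self-supervised pre-training tasks that align structurally with downstream applications. Using large corpora such as Wikipedia and CC-News, models are trained to recognize whether text spans originate from the same paragraph or document. Continued pre-training from existing models such as RoBERTa, ELECTRA, DeBERTa, BART, and T5 yields significant improvements on Fact Verification, Answer Sentence Selection, and Summarization, especially when limited annotated data is available. The objectives also achieve state-of-the-art results on several benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, and improve summary quality.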
    Abstract This thesis focuses on improving the pre-training of natural language models using unsupervised raw data to make them more efficient and aligned with downstream applications. In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives involve token swapping instead of masking, with RTS and C-RTS aiming to predict token originality and SLM predicting the original token values. Results show that RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM. Surprisingly, SLM outperforms MLM on certain tasks despite using the same computational budget. In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications, reducing the need for labeled data. We use large corpora like Wikipedia and CC-News to train models to recognize if text spans originate from the same paragraph or document in several ways. By doing continuous pre-training, starting from existing models like RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate significant performance improvements in tasks like Fact Verification, Answer Sentence Selection, and Summarization. These improvements are especially pronounced when limited annotation data is available. The proposed objectives also achieve state-of-the-art results on various benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, as well as enhancing the quality of summaries. Importantly, these techniques can be easily integrated with other methods without altering the internal structure of Transformer models, making them versatile for various NLP applications.
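
Illustration (not from the thesis): Random Token Substitution replaces some input tokens and asks the model to predict, per position, whether the token is original. The sketch below builds such a training example with uniform replacement. The 15% default rate, the uniform sampling, and the function name are assumptions, and the cluster-based sampling of C-RTS is not shown.

```python
import random
from typing import List, Sequence, Tuple


def random_token_substitution(
    tokens: Sequence[str],
    vocab: Sequence[str],
    p: float = 0.15,  # assumed substitution rate, not taken from the source
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Return (corrupted tokens, labels); label 1 marks a substituted token.

    Assumes the vocabulary holds at least two distinct tokens.
    """
    rng = random.Random(seed)
    corrupted: List[str] = []
    labels: List[int] = []
    for tok in tokens:
        if rng.random() < p:
            # Sample a replacement that differs from the original token.
            candidates = [v for v in vocab if v != tok]
            corrupted.append(rng.choice(candidates))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
corrupted, labels = random_token_substitution(tokens, vocab, p=0.5, seed=1)
unchanged, zeros = random_token_substitution(tokens, vocab, p=0.0)
```

The labels are a binary target per token, so the classification head needs only two outputs per position rather than a distribution over the vocabulary.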

Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

  • paper_url: http://arxiv.org/abs/2309.08255
  • repo_url: None
  • paper_authors: Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai, Sebastian Cygert, Kamil Pokora, Georgi Tinchev, Ziyao Zhang, Kayoko Yanagisawa
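  • for: This study proposes a framework for cross-lingual speech synthesis that combines an upstream Voice Conversion (VC) model with a downstream Text-To-Speech (TTS) model.
  • methods: The framework has four stages. In the first two stages, a VC model converts utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with linguistic features and durations from recordings in the target language to train a single-speaker acoustic model. The last stage trains a locale-independent vocoder.
  • results: The proposed framework outperforms state-of-the-art approaches based on training a large multilingual TTS model, and is robust across model architectures, languages, speakers, and amounts of data. It is especially beneficial in low-resource settings.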
    Abstract In this work, we introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of 4 stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model. Finally, the last stage entails the training of a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches which are based on training a large multilingual TTS model. In addition, our experiments demonstrate the robustness of our approach with different model architectures, languages, speakers and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.
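    Summary In this work, we propose a cross-lingual speech synthesis framework with four stages. In the first two stages, we use a Voice Conversion (VC) model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with linguistic features and durations from recordings in the target language and used to train a single-speaker acoustic model. The last stage trains a locale-independent vocoder. Our evaluations show that the method outperforms state-of-the-art approaches based on training a large multilingual TTS model. Our experiments also demonstrate its robustness across model architectures, languages, speakers, and amounts of data. The solution is especially beneficial in low-resource settings.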

Investigating Answerability of LLMs for Long-Form Question Answering

  • paper_url: http://arxiv.org/abs/2309.08210
  • repo_url: None
  • paper_authors: Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, Semih Yavuz
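  • for: To understand the gaps between massive LLMs (e.g., ChatGPT) and smaller open-source LLMs and their distilled counterparts.
  • methods: A question-generation method based on abstractive summaries, used to test the ability of LLMs to reason and infer from long contexts.
  • results: Generating questions from abstractive summaries creates a challenging setting for LLMs and reveals performance gaps between massive LLMs and open-source LLMs, especially for longer contexts (>1024 tokens).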
    Abstract As we embark on a new era of LLMs, it becomes increasingly crucial to understand their capabilities, limitations, and differences. Toward making further progress in this direction, we strive to build a deeper understanding of the gaps between massive LLMs (e.g., ChatGPT) and smaller yet effective open-source LLMs and their distilled counterparts. To this end, we specifically focus on long-form question answering (LFQA) because it has several practical and impactful applications (e.g., troubleshooting, customer service, etc.) yet is still understudied and challenging for LLMs. We propose a question-generation method from abstractive summaries and show that generating follow-up questions from summaries of long documents can create a challenging setting for LLMs to reason and infer from long contexts. Our experimental results confirm that: (1) our proposed method of generating questions from abstractive summaries poses a challenging setup for LLMs and shows performance gaps between LLMs like ChatGPT and open-source LLMs (Alpaca, Llama); (2) open-source LLMs exhibit decreased reliance on context for generated questions from the original document, but their generation capabilities drop significantly on generated questions from summaries, especially for longer contexts (>1024 tokens).
    Summary As we enter a new era of LLMs, it becomes increasingly important to understand their capabilities, limitations, and differences. To make further progress in this area, we aim to deepen our understanding of the gaps between massive LLMs (e.g., ChatGPT) and smaller yet effective open-source LLMs and their distilled counterparts. Specifically, we focus on long-form question answering (LFQA) as it has many practical and impactful applications (e.g., troubleshooting, customer service) yet is still understudied and challenging for LLMs. We propose a question-generation method from abstractive summaries and show that generating follow-up questions from summaries of long documents can create a challenging setting for LLMs to reason and infer from long contexts. Our experimental results confirm that: (1) our proposed method of generating questions from abstractive summaries poses a challenging setup for LLMs, and shows performance gaps between LLMs like ChatGPT and open-source LLMs (Alpaca, Llama); (2) open-source LLMs exhibit decreased reliance on context for generated questions from the original document, but their generation capabilities drop significantly on generated questions from summaries, especially for longer contexts (>1024 tokens).

  • paper_url: http://arxiv.org/abs/2309.08187
  • repo_url: None
  • paper_authors: Vu Tran, Minh Le Nguyen, Satoshi Tojo, Ken Satoh
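  • for: This paper proposes a method for the legal case retrieval task.
  • methods: The method uses deep neural networks to encode documents by summarizing them into a continuous vector space. It also combines lexical features with latent features generated by neural networks to improve the retrieval system.
  • results: Experiments show that using provided summaries and encoded summarization improves retrieval performance, and that lexical features and neural latent features complement each other. The approach achieves F1 scores of 65.6% and 57.6% on the experimental datasets of legal case retrieval tasks.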
    Abstract We present our method for tackling a legal case retrieval task by introducing our method of encoding documents by summarizing them into continuous vector space via our phrase scoring framework utilizing deep neural networks. On the other hand, we explore the benefits from combining lexical features and latent features generated with neural networks. Our experiments show that lexical features and latent features generated with neural networks complement each other to improve the retrieval system performance. Furthermore, our experimental results suggest the importance of case summarization in different aspects: using provided summaries and performing encoded summarization. Our approach achieved F1 of 65.6% and 57.6% on the experimental datasets of legal case retrieval tasks.
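    Summary We present a method for the legal case retrieval task that encodes documents by summarizing them into a continuous vector space. The document encoding is implemented with deep neural networks, and we explore the benefits of combining lexical features with latent features. Our experiments show that lexical and latent features complement each other and improve the performance of the retrieval system. The results also indicate that case summarization matters in different respects: using provided summaries and performing encoded summarization. Our approach achieves F1 scores of 65.6% and 57.6% on legal case retrieval tasks.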

Multilingual Sentence-Level Semantic Search using Meta-Distillation Learning

  • paper_url: http://arxiv.org/abs/2309.08185
  • repo_url: None
  • paper_authors: Meryem M’hamdi, Jonathan May, Franck Dernoncourt, Trung Bui, Seunghyun Yoon
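  • for: This study aims to improve multilingual semantic search, which requires a better understanding of the user's intent and its contextual meaning.
  • methods: The study uses meta-distillation learning: knowledge is distilled from a Teacher model, T-MAML, to a Student model, S-MAML, to improve the Student's performance on multilingual semantic search.
  • results: Experiments show that, compared with the baseline and naive fine-tuning, meta-distillation boosts the gains provided by MAML and also generalizes better to unseen languages.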
    Abstract Multilingual semantic search is the task of retrieving relevant contents to a query expressed in different language combinations. This requires a better semantic understanding of the user's intent and its contextual meaning. Multilingual semantic search is less explored and more challenging than its monolingual or bilingual counterparts, due to the lack of multilingual parallel resources for this task and the need to circumvent "language bias". In this work, we propose an alignment approach: MAML-Align, specifically for low-resource scenarios. Our approach leverages meta-distillation learning based on MAML, an optimization-based Model-Agnostic Meta-Learner. MAML-Align distills knowledge from a Teacher meta-transfer model T-MAML, specialized in transferring from monolingual to bilingual semantic search, to a Student model S-MAML, which meta-transfers from bilingual to multilingual semantic search. To the best of our knowledge, we are the first to extend meta-distillation to a multilingual search application. Our empirical results show that on top of a strong baseline based on sentence transformers, our meta-distillation approach boosts the gains provided by MAML and significantly outperforms naive fine-tuning methods. Furthermore, multilingual meta-distillation learning improves generalization even to unseen languages.
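    Summary Multilingual semantic search is the task of retrieving content relevant to a query expressed in different language combinations. It requires a better understanding of the user's intent and its contextual meaning. Multilingual semantic search is less explored and more challenging than monolingual or bilingual search, owing to the lack of multilingual parallel resources for the task and the need to circumvent "language bias". In this work, we propose an alignment approach, MAML-Align, designed for low-resource scenarios. Our approach uses meta-distillation learning based on MAML, an optimization-based Model-Agnostic Meta-Learner. MAML-Align distills knowledge from a Teacher meta-transfer model, T-MAML, which specializes in transferring from monolingual to bilingual semantic search, to a Student model, S-MAML, which meta-transfers from bilingual to multilingual semantic search. To the best of our knowledge, we are the first to extend meta-distillation to a multilingual search application. Our experiments show that, on top of a strong baseline based on sentence transformers, meta-distillation boosts the gains provided by MAML and clearly outperforms naive fine-tuning. Multilingual meta-distillation learning also improves generalization to unseen languages.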

Large Language Models for Failure Mode Classification: An Investigation

  • paper_url: http://arxiv.org/abs/2309.08181
  • repo_url: https://github.com/nlp-tlp/chatgpt-fmc
  • paper_authors: Michael Stewart, Melinda Hodkiewicz, Sirui Li
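  • for: This study evaluates the effectiveness of Large Language Models (LLMs) for the Failure Mode Classification (FMC) task.
  • methods: The authors use prompt engineering to enable an LLM (GPT-3.5) to predict the failure mode of a given observation using a restricted code list.
  • results: A GPT-3.5 model fine-tuned on annotated data (F1=0.80) outperforms both a currently available text classification model trained on the same data (F1=0.60) and the out-of-the-box GPT-3.5 (F1=0.46), which underlines the need for high-quality fine-tuning data sets.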
    Abstract In this paper we present the first investigation into the effectiveness of Large Language Models (LLMs) for Failure Mode Classification (FMC). FMC, the task of automatically labelling an observation with a corresponding failure mode code, is a critical task in the maintenance domain as it reduces the need for reliability engineers to spend their time manually analysing work orders. We detail our approach to prompt engineering to enable an LLM to predict the failure mode of a given observation using a restricted code list. We demonstrate that the performance of a GPT-3.5 model (F1=0.80) fine-tuned on annotated data is a significant improvement over a currently available text classification model (F1=0.60) trained on the same annotated data set. The fine-tuned model also outperforms the out-of-the box GPT-3.5 (F1=0.46). This investigation reinforces the need for high quality fine-tuning data sets for domain-specific tasks using LLMs.
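    Summary In this paper, we present an investigation into the effectiveness of Large Language Models (LLMs) for Failure Mode Classification (FMC). FMC is an important task in the maintenance domain because it reduces the time reliability engineers spend manually analysing work orders. We detail our approach to prompt engineering, which enables an LLM to predict the failure mode of a given observation using a restricted code list. We show that a GPT-3.5 model fine-tuned on annotated data (F1=0.80) is a significant improvement over a currently available text classification model (F1=0.60) trained on the same annotated data set. The fine-tuned model also outperforms the out-of-the-box GPT-3.5 (F1=0.46). This investigation reinforces the need for high-quality fine-tuning data sets for domain-specific tasks using LLMs.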

  • paper_url: http://arxiv.org/abs/2309.08173
  • repo_url: https://github.com/yuelinan/fedjudge
  • paper_authors: Linan Yue, Qi Liu, Yichao Du, Weibo Gao, Ye Liu, Fangzhou Yao
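  • for: This paper addresses the data privacy problem of large language models in the Legal Intelligence domain by integrating Legal LLMs with Federated Learning.
  • methods: The paper proposes a framework named FedJudge, which fine-tunes Legal LLMs with parameter-efficient methods and trains them locally under Federated Learning to preserve data privacy.
  • results: Experiments show that FedJudge trains Legal LLMs effectively and copes with distribution shifts in the data.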
    Abstract Large Language Models (LLMs) have gained prominence in the field of Legal Intelligence, offering potential applications in assisting legal professionals and laymen. However, the centralized training of these Legal LLMs raises data privacy concerns, as legal data is distributed among various institutions containing sensitive individual information. This paper addresses this challenge by exploring the integration of Legal LLMs with Federated Learning (FL) methodologies. By employing FL, Legal LLMs can be fine-tuned locally on devices or clients, and their parameters are aggregated and distributed on a central server, ensuring data privacy without directly sharing raw data. However, computation and communication overheads hinder the full fine-tuning of LLMs under the FL setting. Moreover, the distribution shift of legal data reduces the effectiveness of FL methods. To this end, in this paper, we propose the first Federated Legal Large Language Model (FedJudge) framework, which fine-tunes Legal LLMs efficiently and effectively. Specifically, FedJudge utilizes parameter-efficient fine-tuning methods to update only a few additional parameters during the FL training. Besides, we explore the continual learning methods to preserve the global model's important parameters when training local clients to mitigate the problem of data shifts. Extensive experimental results on three real-world datasets clearly validate the effectiveness of FedJudge. Code is released at https://github.com/yuelinan/FedJudge.
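    Summary Large Language Models (LLMs) have attracted wide attention in the field of Legal Intelligence, where they can assist legal professionals and laymen. However, centralized training of these Legal LLMs raises data privacy concerns, because legal data is distributed among various institutions and contains sensitive individual information. This paper addresses the challenge by exploring the integration of Legal LLMs with Federated Learning (FL). With FL, Legal LLMs can be fine-tuned locally on devices or clients, and their parameters are aggregated on a central server, preserving data privacy without directly sharing raw data. However, computation and communication overheads hinder the full fine-tuning of LLMs in the FL setting, and the distribution shift of legal data reduces the effectiveness of FL methods. We therefore propose the first Federated Legal Large Language Model (FedJudge) framework, which fine-tunes Legal LLMs efficiently. Specifically, FedJudge uses parameter-efficient fine-tuning methods to update only a few additional parameters during FL training. We also explore continual learning methods that preserve the important parameters of the global model, to mitigate the problem of data shifts. Experimental results on three real-world datasets validate the effectiveness of FedJudge. Code is available at https://github.com/yuelinan/FedJudge.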
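
Illustration (not from the paper): the abstract says that only a few additional parameters are updated on each client and then aggregated on a central server, without naming the aggregation rule. The sketch below assumes FedAvg-style averaging weighted by client dataset size. The parameter name `adapter.weight`, the client names, and the function are hypothetical, and the continual learning component is not shown.

```python
from typing import Dict, List

# A client's trainable adapter: parameter name -> flat list of values.
Adapter = Dict[str, List[float]]


def aggregate_adapters(client_adapters: List[Adapter], client_sizes: List[int]) -> Adapter:
    """Average adapter parameters, weighting each client by its number of samples."""
    total = sum(client_sizes)
    merged: Adapter = {}
    for name in client_adapters[0]:
        width = len(client_adapters[0][name])
        merged[name] = [
            sum(adapter[name][i] * n for adapter, n in zip(client_adapters, client_sizes)) / total
            for i in range(width)
        ]
    return merged


# Two hypothetical clients; the frozen base LLM weights never leave the client
# and are not part of the exchange.
court_a = {"adapter.weight": [1.0, 2.0]}
court_b = {"adapter.weight": [3.0, 6.0]}
global_adapter = aggregate_adapters([court_a, court_b], [1, 3])
```

Because only the adapter values are exchanged, the communication cost per round scales with the adapter size rather than with the full model.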

LASER: LLM Agent with State-Space Exploration for Web Navigation

  • paper_url: http://arxiv.org/abs/2309.08172
  • repo_url: None
  • paper_authors: Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Dong Yu
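  • for: This paper addresses the limitations of large language models on interactive decision-making tasks such as web navigation.
  • methods: The paper models the interactive task as state-space exploration: the LLM agent transitions among a pre-defined set of states by performing actions to complete the task.
  • results: Experiments show that the method significantly outperforms previous methods on the web navigation task and closes the gap with human performance.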
    Abstract Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation. While achieving decent performance, previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples to teach the model how to reason in the interactive environment. Consequently, the model could not handle more challenging scenarios not covered in the in-context examples, e.g., mistakes, leading to sub-optimal performance. To address this issue, we propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task. This formulation enables flexible back-tracking, allowing the model to easily recover from errors. We evaluate our proposed LLM Agent with State-Space ExploRation (LASER) on the WebShop task. Experimental results show that our LASER agent significantly outperforms previous methods and closes the gap with human performance on the web navigation task.
    Summary Large language models (LLMs) have been successfully adapted to interactive decision-making tasks such as web navigation. Although previous methods achieve decent performance, they implicitly assume a forward-only execution mode, providing only oracle trajectories as in-context examples to teach the model how to reason in the interactive environment. As a result, the model cannot handle more challenging scenarios not covered by those examples, such as mistakes, which leads to sub-optimal performance. To address this issue, we propose modeling the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task. This formulation enables flexible back-tracking, allowing the model to easily recover from errors. We evaluate our proposed LLM Agent with State-Space ExploRation (LASER) on the WebShop task. Experimental results show that our LASER agent significantly outperforms previous methods and closes the gap with human performance on the web navigation task.

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

  • paper_url: http://arxiv.org/abs/2309.08168
  • repo_url: None
  • paper_authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra
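  • for: To accelerate inference of Large Language Models (LLMs) without an auxiliary model.
  • methods: A new inference scheme, self-speculative decoding, which quickly generates draft tokens by selectively skipping certain intermediate layers and then verifies them with the original LLM.
  • results: Tests on LLaMA-2 and its fine-tuned models show a speedup of up to 1.73 times.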
    Abstract We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens at a slightly lower quality but more quickly, which is achieved by selectively skipping certain intermediate layers during drafting. Subsequently, the verification stage employs the original LLM to validate those draft output tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM, thereby maintaining output quality. The proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a speedup up to 1.73$\times$.
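    Summary We propose a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without an auxiliary model. The method has two stages: drafting and verification. In the drafting stage, certain intermediate layers are selectively skipped to generate draft tokens more quickly, at slightly lower quality. The verification stage then uses the original LLM to validate those draft tokens in one forward pass. This ensures that the final output is identical to that of the original LLM, and the method requires no additional neural network training and no extra memory. Benchmarks with LLaMA-2 and its fine-tuned models show a speedup of up to 1.73$\times$.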
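
Illustration (not from the paper): the draft-then-verify loop can be sketched with a toy deterministic "model" whose layers are simple integer functions. Drafting skips some layers; verification recomputes each position with all layers and keeps the draft prefix that matches, plus one corrected token. The toy layers, the greedy decoding, and all names are assumptions. A real implementation would verify all draft tokens in a single batched forward pass, which the inner loop here only imitates.

```python
from typing import Callable, FrozenSet, List, Sequence

Layer = Callable[[int], int]


def make_layers() -> List[Layer]:
    # Hypothetical toy layers: each maps an integer hidden state to another.
    return [
        lambda h: (h * 3 + 1) % 97,
        lambda h: (h + 7) % 97,
        lambda h: (h * 5) % 97,
        lambda h: (h + 11) % 97,
    ]


def next_token(context: Sequence[int], layers: List[Layer],
               skip: FrozenSet[int] = frozenset(), vocab: int = 10) -> int:
    """Greedy next token; layers whose index is in `skip` are bypassed."""
    h = sum((i + 1) * t for i, t in enumerate(context)) % 97
    for idx, layer in enumerate(layers):
        if idx not in skip:
            h = layer(h)
    return h % vocab


def greedy_decode(prompt: Sequence[int], layers: List[Layer], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out, layers))
    return out


def self_speculative_decode(prompt: Sequence[int], layers: List[Layer],
                            skip: FrozenSet[int], n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    target = len(out) + n_new
    while len(out) < target:
        # Draft stage: up to k cheap tokens from the layer-skipping model.
        draft: List[int] = []
        for _ in range(min(k, target - len(out))):
            draft.append(next_token(out + draft, layers, skip))
        # Verify stage: the full model checks the draft position by position.
        accepted: List[int] = []
        for tok in draft:
            correct = next_token(out + accepted, layers)
            accepted.append(correct)  # keep the full model's token
            if correct != tok:
                break  # discard the rest of the draft after the first mismatch
        out.extend(accepted)
    return out


layers = make_layers()
reference = greedy_decode([1, 2, 3], layers, 12)
fast = self_speculative_decode([1, 2, 3], layers, frozenset({1, 2}), 12)
```

Because every accepted token is the full model's own choice, the output matches plain greedy decoding regardless of which layers the draft skips. The skip set only affects how many draft tokens survive verification, and therefore the speedup.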

RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue

  • paper_url: http://arxiv.org/abs/2309.08156
  • repo_url: None
  • paper_authors: Zhengliang Shi, Weiwei Sun, Shuo Zhang, Zhen Zhang, Pengjie Ren, Zhaochun Ren
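  • for: Automatic evaluation of open-domain dialogue systems, addressing the one-to-many problem (many different responses can be appropriate).
  • methods: Proposes Reference-Assisted Dialogue Evaluation (RADE), which uses a pre-created utterance as the reference instead of only the gold response. RADE explicitly compares the reference with the candidate response to predict their overall scores, and an auxiliary response generation task improves prediction through a shared encoder.
  • results: In experiments on three datasets extended by the authors and two existing benchmarks, Pearson, Spearman, and Kendall correlations with human evaluation exceed those of state-of-the-art baselines.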
    Abstract Evaluating open-domain dialogue systems is challenging for reasons such as the one-to-many problem, i.e., many appropriate responses other than just the golden response. As of now, automatic evaluation methods need better consistency with humans, while reliable human evaluation can be time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework, which leverages the pre-created utterance as reference other than the gold response to relieve the one-to-many problem. Specifically, RADE explicitly compares reference and the candidate response to predict their overall scores. Moreover, an auxiliary response generation task enhances prediction via a shared encoder. To support RADE, we extend three datasets with additional rated responses other than just a golden response by human annotation. Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method, where Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.
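    Summary Evaluating open-domain dialogue systems is hard, partly because of the one-to-many problem: many responses are appropriate, not just the golden one. Automatic metrics still need better agreement with humans, while reliable human evaluation is time-consuming and costly. The authors propose Reference-Assisted Dialogue Evaluation (RADE), which uses a pre-created utterance as the reference rather than only the gold response. RADE explicitly compares the reference with the candidate response to predict their overall scores, and an auxiliary response generation task strengthens the prediction through a shared encoder. To support RADE, three datasets are extended with additional human-rated responses. On these three datasets and two existing benchmarks, Pearson, Spearman, and Kendall correlations with human evaluation exceed those of state-of-the-art baselines.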
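
As a small illustration of how agreement between an automatic metric and human ratings is measured, the sketch below computes the three correlation coefficients named above in plain Python. The score lists are made-up example values, not data from the paper, and the Kendall variant shown is tau-a (ties count as neither concordant nor discordant), which is an assumption about the variant.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Made-up example: metric scores and human ratings for five responses.
metric_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
human_ratings = [1, 2, 3, 4, 5]

print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
print(kendall_tau_a(metric_scores, human_ratings))
```

Pearson measures linear agreement on the raw scores, while Spearman and Kendall depend only on the ordering, which is why evaluation papers commonly report all three.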

Unimodal Aggregation for CTC-based Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.08150
  • repo_url: https://github.com/Audio-WestlakeU/UMA-ASR
  • paper_authors: Ying Fang, Xiaofei Li
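  • for: Non-autoregressive automatic speech recognition, aiming to learn better feature representations for text tokens so as to lower both recognition error and computational complexity.
  • methods: An encoder produces frame-wise features and weights; frames with unimodal weights are aggregated and then processed further by a decoder. The model is trained with the CTC loss.
  • results: Experiments on three Mandarin datasets show performance superior or comparable to other advanced non-autoregressive methods, together with lower recognition error and computational complexity than regular CTC. Integrating self-conditioned CTC into the framework improves performance further.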
    Abstract This paper works on non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. Then, the feature frames with unimodal weights are integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Compared to the regular CTC, the proposed method learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity. Experiments on three Mandarin datasets show that UMA demonstrates superior or comparable performance to other advanced non-autoregressive methods, such as self-conditioned CTC. Moreover, by integrating self-conditioned CTC into the proposed framework, the performance can be further noticeably improved.
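    Summary This paper addresses non-autoregressive automatic speech recognition. It proposes unimodal aggregation (UMA), which segments and integrates the feature frames that belong to the same text token and thereby learns better token-level representations. Frame-wise features and weights both come from an encoder; frames with unimodal weights are integrated and then processed by a decoder, and training uses the connectionist temporal classification (CTC) loss. Compared with regular CTC, the method learns better representations and shortens the sequence, which lowers recognition error and computational complexity. On three Mandarin datasets, UMA performs better than or comparably to other advanced non-autoregressive methods such as self-conditioned CTC, and integrating self-conditioned CTC into the framework improves results further.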
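
A rough sketch of the aggregation step follows. The abstract does not spell out how segment boundaries are found, so the rule used here is an assumption: a new segment starts at each valley of the weight sequence (a frame whose weight is lower than its predecessor and no higher than its successor), and frames within a segment are combined by a weight-normalised sum. The function names and the example numbers are hypothetical.

```python
from typing import List, Sequence


def find_valleys(weights: Sequence[float]) -> List[int]:
    """Indices where the weight stops decreasing (assumed segment starts)."""
    return [
        t
        for t in range(1, len(weights) - 1)
        if weights[t - 1] > weights[t] <= weights[t + 1]
    ]


def unimodal_aggregate(
    features: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[List[float]]:
    """Merge each run of frames between valleys into one weighted-average frame."""
    starts = [0] + find_valleys(weights)
    ends = starts[1:] + [len(weights)]
    aggregated = []
    for start, end in zip(starts, ends):
        total = sum(weights[start:end])  # assumes positive weights
        dim = len(features[start])
        aggregated.append(
            [
                sum(weights[t] * features[t][d] for t in range(start, end)) / total
                for d in range(dim)
            ]
        )
    return aggregated


# Six frames with 1-D features; the weights rise and fall twice (two tokens).
frame_features = [[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]]
frame_weights = [0.1, 0.9, 0.2, 0.1, 0.8, 0.3]

token_features = unimodal_aggregate(frame_features, frame_weights)
print(len(frame_features), "->", len(token_features))
```

The shorter output sequence is what reduces downstream computation: the decoder processes roughly one vector per token instead of one per frame.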

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

  • paper_url: http://arxiv.org/abs/2309.08140
  • repo_url: None
  • paper_authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana
  • for: This paper is written for researchers and developers interested in text-to-speech (TTS) synthesis, particularly those looking to control speaker identity using natural language descriptions.
  • methods: The paper proposes a prompt-based TTS synthesis system called PromptTTS++, which utilizes a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. The system also introduces the concept of speaker prompts, which describe voice characteristics such as gender-neutral, young, old, and muffled.
  • results: The subjective evaluation results show that the proposed method can better control speaker characteristics than previous methods without the speaker prompt. The authors also provide audio samples to demonstrate the effectiveness of their approach.
    Abstract We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.
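    Summary PromptTTS++ is a prompt-based text-to-speech (TTS) system that controls speaker identity through natural language descriptions. It introduces the speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, muffled) designed to be approximately independent of speaking style. Because no large-scale dataset with speaker prompts exists, the authors build one from the LibriTTS-R corpus with manually annotated speaker prompts. A diffusion-based acoustic model with mixture density networks then models the diverse speaker factors in the training data. Whereas earlier work relies on style prompts that capture only limited aspects of speaker individuality, such as pitch, speaking speed, and energy, this method adds a speaker prompt to learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Subjective evaluation shows better control of speaker characteristics than methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.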

Audio Difference Learning for Audio Captioning

  • paper_url: http://arxiv.org/abs/2309.08141
  • repo_url: None
  • paper_authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda
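  • for: Proposes a new training paradigm, audio difference learning, to improve audio captioning.
  • methods: The method builds a feature representation space that preserves the relationship between audio clips, so that captions can describe detailed audio information. A reference audio and the input audio are both converted to feature representations by a shared encoder, and captions are generated from the difference of these features. In addition, the input audio is mixed with additional audio that also serves as the reference, so that the difference between the mixed audio and the reference reverts to the original input audio.
  • results: In experiments on the Clotho and ESC50 datasets, the proposed method improves the SPIDEr score by 7% over conventional methods.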
    Abstract This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, both of which are transformed into feature representations via a shared encoder. Captions are then generated from these differential features to describe their differences. Furthermore, a unique technique is proposed that involves mixing the input audio with additional audio, and using the additional audio as a reference. This results in the difference between the mixed audio and the reference audio reverting back to the original input audio. This allows the original input's caption to be used as the caption for their difference, eliminating the need for additional annotations for the differences. In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.
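    Summary This study introduces a new training paradigm, audio difference learning, to improve audio captioning. The core idea is to create a feature representation space that preserves the relationship between audio clips, enabling captions that capture fine-grained audio detail. A reference audio and the input audio are both transformed into feature representations by a shared encoder, and captions are generated from the differential features. The study also proposes mixing the input audio with additional audio and using that additional audio as the reference. The difference between the mixed audio and the reference then reverts to the original input audio, so the original caption can serve as the caption of the difference and no extra annotation is required. In experiments on the Clotho and ESC50 datasets, the method improves the SPIDEr score by 7% over conventional methods.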
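
The mixing trick can be illustrated with a toy example. The encoder below is a fixed linear map with hypothetical weights, chosen so that the identity "features(mixed) minus features(reference) equals features(input)" holds exactly; with a learned, nonlinear encoder this would only hold approximately, and encouraging it is presumably part of what training does. The waveforms and the caption are made-up values.

```python
from typing import List, Sequence, Tuple

# Hypothetical fixed weights of a toy linear "shared encoder".
ENCODER_WEIGHTS = [
    [0.5, -1.0, 0.25, 0.0],
    [1.0, 0.5, 0.0, -0.5],
]


def encode(signal: Sequence[float]) -> List[float]:
    """Map a 4-sample 'waveform' to a 2-D feature vector (linear, for illustration)."""
    return [sum(w * s for w, s in zip(row, signal)) for row in ENCODER_WEIGHTS]


def make_difference_example(
    input_audio: Sequence[float], extra_audio: Sequence[float], caption: str
) -> Tuple[List[float], str]:
    """Build one training pair without annotating the difference itself."""
    mixed = [a + b for a, b in zip(input_audio, extra_audio)]
    # The extra audio doubles as the reference; subtracting its features
    # from the mixture's features should recover the input's features.
    difference = [m - r for m, r in zip(encode(mixed), encode(extra_audio))]
    return difference, caption


input_audio = [0.2, -0.1, 0.4, 0.3]
extra_audio = [1.0, 0.5, -0.5, 0.0]

difference_features, target_caption = make_difference_example(
    input_audio, extra_audio, "a dog barks twice"
)
print(difference_features, encode(input_audio), target_caption)
```

Since the target caption is simply the one already attached to the input clip, the scheme creates supervised difference examples from existing captioning data.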

Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection

  • paper_url: http://arxiv.org/abs/2309.08099
  • repo_url: https://github.com/zhu00121/universal-representation-dynamics-of-deepfake-speech
  • paper_authors: Yi Zhu, Saurabh Powar, Tiago H. Falk
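  • for: Improving the generalizability of deepfake speech detection systems to attacks not seen during training.
  • methods: Proposes a new method for assessing the long-term temporal dynamics of universal speech representations.
  • results: Experiments on the ASVspoof 2019 and 2021 datasets show that the method improves detection of deepfakes from generative methods unseen during training, with significant gains over several benchmark methods.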
    Abstract Existing deepfake speech detection systems lack generalizability to unseen attacks (i.e., samples generated by generative algorithms not seen during training). Recent studies have explored the use of universal speech representations to tackle this issue and have obtained inspiring results. These works, however, have focused on innovating downstream classifiers while leaving the representation itself untouched. In this study, we argue that characterizing the long-term temporal dynamics of these representations is crucial for generalizability and propose a new method to assess representation dynamics. Indeed, we show that different generative models generate similar representation dynamics patterns with our proposed method. Experiments on the ASVspoof 2019 and 2021 datasets validate the benefits of the proposed method to detect deepfakes from methods unseen during training, significantly improving on several benchmark methods.
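    Summary Existing deepfake speech detection systems generalize poorly to unseen attacks, i.e., samples produced by generative algorithms not encountered during training. Recent studies have used universal speech representations to tackle this issue with encouraging results, but they focus on new downstream classifiers and leave the representation itself untouched. This study argues that characterizing the long-term temporal dynamics of these representations is crucial for generalizability and proposes a new method to assess them. With this method, different generative models are shown to produce similar representation dynamics patterns. Experiments on the ASVspoof 2019 and 2021 datasets confirm that the approach helps detect deepfakes from methods unseen during training and significantly improves on several benchmark methods.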