cs.CL - 2023-09-14

Connecting the Dots in News Analysis: A Cross-Disciplinary Survey of Media Bias and Framing

  • paper_url: http://arxiv.org/abs/2309.08069
  • repo_url: None
  • paper_authors: Gisela Vallejo, Timothy Baldwin, Lea Frermann
  • for: This survey examines how bias manifests in news reporting and how biased news affects society.
  • methods: It reviews social science approaches and compares them with the task formulations, methods, and evaluation metrics typically used for media bias analysis in NLP.
  • results: Currently dominant NLP methods fall short of the complex questions and effects studied in theoretical media studies; closing the gap calls for model transparency, document-external information, and cross-document reasoning rather than single-label assignment.
    Abstract The manifestation and effect of bias in news reporting have been central topics in the social sciences for decades, and have received increasing attention in the NLP community recently. While NLP can help to scale up analyses or contribute automatic procedures to investigate the impact of biased news in society, we argue that methodologies that are currently dominant fall short of addressing the complex questions and effects addressed in theoretical media studies. In this survey paper, we review social science approaches and draw a comparison with typical task formulations, methods, and evaluation metrics used in the analysis of media bias in NLP. We discuss open questions and suggest possible directions to close identified gaps between theory and predictive models, and their evaluation. These include model transparency, considering document-external information, and cross-document reasoning rather than single-label assignment.

Investigating Gender Bias in News Summarization

  • paper_url: http://arxiv.org/abs/2309.08047
  • repo_url: None
  • paper_authors: Julius Steen, Katja Markert
  • for: To investigate harmful social biases in large language models (LLMs) and their impact on summarization models.
  • methods: The paper introduces several definitions of biased behavior in summarization models, and proposes generating input documents with carefully controlled demographic attributes to sidestep biases inherent in the input document.
  • results: Content selection in single-document summarization is largely unaffected by bias, while hallucinations show evidence of biases propagating into generated summaries.
    Abstract Summarization is an important application of large language models (LLMs). Most previous evaluation of summarization models has focused on their performance in content selection, grammaticality and coherence. However, it is well known that LLMs reproduce and reinforce harmful social biases. This raises the question: Do these biases affect model outputs in a relatively constrained setting like summarization? To help answer this question, we first motivate and introduce a number of definitions for biased behaviours in summarization models, along with practical measures to quantify them. Since we find biases inherent to the input document can confound our analysis, we additionally propose a method to generate input documents with carefully controlled demographic attributes. This allows us to sidestep this issue, while still working with somewhat realistic input documents. Finally, we apply our measures to summaries generated by both purpose-built summarization models and general purpose chat models. We find that content selection in single document summarization seems to be largely unaffected by bias, while hallucinations exhibit evidence of biases propagating to generated summaries.
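To make the controlled-generation idea concrete, here is a minimal sketch of building matched input documents that differ only in a demographic attribute, so any difference in downstream summaries can be attributed to the model rather than the input. The name pools and template are illustrative placeholders, not the paper's actual data or procedure.

```python
# Minimal sketch: matched input documents that differ only in the
# subject's gender. Name pools and the template are illustrative
# placeholders, not the paper's actual data.
NAMES = {
    "female": ["Mary Johnson", "Aisha Khan"],
    "male": ["James Miller", "Omar Khan"],
}
PRONOUNS = {
    "female": {"subj": "she", "poss": "her"},
    "male": {"subj": "he", "poss": "his"},
}

TEMPLATE = (
    "{name} announced a new research initiative on Tuesday. "
    "{Subj} said {poss} team would publish results next year."
)

def make_documents():
    """Yield (gender, document) pairs with identical content apart from
    the demographic attribute under study."""
    for gender, names in NAMES.items():
        p = PRONOUNS[gender]
        for name in names:
            yield gender, TEMPLATE.format(
                name=name, Subj=p["subj"].capitalize(), poss=p["poss"]
            )

if __name__ == "__main__":
    for gender, doc in make_documents():
        print(gender, "->", doc)
```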

AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.08030
  • repo_url: None
  • paper_authors: Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu
  • for: To advance audio-visual speech enhancement (AVSE), whose development is hampered by the lack of clean ground-truth speech in real-world audio-visual training data.
  • methods: A diffusion-based re-synthesis approach: a neural quality estimator selects a nearly clean subset of an audio-visual corpus, and a diffusion model trained on this subset generates waveforms conditioned on continuous speech representations from AV-HuBERT.
  • results: Continuous (rather than discrete) representations retain prosody and speaker information; the approach outperforms a masking-based baseline on automatic metrics and a human listening test, and fine-tuning the diffusion model on clean/noisy pairs further improves performance.
    Abstract Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve the performance. Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.

DiariST: Streaming Speech Translation with Speaker Diarization

  • paper_url: http://arxiv.org/abs/2309.08007
  • repo_url: https://github.com/mu-y/diarist
  • paper_authors: Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka
  • for: To propose the first streaming solution that jointly performs speech translation (ST) and speaker diarization (SD).
  • methods: The system builds on a neural-transducer-based streaming ST model and integrates token-level serialized output training and t-vectors, originally developed for multi-talker speech recognition, to improve streaming ST and SD.
  • results: A new evaluation set, DiariST-AliMeeting, is built from the AliMeeting corpus, and new metrics, speaker-agnostic BLEU and speaker-attributed BLEU, measure ST quality while accounting for SD accuracy; the system performs streaming inference on overlapping speech and holds up well against offline systems based on Whisper.
    Abstract End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate the research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.
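As a rough intuition for the two proposed metrics (the paper gives the exact definitions), the sketch below scores the concatenated text while ignoring speakers (speaker-agnostic) versus scoring per-speaker concatenations so misattributed words are penalized (speaker-attributed). It assumes sacrebleu is installed and that hypothesis speakers have already been mapped to reference speakers; this is an illustration, not the released evaluation code.

```python
# Rough sketch of the intuition behind speaker-agnostic vs.
# speaker-attributed BLEU (not the paper's exact definitions).
# Assumes `pip install sacrebleu` and hypothesis speaker labels
# already mapped to reference speakers.
import sacrebleu

ref = [("A", "hello how are you"), ("B", "i am fine thanks")]
hyp = [("A", "hello how are you"), ("A", "i am fine thanks")]  # B misattributed to A

def speaker_agnostic_bleu(ref, hyp):
    # Ignore speaker labels entirely: compare concatenated text.
    r = " ".join(t for _, t in ref)
    h = " ".join(t for _, t in hyp)
    return sacrebleu.corpus_bleu([h], [[r]]).score

def speaker_attributed_bleu(ref, hyp):
    # Compare per-speaker concatenations, so misattributed words hurt.
    speakers = sorted({s for s, _ in ref})
    refs = [" ".join(t for s, t in ref if s == spk) for spk in speakers]
    hyps = [" ".join(t for s, t in hyp if s == spk) for spk in speakers]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

print(speaker_agnostic_bleu(ref, hyp))   # high: all words were translated
print(speaker_attributed_bleu(ref, hyp)) # lower: speaker B's words are lost
```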

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

  • paper_url: http://arxiv.org/abs/2309.07998
  • repo_url: None
  • paper_authors: Sarah E. Finch, James D. Finch, Jinho D. Choi
  • for: To examine how the choice of human evaluator group affects the evaluation of chat-oriented dialogue systems.
  • methods: Four state-of-the-art dialogue systems are evaluated by four distinct evaluator groups, and the impact of the evaluator group is analyzed.
  • results: Likert evaluations are largely robust to the choice of evaluator group, unlike Pairwise evaluations, with only minor differences across groups; two limitations to this robustness emerge, namely discrepancies between evaluators with different levels of chatbot expertise and the benefit of evaluator objectivity for certain dialogue metrics.
    Abstract Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems. However, there is a significant variation in previous work regarding who gets recruited as evaluators. Evaluator groups such as domain experts, university students, and professional annotators have been used to assess and compare dialogue systems, although it is unclear to what extent the choice of an evaluator group can affect results. This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups. Our analysis reveals a robustness towards evaluator groups for Likert evaluations that is not seen for Pairwise, with only minor differences observed when changing evaluator groups. Furthermore, two notable limitations to this robustness are observed, which reveal discrepancies between evaluators with different levels of chatbot expertise and indicate that evaluator objectivity is beneficial for certain dialogue metrics.

Leveraging Contextual Information for Effective Entity Salience Detection

  • paper_url: http://arxiv.org/abs/2309.07990
  • repo_url: None
  • paper_authors: Rajarshi Bhowmik, Marco Ponza, Atharva Tendle, Anant Gupta, Rebecca Jiang, Xingyu Lu, Qian Zhao, Daniel Preotiuc-Pietro
  • for: To detect salient entities in documents, benefiting downstream applications such as search, ranking, and entity-centric summarization.
  • methods: Medium-sized pre-trained language models are fine-tuned with a cross-encoder style architecture.
  • results: Fine-tuning yields substantial gains over feature-engineering approaches, while zero-shot prompting of instruction-tuned language models performs worse, underscoring the task's uniqueness and complexity.
    Abstract In text documents such as news articles, the content and key events usually revolve around a subset of all the entities mentioned in a document. These entities, often deemed as salient entities, provide useful cues of the aboutness of a document to a reader. Identifying the salience of entities was found helpful in several downstream applications such as search, ranking, and entity-centric summarization, among others. Prior work on salient entity detection mainly focused on machine learning models that require heavy feature engineering. We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches. To this end, we conduct a comprehensive benchmarking of four publicly available datasets using models representative of the medium-sized pre-trained language model family. Additionally, we show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
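The abstract does not spell out the cross-encoder input format; a common recipe, sketched below with Hugging Face transformers, scores an (entity, document) pair jointly through a single encoder. The checkpoint name and pairing scheme are assumptions for illustration.

```python
# Sketch of a cross-encoder for entity salience: the entity mention and
# the document are encoded jointly so attention can relate them directly.
# The checkpoint and input format are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-base"  # stand-in for a medium-sized pre-trained LM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def salience_prob(entity: str, document: str) -> float:
    # Pair encoding: tokenizer inserts the separator between the segments.
    inputs = tokenizer(entity, document, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(salient)

# Before fine-tuning, the classification head is randomly initialized,
# so this score is arbitrary; it becomes meaningful after training.
print(salience_prob("Acme Corp",
                    "Acme Corp announced record earnings on Monday..."))
```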

Ambiguity-Aware In-Context Learning with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.07900
  • repo_url: None
  • paper_authors: Lingyu Gao, Aditi Chaudhary, Krishna Srinivasan, Kazuma Hashimoto, Karthik Raman, Michael Bendersky
  • for: To improve the downstream performance of large language models (LLMs) through in-context learning with a few task-specific demonstrations, where selecting good demonstrations is the crucial question.
  • methods: Beyond using a text retriever to pick demonstrations semantically similar to the test input, the selection strategy also considers the LLM's existing knowledge about the task, particularly its output label space.
  • results: Extensive experiments on three text classification tasks show that choosing demonstrations that are both semantically similar and helpful for resolving the inherent label ambiguity around the test example yields the largest performance gains.
    Abstract In-context learning (ICL) i.e. showing LLMs only a few task-specific demonstrations has led to downstream gains with no task-specific fine-tuning required. However, LLMs are sensitive to the choice of prompts, and therefore a crucial research question is how to select good demonstrations for ICL. One effective strategy is leveraging semantic similarity between the ICL demonstrations and test inputs by using a text retriever, which however is sub-optimal as that does not consider the LLM's existing knowledge about that task. From prior work (Min et al., 2022), we already know that labels paired with the demonstrations bias the model predictions. This leads us to our hypothesis whether considering LLM's existing knowledge about the task, especially with respect to the output label space can help in a better demonstration selection strategy. Through extensive experimentation on three text classification tasks, we find that it is beneficial to not only choose semantically similar ICL demonstrations but also to choose those demonstrations that help resolve the inherent label ambiguity surrounding the test example. Interestingly, we find that including demonstrations that the LLM previously mis-classified and also fall on the test example's decision boundary, brings the most performance gain.
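A toy sketch of the selection idea: retrieve semantically similar demonstrations, restrict to those whose gold label falls in the test example's ambiguous label set (here taken as the LLM's top-2 zero-shot labels), and prefer demonstrations the LLM previously misclassified. Embeddings and probabilities are stubbed with toy values; this illustrates the idea rather than reproducing the paper's pipeline.

```python
# Sketch of ambiguity-aware demonstration selection (an illustration of
# the idea, not the paper's code). Embeddings and the model's label
# probabilities are stubbed with toy values.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy pool: (embedding, text, gold_label, label the LLM predicted for it)
pool = [
    (np.array([1.0, 0.1]), "demo 1", "positive", "positive"),
    (np.array([0.9, 0.2]), "demo 2", "neutral",  "negative"),  # misclassified
    (np.array([0.1, 1.0]), "demo 3", "negative", "negative"),
]
test_emb = np.array([1.0, 0.0])

# Step 1: the test example's ambiguous label set = top-2 labels under the
# LLM's zero-shot output distribution (stubbed here).
zero_shot_probs = {"positive": 0.45, "neutral": 0.40, "negative": 0.15}
ambiguous = sorted(zero_shot_probs, key=zero_shot_probs.get, reverse=True)[:2]

# Step 2: rank demos by similarity, keep those whose gold label resolves
# the ambiguity, and prefer ones the LLM previously got wrong.
ranked = sorted(pool, key=lambda d: cosine(d[0], test_emb), reverse=True)
selected = [d for d in ranked
            if d[2] in ambiguous          # gold label is in the ambiguous set
            and d[3] != d[2]]             # model misclassified this demo
print([d[1] for d in selected])           # -> ['demo 2']
```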

Anchor Points: Benchmarking Models with Much Fewer Examples

  • paper_url: http://arxiv.org/abs/2309.08638
  • repo_url: https://github.com/rvivek3/anchorpoints
  • paper_authors: Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela
  • for: To show that language model performance can be benchmarked with much smaller evaluation sets.
  • methods: Anchor Point Selection, a technique for choosing small subsets of a dataset that capture model behavior across the entire dataset.
  • results: Evaluating with 1-30 anchor points accurately ranks models, outperforming uniform sampling and other baselines across 87 model-prompt pairs, and just several anchor points suffice to estimate per-class predictions on all other points with low mean absolute error.
    Abstract Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset. Anchor points reliably rank models: across 87 diverse language model-prompt pairs, evaluating models using 1-30 anchor points outperforms uniform sampling and other baselines at accurately ranking models. Moreover, just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error, sufficient for gauging where the model is likely to fail. Lastly, we present Anchor Point Maps for visualizing these insights and facilitating comparisons of the performance of different models on various regions within the dataset distribution.
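One plausible reading of anchor selection, given that per-example confidences correlate strongly across models: greedily pick examples whose confidence profiles best cover all others. The greedy k-medoids-style procedure below is an illustration, not the authors' released implementation (see the repo above).

```python
# Sketch: pick k "anchor" examples whose confidence profiles (across
# models) are most representative of all other examples. Illustrative
# greedy coverage maximization, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
conf = rng.random((8, 100))     # conf[m, i]: model m's confidence on example i

# Correlation between examples across models: corr[i, j] high means
# examples i and j behave alike, so one can stand in for the other.
corr = np.corrcoef(conf.T)      # shape (100, 100)

def select_anchors(corr, k):
    anchors = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for cand in range(corr.shape[0]):
            if cand in anchors:
                continue
            trial = anchors + [cand]
            # Coverage: every example should correlate well with some anchor.
            score = corr[:, trial].max(axis=1).sum()
            if score > best_score:
                best, best_score = cand, score
        anchors.append(best)
    return anchors

print(select_anchors(corr, k=5))
```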

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

  • paper_url: http://arxiv.org/abs/2309.07875
  • repo_url: https://github.com/vinid/instruction-llms-safety-eval
  • paper_authors: Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
  • for: To ask whether instruction-tuned large language models that emphasize only helpfulness neglect safety.
  • methods: A range of experiments evaluates the helpfulness and safety of popular instruction-tuned models.
  • results: Several popular instruction-tuned models are highly unsafe; adding just 3% safety examples (a few hundred demonstrations) to the training set substantially improves safety, but excessive safety-tuning makes models refuse reasonable prompts that superficially resemble unsafe ones.
    Abstract Training large language models to follow instructions makes them perform better on a wide range of tasks, generally becoming more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not safety, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) in the training set when fine-tuning a model like LLaMA can substantially improve their safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find a behavior of exaggerated safety, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on trade-offs in training LLMs to follow instructions and exhibit safe behavior.
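The 3% figure translates into a simple data-mixing step before fine-tuning. A minimal sketch, with placeholder datasets and counts:

```python
# Sketch of the 3% safety mix: blend a few hundred safety demonstrations
# into an instruction-tuning set before fine-tuning. Data are placeholders.
import random

random.seed(0)
instruction_data = [{"prompt": f"task {i}", "response": "..."} for i in range(10_000)]
safety_data = [{"prompt": "how do I hurt someone?",
                "response": "I can't help with that."}] * 2_000

ratio = 0.03  # fraction of the final mix that is safety data
n_safety = int(len(instruction_data) * ratio / (1 - ratio))  # ~309 examples
mix = instruction_data + random.sample(safety_data, n_safety)
random.shuffle(mix)
print(len(mix), n_safety)
```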

Agents: An Open-source Framework for Autonomous Language Agents

  • paper_url: http://arxiv.org/abs/2309.07870
  • repo_url: https://github.com/aiwaves-cn/agents
  • paper_authors: Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, Mrinmaya Sachan
  • for: To open up recent advances in large language model (LLM) based language agents to a wider non-specialist audience by releasing Agents, an open-source library for building autonomous agents that solve tasks and interact with environments, humans, and other agents through natural language interfaces.
  • methods: The library is carefully engineered to support planning, memory, tool usage, multi-agent communication, and fine-grained symbolic control.
  • results: Agents lets non-specialists build, customize, test, tune, and deploy state-of-the-art autonomous language agents with little coding, and its modularized design makes it easily extensible for researchers.
    Abstract Recent advances on large language models (LLMs) enable researchers and developers to build autonomous language agents that can automatically solve various tasks and interact with environments, humans, and other agents using natural language interfaces. We consider language agents as a promising direction towards artificial general intelligence and release Agents, an open-source library with the goal of opening up these advances to a wider non-specialist audience. Agents is carefully engineered to support important features including planning, memory, tool usage, multi-agent communication, and fine-grained symbolic control. Agents is user-friendly as it enables non-specialists to build, customize, test, tune, and deploy state-of-the-art autonomous language agents without much coding. The library is also research-friendly as its modularized design makes it easily extensible for researchers. Agents is available at https://github.com/aiwaves-cn/agents.

CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration

  • paper_url: http://arxiv.org/abs/2309.07822
  • repo_url: https://github.com/ukplab/catfood
  • paper_authors: Rachneet Sachdeva, Martin Tutek, Iryna Gurevych
  • for: To improve the out-of-domain (OOD) performance of small language models (SLMs) in extractive question answering.
  • methods: Large language models (LLMs) generate counterfactual (CF) instances, i.e., minimally altered inputs, that augment the SLMs' training data.
  • results: Across various LLM generators, CF augmentation consistently improves OOD performance and model calibration for both confidence-based and rationale-augmented calibrators, with gains correlating with the surface-form and semantic diversity of the CF instances.
    Abstract In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly at generating text conditioned on a prompt. In our work, we investigate the use of LLMs to augment training data of small language models (SLMs) with automatically generated counterfactual (CF) instances -- i.e. minimally altered inputs -- in order to improve out-of-domain (OOD) performance of SLMs in the extractive question answering (QA) setup. We show that, across various LLM generators, such data augmentation consistently enhances OOD performance and improves model calibration for both confidence-based and rationale-augmented calibrator models. Furthermore, these performance improvements correlate with higher diversity of CF instances in terms of their surface form and semantic content. Finally, we show that CF augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.
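A sketch of the augmentation loop: prompt an LLM for a minimally edited context with a changed answer, then append the counterfactual instance to the SLM's training set. The prompt wording is illustrative and llm_generate is a stub for whichever LLM API is used; see the repo above for the actual pipeline.

```python
# Sketch of counterfactual (CF) augmentation for extractive QA.
# `llm_generate` is a stub for a real LLM call; the prompt wording is
# an illustrative assumption, not the paper's exact prompt.
CF_PROMPT = """Minimally edit the context so the answer to the question changes.
Question: {question}
Context: {context}
Return the edited context and the new answer."""

def llm_generate(prompt: str) -> dict:
    # Stub: a real implementation would call an LLM here.
    return {"context": "minimally edited context (stub)",
            "answer": "new answer span (stub)"}

def augment(dataset):
    """Return the original examples plus one CF instance per example."""
    augmented = list(dataset)
    for ex in dataset:
        cf = llm_generate(CF_PROMPT.format(**ex))
        augmented.append({"question": ex["question"],
                          "context": cf["context"],
                          "answer": cf["answer"]})
    return augmented

train = [{"question": "Who wrote Hamlet?",
          "context": "Hamlet was written by William Shakespeare.",
          "answer": "William Shakespeare"}]
print(len(augment(train)))  # original + CF instances
```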

Text Classification of Cancer Clinical Trial Eligibility Criteria

  • paper_url: http://arxiv.org/abs/2309.07812
  • repo_url: None
  • paper_authors: Yumeng Yang, Soumya Jayaraj, Ethan B Ludmir, Kirk Roberts
  • for: To support patient accrual in clinical trials, where eligibility criteria are stated in natural language.
  • methods: Text classification methods are applied to seven common exclusion criteria in cancer trials.
  • results: The results demonstrate the feasibility of automatically classifying common exclusion criteria, and a language model pre-trained specifically on clinical trials yields the highest average performance across all criteria.
    Abstract Automatic identification of clinical trials for which a patient is eligible is complicated by the fact that trial eligibility is stated in natural language. A potential solution to this problem is to employ text classification methods for common types of eligibility criteria. In this study, we focus on seven common exclusion criteria in cancer trials: prior malignancy, human immunodeficiency virus, hepatitis B, hepatitis C, psychiatric illness, drug/substance abuse, and autoimmune illness. Our dataset consists of 764 phase III cancer trials with these exclusions annotated at the trial level. We experiment with common transformer models as well as a new pre-trained clinical trial BERT model. Our results demonstrate the feasibility of automatically classifying common exclusion criteria. Additionally, we demonstrate the value of a pre-trained language model specifically for clinical trials, which yields the highest average performance across all criteria.
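The task reduces to trial-level classification over the seven exclusion criteria; a minimal multi-label fine-tuning setup is sketched below with Hugging Face transformers. The checkpoint is a generic placeholder, and framing the seven criteria as one multi-label head is an assumption, whereas the paper's best model is pre-trained specifically on clinical trials.

```python
# Sketch: trial-level multi-label classification over the seven exclusion
# criteria. The checkpoint is a generic placeholder; the paper's best model
# is pre-trained on clinical trials.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CRITERIA = ["prior_malignancy", "hiv", "hepatitis_b", "hepatitis_c",
            "psychiatric_illness", "substance_abuse", "autoimmune_illness"]

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=len(CRITERIA), problem_type="multi_label_classification")

text = ("Exclusion criteria: known HIV infection; active hepatitis B or C; "
        "history of another malignancy within 5 years.")
inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]

for criterion, p in zip(CRITERIA, probs):
    print(f"{criterion}: {p:.2f}")   # untrained head: scores are arbitrary
```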

Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API Names?

  • paper_url: http://arxiv.org/abs/2309.07804
  • repo_url: None
  • paper_authors: Terry Yue Zhuo, Xiaoning Du, Zhenchang Xing, Jiamou Sun, Haowei Quan, Li Li, Liming Zhu
  • for: To examine how well current pre-trained code models know correct API usage, informing better code representations and automated API usage in code intelligence practice.
  • methods: Knowledge probing with cloze-style tests measures the knowledge stored in models, examining their understanding of API fully qualified names (FQNs) from two perspectives: API calls and API imports.
  • results: Current pre-trained code models struggle to understand FQNs, with pre-training strategies significantly affecting API name learning; natural language context helps the models locate Python API names and generalize that knowledge to unseen data, suggesting that incorporating API structure into pre-training can improve automated API usage and code representations.
    Abstract Recent breakthroughs in pre-trained code models, such as CodeBERT and Codex, have shown their superior performance in various downstream tasks. The correctness and unambiguity of API usage among these code models are crucial for achieving desirable program functionalities, requiring them to learn various API fully qualified names structurally and semantically. Recent studies reveal that even state-of-the-art pre-trained code models struggle with suggesting the correct APIs during code generation. However, the reasons for such poor API usage performance are barely investigated. To address this challenge, we propose using knowledge probing as a means of interpreting code models, which uses cloze-style tests to measure the knowledge stored in models. Our comprehensive study examines a code model's capability of understanding API fully qualified names from two different perspectives: API call and API import. Specifically, we reveal that current code models struggle with understanding API names, with pre-training strategies significantly affecting the quality of API name learning. We demonstrate that natural language context can assist code models in locating Python API names and generalize Python API name knowledge to unseen data. Our findings provide insights into the limitations and capabilities of current pre-trained code models, and suggest that incorporating API structure into the pre-training process can improve automated API usage and code representations. This work provides significance for advancing code intelligence practices and direction for future studies. All experiment results, data and source code used in this work are available at https://doi.org/10.5281/zenodo.7902072.

The Dynamical Principles of Storytelling

  • paper_url: http://arxiv.org/abs/2309.07797
  • repo_url: None
  • paper_authors: Isidoros Doxas, James Meiss, Steven Bottone, Tom Strelich, Andrew Plummer, Adrienne Breland, Simon Dennis, Kathy Garvin-Doxas, Michael Klymkowsky
  • for: To study the openings of 1800 short stories, asking whether the first dozen paragraphs of the average narrative follow the action principle defined in arXiv:2309.06600.
  • methods: The order of the opening paragraphs is shuffled to test whether the observed property depends on paragraph order.
  • results: The shuffled average no longer exhibits the property, suggesting a preferential direction in semantic space when starting a story, possibly related to a common Western storytelling tradition as implied by Aristotle in Poetics.
    Abstract When considering the opening part of 1800 short stories, we find that the first dozen paragraphs of the average narrative follow an action principle as defined in arXiv:2309.06600. When the order of the paragraphs is shuffled, the average no longer exhibits this property. The findings show that there is a preferential direction we take in semantic space when starting a story, possibly related to a common Western storytelling tradition as implied by Aristotle in Poetics.

Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary tasks

  • paper_url: http://arxiv.org/abs/2309.07794
  • repo_url: None
  • paper_authors: Danae Sánchez Villegas, Daniel Preoţiuc-Pietro, Nikolaos Aletras
  • for: To better exploit the multimodal information in social media posts for downstream tasks such as sentiment analysis, sarcasm detection, and hate speech classification.
  • methods: Two auxiliary losses are added when fine-tuning any pre-trained multimodal model: Image-Text Contrastive (ITC), which pulls a post's image and text representations together, and Image-Text Matching (ITM), which directly models the semantic correspondence between images and text.
  • results: Combined with five multimodal models, the objectives yield consistent improvements on four popular social media datasets, and detailed analysis pinpoints the scenarios and cases where each auxiliary task is most effective.
    Abstract Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection and hate speech classification. However, combining text and image information is challenging because of the idiosyncratic cross-modal semantics with hidden or complementary information present in matching image-text pairs. In this work, we aim to directly model this by proposing the use of two auxiliary losses jointly with the main task when fine-tuning any pre-trained multimodal model. Image-Text Contrastive (ITC) brings image-text representations of a post closer together and separates them from different posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates the understanding of semantic correspondence between images and text by penalizing unrelated pairs. We combine these objectives with five multimodal models, demonstrating consistent improvements across four popular social media datasets. Furthermore, through detailed analysis, we shed light on the specific scenarios and cases where each auxiliary task proves to be most effective.

Spoken Humanoid Embodied Conversational Agents in Mobile Serious Games: A Usability Assessment

  • paper_url: http://arxiv.org/abs/2309.07773
  • repo_url: None
  • paper_authors: Danai Korre, Judy Robertson
  • for: To examine whether spoken Humanoid Embodied Conversational Agents (HECAs) can foster usability in mobile serious game (MSG) applications.
  • methods: Two styles of agent presentation are compared, a highly human-like HECA and a low human-likeness text agent, to assess how multiple agents and the illusion of humanness affect interaction quality.
  • results: In an experiment with 90 participants, users preferred interacting with the HECAs, a statistically significant difference with a large effect size (d=1.01); many participants attributed their choice to the agent's human-like characteristics. The findings inform the design of future mobile serious games.
    Abstract This paper presents an empirical investigation of the extent to which spoken Humanoid Embodied Conversational Agents (HECAs) can foster usability in mobile serious game (MSG) applications. The aim of the research is to assess the impact of multiple agents and illusion of humanness on the quality of the interaction. The experiment investigates two styles of agent presentation: an agent of high human-likeness (HECA) and an agent of low human-likeness (text). The purpose of the experiment is to assess whether and how agents of high humanlikeness can evoke the illusion of humanness and affect usability. Agents of high human-likeness were designed by following the ECA design model that is a proposed guide for ECA development. The results of the experiment with 90 participants show that users prefer to interact with the HECAs. The difference between the two versions is statistically significant with a large effect size (d=1.01), with many of the participants justifying their choice by saying that the human-like characteristics of the HECA made the version more appealing. This research provides key information on the potential effect of HECAs on serious games, which can provide insight into the design of future mobile serious games.

Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

  • paper_url: http://arxiv.org/abs/2309.07765
  • repo_url: None
  • paper_authors: Sizhou Chen, Songyang Gao, Sen Fang
  • for: To improve model performance on automatic speech recognition (ASR) tasks.
  • methods: A variable-length attention mechanism (the Echo-MSA module) accommodates speech samples of varying duration and complexity.
  • results: Integrating the Echo-MSA module into the primary model's training significantly improves word error rate (WER) while preserving the original model's intrinsic stability.
    Abstract The Transformer architecture has proven to be highly effective for Automatic Speech Recognition (ASR) tasks, becoming a foundational component for a plethora of research in the domain. Historically, many approaches have leaned on fixed-length attention windows, which becomes problematic for varied speech samples in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. This module offers the flexibility to extract speech features across various granularities, spanning from frames and phonemes to words and discourse. The proposed design captures the variable length feature of speech and addresses the limitations of fixed-length attention. Our evaluation leverages a parallel attention architecture complemented by a dynamic gating mechanism that amalgamates traditional attention with the Echo-MSA module output. Empirical evidence from our study reveals that integrating Echo-MSA into the primary model's training regime significantly enhances the word error rate (WER) performance, all while preserving the intrinsic stability of the original model.

PROGrasp: Pragmatic Human-Robot Communication for Object Grasping

  • paper_url: http://arxiv.org/abs/2309.07759
  • repo_url: None
  • paper_authors: Gi-Cheon Kang, Junghyun Kim, Jaein Kim, Byoung-Tak Zhang
  • for: To propose a new human-robot collaboration task, Pragmatic-IOG, together with the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial).
  • methods: A new robotic system, Pragmatic Object Grasping (PROGrasp), tackles Pragmatic-IOG with modules for visual grounding, question asking, object grasping, and answer interpretation for pragmatic inference.
  • results: Experiments show that PROGrasp is effective both offline (target object discovery) and online (IOG with a physical robot arm).
    Abstract Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object via human-robot natural language interaction. Current IOG systems assume that a human user initially specifies the target object's category (e.g., bottle). Inspired by pragmatics, where humans often convey their intentions by relying on context to achieve goals, we introduce a new IOG task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial). In our proposed task scenario, an intention-oriented utterance (e.g., "I am thirsty") is initially given to the robot. The robot should then identify the target object by interacting with a human user. Based on the task setup, we propose a new robotic system that can interpret the user's intention and pick up the target object, Pragmatic Object Grasping (PROGrasp). PROGrasp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and most importantly, answer interpretation for pragmatic inference. Experimental results show that PROGrasp is effective in offline (i.e., target object discovery) and online (i.e., IOG with a physical robot arm) settings.

The complementary roles of non-verbal cues for Robust Pronunciation Assessment

  • paper_url: http://arxiv.org/abs/2309.07739
  • repo_url: None
  • paper_authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
  • for: To assess non-native (L2) pronunciation by extending the phonetic and phonological features of prior work with the rich information carried by non-verbal cues.
  • methods: The proposed IntraVerbalPA framework combines fine-grained frame-level and abstract utterance-level non-verbal cues with conventional speech and phoneme representations, and introduces a "Goodness of phonemic-duration" metric to effectively model duration distributions.
  • results: The results validate the effectiveness of IntraVerbalPA and its individual components, matching or outperforming existing work.
    Abstract Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA. The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verbal cues, alongside the conventional speech and phoneme representations. Additionally, we introduce the "Goodness of phonemic-duration" metric to effectively model duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that either matches or outperforms existing research works.

Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features

  • paper_url: http://arxiv.org/abs/2309.07733
  • repo_url: None
  • paper_authors: Eliana Pastor, Alkis Koudounas, Giuseppe Attanasio, Dirk Hovy, Elena Baralis
  • for: To explain the inner workings of speech classification models so they can be better understood and trusted.
  • methods: Input perturbation yields easy-to-interpret explanations on two levels: word-level explanations show how each word-aligned audio segment affects the outcome, and paralinguistic explanations answer the counterfactual "What would the model predict if the audio signal were edited this way?"
  • results: Explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian shows that the explanations are faithful to the models' inner workings and plausible to humans.
    Abstract Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We generate easy-to-interpret explanations via input perturbation on two information levels. 1) Word-level explanations reveal how each word-related audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody and background noise) answer the counterfactual: "What would the model prediction be if we edited the audio signal in this way?" We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. Our findings demonstrate that the explanations are faithful to the model's inner workings and plausible to humans. Our method and findings pave the way for future research on interpreting speech models.

PerPLM: Personalized Fine-tuning of Pretrained Language Models via Writer-specific Intermediate Learning and Prompts

  • paper_url: http://arxiv.org/abs/2309.07727
  • repo_url: None
  • paper_authors: Daisuke Oba, Naoki Yoshinaga, Masashi Toyoda
  • for: To improve accuracy on text understanding tasks by personalizing the fine-tuning of pretrained language models (PLMs) for specific writers.
  • methods: Writer-specific prompts personalize a single unified PLM, avoiding the cost of fine-tuning and storing one PLM per user; a writer-specific intermediate learning stage based on masked language modeling extracts task-independent traits of a writer's text.
  • results: Experiments across multiple tasks, datasets, and PLMs reveal the characteristics of the different prompt types and the effectiveness of the intermediate learning approach.
    Abstract The meanings of words and phrases depend not only on where they are used (contexts) but also on who use them (writers). Pretrained language models (PLMs) are powerful tools for capturing context, but they are typically pretrained and fine-tuned for universal use across different writers. This study aims to improve the accuracy of text understanding tasks by personalizing the fine-tuning of PLMs for specific writers. We focus on a general setting where only the plain text from target writers are available for personalization. To avoid the cost of fine-tuning and storing multiple copies of PLMs for different users, we exhaustively explore using writer-specific prompts to personalize a unified PLM. Since the design and evaluation of these prompts is an underdeveloped area, we introduce and compare different types of prompts that are possible in our setting. To maximize the potential of prompt-based personalized fine-tuning, we propose a personalized intermediate learning based on masked language modeling to extract task-independent traits of writers' text. Our experiments, using multiple tasks, datasets, and PLMs, reveal the nature of different prompts and the effectiveness of our intermediate learning approach.
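The abstract does not fix the prompt mechanism; one common realization of writer-specific prompts is soft prompts, where a small trainable prompt per writer is prepended to the token embeddings of a single shared PLM. The sketch below uses illustrative shapes and is not the paper's exact design.

```python
# Sketch of writer-specific soft prompts: one trainable prompt per writer
# is prepended to the token embeddings of a single shared PLM, so only
# the small prompt table is writer-specific. Shapes are illustrative.
import torch
import torch.nn as nn

n_writers, prompt_len, hidden = 100, 10, 768

class WriterPrompts(nn.Module):
    def __init__(self):
        super().__init__()
        # One (prompt_len x hidden) soft prompt per writer.
        self.prompts = nn.Parameter(torch.randn(n_writers, prompt_len, hidden) * 0.02)

    def forward(self, writer_ids, token_embeds):
        # token_embeds: (batch, seq, hidden) from the frozen PLM's embedding layer
        return torch.cat([self.prompts[writer_ids], token_embeds], dim=1)

prompts = WriterPrompts()
token_embeds = torch.randn(4, 32, hidden)        # stand-in for PLM embeddings
writer_ids = torch.tensor([0, 0, 7, 42])
out = prompts(writer_ids, token_embeds)
print(out.shape)  # torch.Size([4, 42, 768]) -> prompt_len + seq tokens
```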

L1-aware Multilingual Mispronunciation Detection Framework

  • paper_url: http://arxiv.org/abs/2309.07719
  • repo_url: None
  • paper_authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
  • for: To propose a multilingual mispronunciation detection and diagnosis (MDD) framework that improves detection performance.
  • methods: The L1-MultiMDD model enriches an end-to-end speech encoder with L1-aware representations: an attention mechanism aligns the input audio with the reference phoneme sequence, L1-L2 speech embeddings from an auxiliary model (pretrained in a multi-task setup to identify L1 and L2) are infused into the primary network, and the model is optimized with a Connectionist Temporal Classification (CTC) loss for English, Arabic, and Mandarin.
  • results: L1-MultiMDD performs consistently across the target languages, with gains in phoneme error rate (PER) and false rejection rate (FRR) on both seen (L2-ARTIC, LATIC, AraVoiceL2v2) and unseen (EpaDB, Speechocean762) datasets.
    Abstract The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serves as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2-speech embedding are extracted from an auxiliary model, pretrained in a multi-task setup identifying L1 and L2 language, and are infused with the primary network. Finally, the L1-MultiMDD is then optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen -- L2-ARTIC, LATIC, and AraVoiceL2v2; and unseen -- EpaDB and Speechocean762 datasets. The consistent gains in PER, and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.

CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders

  • paper_url: http://arxiv.org/abs/2309.07707
  • repo_url: None
  • paper_authors: Heng-Jui Chang, Ning Dong, Ruslan Mavlyutov, Sravya Popuri, Yu-An Chung
  • for: Large-scale self-supervised speech encoders excel at recognition and translation, but their development cost makes building new encoders for new tasks and deploying them in on-device applications infeasible, motivating compression.
  • methods: A novel knowledge distillation method trains student models to copy a large teacher's behavior via masked prediction and contrastive layer-to-layer learning.
  • results: CoLLD outperforms prior compression methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.
    Abstract Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to on-device applications are infeasible. Prior studies propose model compression methods to address this issue, but those works focus on smaller models and less realistic tasks. Thus, we propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders by leveraging masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.
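A sketch of what a contrastive layer-to-layer objective can look like: for each aligned layer pair, the student's frame representation should match the teacher's same-frame output against in-batch negatives (InfoNCE). The projection, temperature, and layer pairing here are assumptions, not the paper's exact recipe.

```python
# Sketch of a contrastive layer-to-layer distillation loss: each student
# layer's frame should match the teacher's same-layer, same-frame output
# against in-batch negatives (InfoNCE). Temperature and layer pairing are
# illustrative.
import torch
import torch.nn.functional as F

def layer_contrastive_loss(student_h, teacher_h, temperature=0.1):
    # student_h, teacher_h: (frames, dim) for one aligned layer pair;
    # the student is assumed projected to the teacher's dim beforehand.
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.T / temperature           # (frames, frames) similarities
    targets = torch.arange(s.size(0))        # positive = same frame index
    return F.cross_entropy(logits, targets)

# Toy layer outputs: 2 aligned layers, 50 frames each.
torch.manual_seed(0)
student_layers = [torch.randn(50, 256) for _ in range(2)]
teacher_layers = [torch.randn(50, 256) for _ in range(2)]

loss = sum(layer_contrastive_loss(s, t)
           for s, t in zip(student_layers, teacher_layers)) / 2
print(loss.item())
```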

A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems

  • paper_url: http://arxiv.org/abs/2309.07682
  • repo_url: https://github.com/lichuangnus/crs-paper-list
  • paper_authors: Chuang Li, Hengchang Hu, Yan Zhang, Min-Yen Kan, Haizhou Li
  • for: To survey holistic conversational recommender systems (CRS), the emerging line of work trained on conversational data collected from real-world scenarios.
  • methods: The literature is summarized in a structured manner, recognizing holistic CRS approaches as having three components: 1) a backbone language model, the optional use of 2) external knowledge, and/or 3) external guidance.
  • results: The survey gives a detailed analysis of CRS datasets and evaluation methods in real application scenarios, along with the authors' insights into current challenges and possible future trends.
    Abstract Conversational recommender systems (CRS) generate recommendations through an interactive process. However, not all CRS approaches use human conversations as their source of interaction data; the majority of prior CRS work simulates interactions by exchanging entity-level information. As a result, claims of prior CRS work do not generalise to real-world settings where conversations take unexpected turns, or where conversational and intent understanding is not perfect. To tackle this challenge, the research community has started to examine holistic CRS, which are trained using conversational data collected from real-world scenarios. Despite their emergence, such holistic approaches are under-explored. We present a comprehensive survey of holistic CRS methods by summarizing the literature in a structured manner. Our survey recognises holistic CRS approaches as having three components: 1) a backbone language model, the optional use of 2) external knowledge, and/or 3) external guidance. We also give a detailed analysis of CRS datasets and evaluation methods in real application scenarios. We offer our insight as to the current challenges of holistic CRS and possible future trends.

Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version)

  • paper_url: http://arxiv.org/abs/2309.07677
  • repo_url: None
  • paper_authors: Chen Gong, Peilin Wu, Jinho D. Choi
  • for: To propose a new evaluation approach for text-based speaker diarization (SD) that overcomes the limitation of traditional metrics, which ignore contextual information in text.
  • methods: Two new metrics, Text-based Diarization Error Rate and Diarization F1, perform utterance- and word-level evaluation by aligning tokens in reference and hypothesis transcripts, covering more error types than existing metrics.
  • results: A multiple sequence alignment algorithm supports multiple reference sequences and handles high-dimensional alignment to the hypothesis via dynamic programming; two tools, align4d (an API for the alignment algorithm) and TranscribeView (for visualizing and evaluating SD errors), aid the creation of high-quality data for dialogue systems.
    Abstract This paper presents a novel evaluation approach to text-based speaker diarization (SD), tackling the limitations of traditional metrics that do not account for any contextual information in text. Two new metrics are proposed, Text-based Diarization Error Rate and Diarization F1, which perform utterance- and word-level evaluations by aligning tokens in reference and hypothesis transcripts. Our metrics encompass more types of errors compared to existing ones, allowing us to make a more comprehensive analysis in SD. To align tokens, a multiple sequence alignment algorithm is introduced that supports multiple sequences in the reference while handling high-dimensional alignment to the hypothesis using dynamic programming. Our work is packaged into two tools, align4d providing an API for our alignment algorithm and TranscribeView for visualizing and evaluating SD errors, which can greatly aid in the creation of high-quality data, fostering the advancement of dialogue systems.
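A simplified sketch of the word-level idea behind a text-based diarization error rate: align the reference and hypothesis word sequences, then count aligned words whose speaker labels disagree (plus unaligned reference words). difflib stands in for the paper's multiple sequence alignment, which additionally supports multiple reference sequences.

```python
# Simplified sketch of a text-based diarization error rate: align the
# reference and hypothesis word sequences, then count aligned words whose
# speaker labels disagree. difflib is a stand-in for the paper's multiple
# sequence alignment algorithm.
from difflib import SequenceMatcher

ref = [("A", "hi"), ("A", "there"), ("B", "how"), ("B", "are"), ("B", "you")]
hyp = [("A", "hi"), ("A", "there"), ("A", "how"), ("B", "are"), ("B", "you")]

ref_words = [w for _, w in ref]
hyp_words = [w for _, w in hyp]

errors, total = 0, len(ref)
matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
for block in matcher.get_matching_blocks():
    for k in range(block.size):
        if ref[block.a + k][0] != hyp[block.b + k][0]:
            errors += 1  # word transcribed, but attributed to wrong speaker

aligned = sum(b.size for b in matcher.get_matching_blocks())
errors += total - aligned          # unaligned reference words count as errors
print(f"text-based DER ~ {errors / total:.2f}")  # -> 0.20 (1 of 5 words)
```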

Automatic Data Visualization Generation from Chinese Natural Language Questions

  • paper_url: http://arxiv.org/abs/2309.07650
  • repo_url: None
  • paper_authors: Yan Ge, Victor Junqiu Wei, Yuanfeng Song, Jason Chen Zhang, Raymond Chi-Wing Wong
  • for: To present a Chinese Text-to-Vis dataset that enables research on generating data visualizations from Chinese natural language questions.
  • methods: The model uses multilingual BERT as the encoder to boost cross-lingual ability and infuses n-gram information into word representation learning.
  • results: Experimental results show that the dataset is challenging and deserves further research.
    Abstract Data visualization has emerged as an effective tool for getting insights from massive datasets. Due to the hardness of manipulating the programming languages of data visualization, automatic data visualization generation from natural languages (Text-to-Vis) is becoming increasingly popular. Despite the plethora of research effort on the English Text-to-Vis, studies have yet to be conducted on data visualization generation from questions in Chinese. Motivated by this, we propose a Chinese Text-to-Vis dataset in the paper and demonstrate our first attempt to tackle this problem. Our model integrates multilingual BERT as the encoder, boosts the cross-lingual ability, and infuses the $n$-gram information into our word representation learning. Our experimental results show that our dataset is challenging and deserves further research.

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer

  • paper_url: http://arxiv.org/abs/2309.07648
  • repo_url: None
  • paper_authors: Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen
  • for: Enhancing the ability of end-to-end (E2E) speech recognition models to recognize named entities.
  • methods: Incorporates a class-based language model (LM) into the factorized neural Transducer (FNT), yielding the proposed C-FNT.
  • results: Significantly reduces errors on named entities without hurting general word recognition.
    Abstract In spite of the excellent strides made by end-to-end (E2E) models in speech recognition in recent years, named entity recognition is still challenging but critical for semantic understanding. In order to enhance the ability to recognize named entities in E2E models, previous studies mainly focus on various rule-based or attention-based contextual biasing algorithms. However, their performance might be sensitive to the biasing weight or degraded by excessive attention to the named entity list, along with a risk of false triggering. Inspired by the success of the class-based language model (LM) in named entity recognition in conventional hybrid systems and the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose a novel E2E model to incorporate class-based LMs into FNT, which is referred as C-FNT. In C-FNT, the language model score of named entities can be associated with the name class instead of its surface form. The experimental results show that our proposed C-FNT presents significant error reduction in named entities without hurting performance in general word recognition.
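The class-based idea is easy to picture outside the transducer: entity surface forms are mapped to class tokens, so the LM assigns probability to the class rather than to each rare name. The snippet below is a toy string-level illustration; the lexicon and the lm_logprob callable are hypothetical, and C-FNT realizes this inside the factorized vocabulary predictor rather than by text rewriting.

```python
# Toy illustration: score a class sequence instead of raw surface forms.
ENTITY_CLASSES = {"alice kim": "<PERSON>", "acme corp": "<ORG>"}  # hypothetical lexicon

def to_class_sequence(text):
    lowered = text.lower()
    for surface, cls in ENTITY_CLASSES.items():
        lowered = lowered.replace(surface, cls)
    return lowered.split()

def class_lm_logprob(text, lm_logprob):
    # "call alice kim at acme corp" -> ["call", "<PERSON>", "at", "<ORG>"],
    # so rare names share their class's statistics instead of being penalized.
    return lm_logprob(to_class_sequence(text))
```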

Dynamic MOdularized Reasoning for Compositional Structured Explanation Generation

  • paper_url: http://arxiv.org/abs/2309.07624
  • repo_url: None
  • paper_authors: Xiyan Fu, Anette Frank
  • for: Improving the structured reasoning ability and compositional generalization of neural models.
  • methods: The study proposes a new structured explanation generation task setting for compositional reasoning research. Previous work iterates over pre-defined inference rules, which restricts it to well-defined tasks and fixed reasoning flows; the study therefore proposes a dynamic modularized reasoning model, MORSE, to improve the compositional generalization of neural models.
  • results: Tested on two benchmarks with reasoning trees of increasing size, MORSE outperforms competitive baselines in structured reasoning and generalization; model ablations and deeper analyses confirm the effectiveness and generalization ability of the dynamic reasoning modules.
    Abstract Despite the success of neural models in solving reasoning tasks, their compositional generalization capabilities remain unclear. In this work, we propose a new setting of the structured explanation generation task to facilitate compositional reasoning research. Previous works found that symbolic methods achieve superior compositionality by using pre-defined inference rules for iterative reasoning. But these approaches rely on brittle symbolic transfers and are restricted to well-defined tasks. Hence, we propose a dynamic modularized reasoning model, MORSE, to improve the compositional generalization of neural models. MORSE factorizes the inference process into a combination of modules, where each module represents a functional unit. Specifically, we adopt modularized self-attention to dynamically select and route inputs to dedicated heads, which specializes them to specific functions. We conduct experiments for increasing lengths and shapes of reasoning trees on two benchmarks to test MORSE's compositional generalization abilities, and find it outperforms competitive baselines. Model ablation and deeper analyses show the effectiveness of dynamic reasoning modules and their generalization abilities.
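The "modularized self-attention" can be caricatured as a router that softly assigns each token to specialized single-head attention modules. The PyTorch sketch below is a minimal, hypothetical rendering of that routing idea, not MORSE's actual architecture.

```python
import torch
import torch.nn as nn

class ModularAttention(nn.Module):
    """A router softly dispatches each token to specialized attention heads."""
    def __init__(self, dim, n_modules):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
             for _ in range(n_modules)])
        self.router = nn.Linear(dim, n_modules)

    def forward(self, x):                                 # x: (B, T, D)
        gates = torch.softmax(self.router(x), dim=-1)     # (B, T, H) routing weights
        outs = [head(x, x, x)[0] for head in self.heads]  # each (B, T, D)
        stacked = torch.stack(outs, dim=-1)               # (B, T, D, H)
        return (stacked * gates.unsqueeze(2)).sum(-1)     # gate-weighted mix

x = torch.randn(2, 5, 32)
print(ModularAttention(32, 4)(x).shape)                   # torch.Size([2, 5, 32])
```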

Zero-shot Audio Topic Reranking using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.07606
  • repo_url: None
  • paper_authors: Mengjie Qian, Rao Ma, Adian Liusie, Erfan Loweimi, Kate M. Knill, Mark J. F. Gales
  • for: The Multimodal Video Search by Examples (MVSE) project uses video clips, rather than traditional text queries, as the search term, enabling far richer search modalities such as images, speaker, content, topic, and emotion.
  • methods: Video attributes are represented by embeddings to support rapid, flexible search over large archives. This work aims to mitigate the performance loss of such rapid search by examining reranking approaches, in particular zero-shot reranking methods using large language models.
  • results: For topic-based retrieval on a publicly available video archive, the BBC Rewind corpus, reranking achieves improved retrieval ranking without any task-specific training data.
    Abstract The Multimodal Video Search by Examples (MVSE) project investigates using video clips as the query term for information retrieval, rather than the more traditional text query. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key element for this process is highly rapid, flexible, search to support large archives, which in MVSE is facilitated by representing video attributes by embeddings. This work aims to mitigate any performance loss from this rapid archive search by examining reranking approaches. In particular, zero-shot reranking methods using large language models are investigated as these are applicable to any video archive audio content. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus. Results demonstrate that reranking can achieve improved retrieval ranking without the need for any task-specific training data.
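A zero-shot LLM reranker of the kind evaluated here reduces to asking the model for a relevance rating per candidate and re-sorting. The sketch below assumes a generic generate(prompt) -> str callable standing in for any LLM API; the prompt wording is illustrative only.

```python
# Hedged sketch of zero-shot topic reranking with an LLM.
def rerank(query, candidates, generate):
    scored = []
    for text in candidates:
        prompt = (f"Topic query: {query}\n"
                  f"Transcript: {text}\n"
                  "On a scale of 1-10, how relevant is the transcript to the "
                  "topic? Answer with a single number.")
        try:
            score = float(generate(prompt).strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0                      # unparseable reply: lowest rank
        scored.append((score, text))
    return [t for _, t in sorted(scored, key=lambda p: -p[0])]
```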

Revisiting Supertagging for HPSG

  • paper_url: http://arxiv.org/abs/2309.07590
  • repo_url: None
  • paper_authors: Olga Zamaraeva, Carlos Gómez-Rodríguez
  • for: Developing new supertaggers for HPSG, trained on HPSG-based treebanks, using SVM and neural CRF- and BERT-based models.
  • methods: The treebanks feature high-quality annotation grounded in a well-developed linguistic theory and include diverse, challenging test sets beyond the usual WSJ section 23 and Wikipedia data; the paper compares the previously dominant MaxEnt-based models with SVM and neural CRF- and BERT-based taggers.
  • results: Both SVM and neural supertaggers achieve considerably higher accuracy than the baseline; the fine-tuned BERT-based tagger reaches 97.26% accuracy on 1000 WSJ23 sentences and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb) data.
    Abstract We present new supertaggers trained on HPSG-based treebanks. These treebanks feature high-quality annotation based on a well-developed linguistic theory and include diverse and challenging test datasets, beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging has previously relied on MaxEnt-based models. We use SVM and neural CRF- and BERT-based methods and show that both SVM and neural supertaggers achieve considerably higher accuracy compared to the baseline. Our fine-tuned BERT-based tagger achieves 97.26% accuracy on 1000 sentences from WSJ23 and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb). We conclude that it therefore makes sense to integrate these new supertaggers into modern HPSG parsers, and we also hope that the diverse and difficult datasets we used here will gain more popularity in the field. We contribute the complete dataset reformatted for token classification.
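Framed as token classification, the neural supertaggers are straightforward to set up with standard libraries. The skeleton below uses Hugging Face transformers; the label-inventory size is a placeholder (HPSG lexical type inventories are large), and fine-tuning on the released treebank data is omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

NUM_SUPERTAGS = 1000     # placeholder; set to the treebank's lexical type count
tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=NUM_SUPERTAGS)

batch = tok("The cat sat .", return_tensors="pt")
logits = model(**batch).logits           # (1, seq_len, NUM_SUPERTAGS)
pred = logits.argmax(-1)                 # one supertag id per subword token
print(pred.shape)
```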

Adaptive Prompt Learning with Distilled Connective Knowledge for Implicit Discourse Relation Recognition

  • paper_url: http://arxiv.org/abs/2309.07561
  • repo_url: https://github.com/wangzl99/AdaptPrompt
  • paper_authors: Bang Wang, Zhenglin Wang, Wei Xiang, Yijun Mo
  • for: Improving implicit discourse relation recognition (IDRR), i.e., recognizing the discourse relation between two text segments without an explicit connective, while reducing the manual design effort of prompting through continuous prompting and knowledge distillation.
  • methods: The paper proposes a continuous prompting method, AdaptPrompt, which forms templates from trainable virtual tokens and automatically selects the most suitable one by gradient search in the embedding space; an answer-relation mapping rule generates virtual answers as the answer space, and a teacher-student architecture distills knowledge from annotated connectives.
  • results: Experiments on the up-to-date PDTB Corpus V3.0 confirm the design objectives, with better relation recognition performance than state-of-the-art competitors.
    Abstract Implicit discourse relation recognition (IDRR) aims at recognizing the discourse relation between two text segments without an explicit connective. Recently, prompt learning has been applied to the IDRR task with great performance improvements over various neural network-based approaches. However, the discrete nature of the state-of-the-art prompting approach requires manual design of templates and answers, a big hurdle for its practical application. In this paper, we propose a continuous version of prompt learning together with connective knowledge distillation, called AdaptPrompt, to reduce manual design efforts via continuous prompting while further improving performance via knowledge transfer. In particular, we design and train a few virtual tokens to form continuous templates and automatically select the most suitable one by gradient search in the embedding space. We also design an answer-relation mapping rule to generate a few virtual answers as the answer space. Furthermore, we notice the importance of annotated connectives in the training dataset and design a teacher-student architecture for knowledge transfer. Experiments on the up-to-date PDTB Corpus V3.0 validate our design objectives in terms of the better relation recognition performance over the state-of-the-art competitors.
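The "continuous template" is a handful of trainable virtual-token embeddings prepended to the frozen model's input embeddings, so that gradient search happens in embedding space rather than over discrete template words. A minimal sketch of that component (dimensions and initialization are illustrative):

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Trainable virtual-token embeddings prepended to the input sequence."""
    def __init__(self, n_virtual, dim):
        super().__init__()
        self.virtual = nn.Parameter(torch.randn(n_virtual, dim) * 0.02)

    def forward(self, input_embeds):                  # (B, T, D)
        b = input_embeds.size(0)
        prompt = self.virtual.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

x = torch.randn(4, 16, 768)                 # e.g. embeddings from a frozen PLM
print(ContinuousPrompt(5, 768)(x).shape)    # torch.Size([4, 21, 768])
```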

DBLPLink: An Entity Linker for the DBLP Scholarly Knowledge Graph

  • paper_url: http://arxiv.org/abs/2309.07545
  • repo_url: https://github.com/uhh-lt/dblplink
  • paper_authors: Debayan Banerjee, Arefa, Ricardo Usbeck, Chris Biemann
  • for: This demo paper presents DBLPLink, a web application that performs entity linking over the DBLP scholarly knowledge graph.
  • methods: The application uses text-to-text pretrained language models, such as T5, to produce entity label spans from an input question; entity candidates are fetched from a database based on the labels, and an entity re-ranker sorts them using entity embeddings such as TransE, DistMult, and ComplEx.
  • results: Results can be displayed under different KG embedding models so users can compare and contrast them; the demo is available at https://ltdemos.informatik.uni-hamburg.de/dblplink/.
    Abstract In this work, we present a web application named DBLPLink, which performs entity linking over the DBLP scholarly knowledge graph. DBLPLink uses text-to-text pre-trained language models, such as T5, to produce entity label spans from an input text question. Entity candidates are fetched from a database based on the labels, and an entity re-ranker sorts them based on entity embeddings, such as TransE, DistMult and ComplEx. The results are displayed so that users may compare and contrast the results between T5-small, T5-base and the different KG embeddings used. The demo can be accessed at https://ltdemos.informatik.uni-hamburg.de/dblplink/.
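Of the three embedding models, TransE is the simplest to illustrate: a triple (h, r, t) is plausible when h + r ≈ t, so candidates can be ranked by negative distance. The vectors below are random stand-ins; how DBLPLink forms its query-side representations is more involved than this sketch.

```python
import numpy as np

def transe_score(head, relation, tail):
    # Higher (less negative) means head + relation is closer to tail.
    return -np.linalg.norm(head + relation - tail)

def rerank_candidates(query_vec, relation_vec, candidates):
    # candidates: {entity_label: embedding}
    return sorted(candidates,
                  key=lambda e: transe_score(query_vec, relation_vec, candidates[e]),
                  reverse=True)

rng = np.random.default_rng(0)
cands = {f"dblp:author{i}": rng.normal(size=50) for i in range(3)}
print(rerank_candidates(rng.normal(size=50), rng.normal(size=50), cands))
```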

Direct Text to Speech Translation System using Acoustic Units

  • paper_url: http://arxiv.org/abs/2309.07478
  • repo_url: None
  • paper_authors: Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent, Jarod Duret
  • for: This paper proposes a direct text-to-speech translation system that takes text in a source language as input and generates speech in the target language without requiring text transcriptions in that language.
  • methods: The framework extracts acoustic units using a speech encoder combined with a clustering algorithm, trains an encoder-decoder architecture to predict the units, and finally uses a vocoder to generate speech from the units.
  • results: Tested on the new CVSS corpus, the system is competitive for most language pairs evaluated and shows a remarkable improvement when initialized with a model pretrained on more languages.
    Abstract This paper proposes a direct text to speech translation system using discrete acoustic units. This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language. Motivated by the success of acoustic units in previous works for direct speech to speech translation systems, we use the same pipeline to extract the acoustic units using a speech encoder combined with a clustering algorithm. Once units are obtained, an encoder-decoder architecture is trained to predict them. Then a vocoder generates speech from units. Our approach for direct text to speech translation was tested on the new CVSS corpus with two different text mBART models employed as initialisation. The systems presented report competitive performance for most of the language pairs evaluated. Besides, results show a remarkable improvement when initialising our proposed architecture with a model pre-trained with more languages.
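The unit-extraction pipeline can be sketched as k-means over frame-level speech-encoder features; the random arrays below stand in for real encoder outputs, and the cluster count is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(2000, 768)      # stand-in for pooled encoder features
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)

utterance = np.random.randn(120, 768)    # one utterance's frame features
units = km.predict(utterance)            # one discrete unit per frame
# Collapse consecutive repeats, as is common for acoustic unit sequences.
dedup = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(dedup[:10])
```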

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

  • paper_url: http://arxiv.org/abs/2309.07462
  • repo_url: None
  • paper_authors: Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
  • for: This paper aims to investigate the use of LLM-based evaluators for scaling up multilingual evaluation in NLP tasks, and to calibrate LLM-based evaluation against human judgments.
  • methods: The paper uses LLM-based evaluators to evaluate the performance of NLP models in eight languages, and compares the results with human judgments.
  • results: The study finds that LLM-based evaluators may exhibit a bias towards higher scores and should be used with caution, particularly in low-resource and non-Latin script languages; calibrating them against a dataset of native-speaker judgments is important for ensuring accurate evaluation.
    Abstract Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks, such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators, that can rank or score the output of other models (usually LLMs) has become increasingly popular, due to the limitations of current evaluation techniques including the lack of appropriate benchmarks, metrics, cost, and access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores and should be used with caution and should always be calibrated with a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.
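The calibration the authors recommend boils down to comparing LLM scores with native-speaker judgments per language, checking both rank agreement and systematic inflation. A minimal sketch with toy judgments on a 1-5 scale (the paper's actual protocol covers five metrics, three tasks, and eight languages):

```python
import numpy as np
from scipy.stats import kendalltau

def calibrate(llm_scores, human_scores):
    tau, _ = kendalltau(llm_scores, human_scores)   # rank agreement
    inflation = float(np.mean(np.array(llm_scores) - np.array(human_scores)))
    return {"kendall_tau": tau, "mean_score_inflation": inflation}

print(calibrate([5, 4, 5, 3, 5], [4, 4, 3, 2, 3]))  # toy 1-5 judgments
```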

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

  • paper_url: http://arxiv.org/abs/2309.07445
  • repo_url: https://github.com/dadelani/sib-200
  • paper_authors: David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, Annie En-Shiun Lee
  • for: This paper provides a large-scale evaluation dataset for multilingual natural language understanding (NLU), enabling topic-classification evaluation of multilingual language models across 200+ languages and dialects.
  • methods: The dataset builds on the Flores-200 machine translation corpus: the English portion is annotated for topic classification and the sentence-level annotation is extended to the remaining 203 languages in the corpus.
  • results: Evaluation shows a persistently large performance gap between high- and low-resource languages; languages unseen during pretraining, under-represented language families (such as Nilotic and Atlantic-Congo), and languages from Africa, the Americas, Oceania, and South East Asia often perform worst.
    Abstract Despite the progress we have recorded in the last few years in multilingual natural language processing, evaluation is typically limited to a small set of languages with available datasets which excludes a large number of low-resource languages. In this paper, we created SIB-200 -- a large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects to address the lack of evaluation dataset for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in full-supervised setting, cross-lingual transfer setting and prompting of large language model setting show that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Atlantic-Congo), and languages from the regions of Africa, Americas, Oceania and South East Asia, often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200
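The headline analysis, the accuracy gap between high- and low-resource languages, is easy to reproduce from any per-language results table. The groupings and scores below are illustrative placeholders using FLORES-200 language codes:

```python
# Illustrative per-language accuracies (not the paper's numbers).
HIGH = {"eng_Latn": 0.92, "fra_Latn": 0.90, "zho_Hans": 0.88}
LOW = {"nus_Latn": 0.41, "dik_Latn": 0.45, "fij_Latn": 0.57}

def mean(scores):
    return sum(scores.values()) / len(scores)

print(f"high-resource avg: {mean(HIGH):.2f}  "
      f"low-resource avg: {mean(LOW):.2f}  gap: {mean(HIGH) - mean(LOW):.2f}")
```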

Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

  • paper_url: http://arxiv.org/abs/2309.07430
  • repo_url: https://github.com/stanfordmimi/clin-summ
  • paper_authors: Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari
  • for: clinical text summarization across multiple tasks
  • methods: employ domain adaptation methods on eight large language models (LLMs) spanning six datasets and four distinct summarization tasks
  • results: summaries from the best adapted LLM are preferred over human summaries in terms of completeness and correctness, and traditional quantitative NLP metrics correlate with reader study scores.
    Abstract Sifting through vast textual data and summarizing key information imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy across diverse clinical summarization tasks has not yet been rigorously examined. In this work, we employ domain adaptation methods on eight LLMs, spanning six datasets and four distinct summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods in addition to instances where recent advances in LLMs may not lead to improved results. Further, in a clinical reader study with six physicians, we depict that summaries from the best adapted LLM are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis delineates mutual challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and other irreplaceable human aspects of medicine.
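One common adaptation recipe of the kind such studies benchmark is parameter-efficient LoRA fine-tuning of an open LLM on (document, summary) pairs. The sketch below is a generic setup with the peft library, not the paper's exact configuration; gpt2 is only a stand-in model.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    task_type="CAUSAL_LM")   # defaults target attention projections
model = get_peft_model(model, config)
model.print_trainable_parameters()           # only a small fraction is trainable
# ...then fine-tune on (clinical document, reference summary) pairs as usual.
```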

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

  • paper_url: http://arxiv.org/abs/2309.07423
  • repo_url: None
  • paper_authors: Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, Graham Neubig
  • for: This study evaluates the machine translation ability of large language models (LLMs) across a wide range of languages.
  • methods: Experiments use the FLORES-200 benchmark to evaluate MT on 204 languages, together with an MT cost analysis.
  • results: GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of the languages covered. A language's resource level is the most important factor in ChatGPT's relative translation ability, and ChatGPT is especially disadvantaged for LRLs and African languages.
    Abstract Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages.
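Scoring in this setting typically compares system outputs against FLORES references with chrF. A minimal sketch with sacrebleu (the sentences are toy examples; the benchmark supplies the real reference streams):

```python
import sacrebleu

hyps = ["Das Haus ist klein.", "Ich mag Katzen."]
refs = [["Das Haus ist klein.", "Ich liebe Katzen."]]   # one reference stream
print(sacrebleu.corpus_chrf(hyps, refs).score)
```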

PromptASR for contextualized ASR with controllable style

  • paper_url: http://arxiv.org/abs/2309.07414
  • repo_url: https://github.com/k2-fsa/icefall
  • paper_authors: Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey
  • for: This paper proposes a prompt-based end-to-end automatic speech recognition (E2E ASR) framework that achieves contextualized ASR with a controllable transcription style.
  • methods: A dedicated text encoder encodes the text prompts, and the encodings are injected into the speech encoder by cross-attending the features from the two modalities; word-level biasing lists can also serve as prompts to improve recognition of rare words, and an additional style prompt can guide the output style.
  • results: Using the ground-truth text of preceding utterances as the content prompt, the system achieves 21.9% and 6.8% relative word error rate reductions on a book-reading dataset and an in-house dataset compared with a baseline ASR system.
    Abstract Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts and the encodings are injected into the speech encoder by cross-attending the features from two modalities. When using the ground truth text from preceding utterances as content prompt, the proposed system achieves 21.9% and 6.8% relative word error rate reductions on a book reading dataset and an in-house dataset compared to a baseline ASR system. The system can also take word-level biasing lists as prompt to improve recognition accuracy on rare words. An additional style prompt can be given to the text encoder and guide the ASR system to output different styles of transcriptions. The code is available at icefall.
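The injection mechanism, speech frames cross-attending over encoded prompt text, can be sketched as a single fusion layer. The PyTorch module below is a hypothetical simplification of the paper's design.

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Speech frames attend over prompt-text encodings; residual fusion."""
    def __init__(self, dim):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech, prompt):         # (B, T_s, D), (B, T_p, D)
        attended, _ = self.cross(query=speech, key=prompt, value=prompt)
        return self.norm(speech + attended)

speech = torch.randn(2, 100, 256)    # acoustic encoder frames
prompt = torch.randn(2, 12, 256)     # text-encoder outputs for the prompt
print(PromptCrossAttention(256)(speech, prompt).shape)   # torch.Size([2, 100, 256])
```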

CPPF: A contextual and post-processing-free model for automatic speech recognition

  • paper_url: http://arxiv.org/abs/2309.07413
  • repo_url: None
  • paper_authors: Lei Zhang, Zhengkun Tian, Xiang Chen, Jiaming Sun, Hongyu Xiang, Ke Ding, Guanglu Wan
  • for: Integrating ASR post-processing tasks directly into the speech recognition model, so that transcripts can be used without separate post-processing stages.
  • methods: Drawing on the multifaceted capabilities of LLMs and Whisper, multiple ASR text processing tasks, including contextual ASR and several post-processing tasks, are integrated into the recognition model so that it directly generates post-processed text.
  • results: The CPPF model shortens the multi-stage pipeline and prevents the propagation of cascading errors, with no significant loss in recognition performance.
    Abstract ASR systems have become increasingly widespread in recent years. However, their textual outputs often require post-processing tasks before they can be practically utilized. To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper, and focus on integrating multiple ASR text processing tasks related to speech recognition into the ASR model. This integration not only shortens the multi-stage pipeline, but also prevents the propagation of cascading errors, resulting in direct generation of post-processed text. In this study, we focus on ASR-related processing tasks, including Contextual ASR and multiple ASR post processing tasks. To achieve this objective, we introduce the CPPF model, which offers a versatile and highly effective alternative to ASR processing. CPPF seamlessly integrates these tasks without any significant loss in recognition performance.

Advancing Regular Language Reasoning in Linear Recurrent Neural Networks

  • paper_url: http://arxiv.org/abs/2309.07412
  • repo_url: None
  • paper_authors: Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky
  • for: Investigating whether linear recurrent neural networks (LRNNs), which offer Transformer-level language and long-range modeling with rapid parallel training and constant inference cost, can learn the hidden rules in training sequences, such as the grammatical structure of regular languages.
  • methods: The paper theoretically analyzes existing LRNNs and identifies their limitations on regular languages; motivated by this analysis, it proposes a new LRNN equipped with a block-diagonal, input-dependent transition matrix.
  • results: Experiments show that the proposed model is the only LRNN that can length-extrapolate on regular language tasks such as Sum, Even Pair, and Modular Arithmetic.
    Abstract In recent studies, linear recurrent neural networks (LRNNs) have achieved Transformer-level performance in natural language modeling and long-range modeling while offering rapid parallel training and constant inference costs. With the resurged interest in LRNNs, we study whether they can learn the hidden rules in training sequences, such as the grammatical structures of regular language. We theoretically analyze some existing LRNNs and discover their limitations on regular language. Motivated by the analysis, we propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix. Experiments suggest that the proposed model is the only LRNN that can perform length extrapolation on regular language tasks such as Sum, Even Pair, and Modular Arithmetic.
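The proposed recurrence can be written as h_t = A(x_t) h_{t-1} + B x_t with A(x_t) block-diagonal and input-dependent. In the toy sketch below each block's entries are a squashed linear function of the current input; this parameterization is hypothetical and only meant to show the structure.

```python
import torch

d, nblocks, bs = 8, 4, 2                     # hidden dim = nblocks * bs
W = torch.randn(nblocks, d, bs * bs) * 0.1   # maps input to each block's entries
B = torch.randn(d, d) * 0.1

def step(h, x):
    blocks = [torch.tanh(W[i] @ x).view(bs, bs) for i in range(nblocks)]
    A = torch.block_diag(*blocks)            # (d, d) block-diagonal transition
    return A @ h + B @ x                     # h_t = A(x_t) h_{t-1} + B x_t

h = torch.zeros(d)
for x in torch.randn(5, d):                  # toy input sequence
    h = step(h, x)
print(h)
```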

VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue

  • paper_url: http://arxiv.org/abs/2309.07387
  • repo_url: None
  • paper_authors: Yunshui Li, Binyuan Hui, Zhaochao Yin, Wanwei He, Run Luo, Yuxing Long, Min Yang, Fei Huang, Yongbin Li
  • For: The paper aims to address the lack of a standardized evaluation framework for visually-grounded dialog systems by proposing a new benchmark called VDialogUE.
  • Methods: The paper introduces a novel evaluation metric, VDscore, based on the Analytic Hierarchy Process (AHP) method, to provide a comprehensive assessment of multi-modal dialogue systems, and proposes a baseline model named VISIT, which uses a two-stage pre-training strategy to progressively build its multi-modal foundation and dialogue capability.
  • Results: The benchmark defines five core multi-modal dialogue tasks over six datasets; the authors expect the VDialogUE benchmark and their proposed methods to accelerate the development of visually-grounded dialog systems and of more sophisticated and effective pre-trained models.
    Abstract Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework poses a challenge in assessing the development of this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, in order to provide a comprehensive assessment of the model's performance across all tasks, we developed a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialog systems and lead to the development of more sophisticated and effective pre-trained models.
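AHP-style weighting, as used for VDscore, derives task weights from the principal eigenvector of a pairwise-comparison matrix over tasks. The comparison values below are illustrative, not the paper's.

```python
import numpy as np

# Pairwise task-importance comparisons (Saaty scale); illustrative values.
M = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 3.0],
              [1/5, 1/3, 1.0]])

vals, vecs = np.linalg.eig(M)
w = np.real(vecs[:, np.argmax(np.real(vals))])
w = w / w.sum()                              # normalized AHP weights
print("weights:", w)

task_scores = np.array([0.8, 0.6, 0.7])      # per-task model scores (toy)
print("composite:", float(w @ task_scores))  # weighted aggregate, VDscore-style
```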

An Interactive Framework for Profiling News Media Sources

  • paper_url: http://arxiv.org/abs/2309.07384
  • repo_url: None
  • paper_authors: Nikhil Mehta, Dan Goldwasser
  • for: Detecting fake and biased news content on social media and profiling the sources that spread it, which is important for maintaining a healthy society.
  • methods: The paper proposes an interactive framework for news media profiling that combines graph-based news media profiling models, pre-trained large language models, and human insight to characterize the social context on social media.
  • results: Experimental results show that with as few as five human interactions, the framework can rapidly detect fake and biased news media, even in the most challenging setting of emerging news events where test data is unseen.
    Abstract The recent rise of social media has led to the spread of large amounts of fake and biased news, content published with the intent to sway beliefs. While detecting and profiling the sources that spread this news is important to maintain a healthy society, it is challenging for automated systems. In this paper, we propose an interactive framework for news media profiling. It combines the strengths of graph based news media profiling models, Pre-trained Large Language Models, and human insight to characterize the social context on social media. Experimental results show that with as little as 5 human interactions, our framework can rapidly detect fake and biased news media, even in the most challenging settings of emerging news events, where test data is unseen.

Less is More for Long Document Summary Evaluation by LLMs

  • paper_url: http://arxiv.org/abs/2309.07382
  • repo_url: None
  • paper_authors: Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
  • for: This paper proposes a new evaluation approach that addresses the high computational cost and the Lost-in-the-Middle problem of LLM-based long-document summary evaluation.
  • methods: The method first extracts key sentences from the long source document and then prompts the LLM to evaluate the summary against them.
  • results: Experiments show that the proposed method not only reduces evaluation cost but also correlates better with human evaluations; the paper further offers practical recommendations on optimal document length and sentence extraction methods, contributing to cost-effective yet more accurate LLM-based evaluation of text generation.
    Abstract Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.
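Extract-then-Evaluate is easy to prototype: select salient source sentences, then evaluate the summary against only those. The sketch below uses a simple TF-IDF centrality proxy for extraction (the paper compares several extraction methods) and a stand-in generate callable for the LLM judge.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_key_sentences(sentences, k=5):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    centrality = np.asarray((tfidf @ tfidf.T).mean(axis=1)).ravel()
    keep = sorted(np.argsort(-centrality)[:k])      # top-k, original order
    return [sentences[i] for i in keep]

def evaluate_summary(sentences, summary, generate, k=5):
    context = " ".join(extract_key_sentences(sentences, k))
    prompt = (f"Source (key sentences only): {context}\n"
              f"Summary: {summary}\n"
              "Rate the summary's faithfulness from 1 to 5.")
    return generate(prompt)
```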

Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation

  • paper_url: http://arxiv.org/abs/2309.07369
  • repo_url: None
  • paper_authors: Shaoshi Ling, Guoli Ye, Rui Zhao, Yifan Gong
  • For: The paper aims to improve the text adaptation of attention-based encoder-decoder (AED) speech recognition models in industry settings.
  • Methods: The paper proposes a novel hybrid attention-based encoder-decoder (HAED) speech recognition model that separates the acoustic and language models, allowing the use of conventional text-based language model adaptation techniques.
  • Results: The proposed HAED model yields a 21% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with conventional AED models.
    Abstract Attention-based encoder-decoder (AED) speech recognition models have been widely successful in recent years. However, the joint optimization of the acoustic model and language model in an end-to-end manner has created challenges for text adaptation. In particular, adapting text effectively, quickly, and inexpensively has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model, which preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields a 21% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with the conventional AED model.
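Why the separation helps can be seen in a toy rescoring view: with acoustic and language scores kept apart, adapting to a new text domain just means swapping the LM term. The callables below are stand-ins, and HAED integrates this factorization inside the decoder rather than as n-best rescoring.

```python
def rescore(nbest, acoustic_logprob, lm_logprob, lam=0.3):
    """Pick the hypothesis maximizing acoustic score + weighted LM score."""
    return max(nbest, key=lambda h: acoustic_logprob(h) + lam * lm_logprob(h))

# Swapping in a domain-adapted LM requires no change to the acoustic side:
# best = rescore(candidates, am.logprob, adapted_lm.logprob)
```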