results: On a test set with personalized named entities, each approach improves word error rate (WER) by over 10% against a neural rescoring baseline, while natural language prompts improve WER by 7% without any training, at a marginal cost in generalization. Overall, gazetteers perform best, improving WER by 10% while also improving WER on a general test set by 1%.
Abstract
Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized named entities to compare these approaches. On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10%, against a neural rescoring baseline. We also show that on this test set, natural language prompts can improve word error rate by 7% without any training and with a marginal loss in generalization. Overall, gazetteers were found to perform the best with a 10% improvement in word error rate (WER), while also improving WER on a general test set by 1%.
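The paper compares gazetteers, prompting, and a cross-attention encoder-decoder inside a neural rescoring pass, but includes no code here. As a rough illustration of the gazetteer idea only, the sketch below biases an n-best rescoring step toward entries from a user's personalized entity list; the entity list, bonus weight, and scoring scheme are hypothetical choices, not the authors' implementation.

```python
# Minimal sketch of gazetteer-biased n-best rescoring (illustrative only; not the
# paper's implementation). Each ASR hypothesis carries a first-pass score; we add
# a bonus whenever a personalized entity from the user's gazetteer appears.

def rescore_with_gazetteer(nbest, gazetteer, bonus=2.0):
    """nbest: list of (hypothesis_text, first_pass_score); gazetteer: set of entity strings."""
    rescored = []
    for text, score in nbest:
        hits = sum(1 for entity in gazetteer if entity.lower() in text.lower())
        rescored.append((text, score + bonus * hits))
    # Higher score is better; return hypotheses sorted by the biased score.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    nbest = [
        ("call jon snow", -4.1),   # first-pass best, but wrong name
        ("call john snow", -4.3),  # matches the user's contact list
    ]
    gazetteer = {"john snow", "mom", "pizza palace"}
    print(rescore_with_gazetteer(nbest, gazetteer)[0][0])  # -> "call john snow"
```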
Revisiting the DARPA Communicator Data using Conversation Analysis
results: The study finds at least one species of failure caused by the inability of the Communicator systems to handle mixed initiative at the discourse structure level.
Abstract
The state of the art in human computer conversation leaves something to be desired and, indeed, talking to a computer can be downright annoying. This paper describes an approach to identifying ``opportunities for improvement'' in these systems by looking for abuse in the form of swear words. The premise is that humans swear at computers as a sanction and, as such, swear words represent a point of failure where the system did not behave as it should. Having identified where things went wrong, we can work backward through the transcripts and, using conversation analysis (CA) work out how things went wrong. Conversation analysis is a qualitative methodology and can appear quite alien - indeed unscientific - to those of us from a quantitative background. The paper starts with a description of Conversation analysis in its modern form, and then goes on to apply the methodology to transcripts of frustrated and annoyed users in the DARPA Communicator project. The conclusion is that there is at least one species of failure caused by the inability of the Communicator systems to handle mixed initiative at the discourse structure level. Along the way, I hope to demonstrate that there is an alternative future for computational linguistics that does not rely on larger and larger text corpora.
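The analysis itself is qualitative, but the entry point the abstract describes (locate swear words, then work backward through the preceding turns) is easy to mechanize. A hypothetical helper for that first step might look like the sketch below; the word list and the three-turn context window are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch: flag transcript turns containing swear words and return the
# preceding turns so an analyst can work backward to where the dialogue broke down.
# The swear-word list and the 3-turn context window are assumptions, not from the paper.

SWEAR_WORDS = {"damn", "hell", "stupid"}  # placeholder list

def find_failure_points(turns, context=3):
    """turns: list of (speaker, utterance) pairs in dialogue order."""
    failures = []
    for i, (speaker, utterance) in enumerate(turns):
        tokens = {tok.strip(".,!?").lower() for tok in utterance.split()}
        if speaker == "user" and tokens & SWEAR_WORDS:
            failures.append({"index": i, "turn": (speaker, utterance),
                             "preceding": turns[max(0, i - context):i]})
    return failures

if __name__ == "__main__":
    dialogue = [("system", "Where would you like to fly from?"),
                ("user", "I already told you, from Boston."),
                ("system", "Where would you like to fly from?"),
                ("user", "Boston, damn it!")]
    for f in find_failure_points(dialogue):
        print(f["index"], f["preceding"])
```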
Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models
paper_authors: Arman Sakif Chowdhury, G. M. Shahariar, Ahammed Tarik Aziz, Syed Mohibul Alam, Md. Azad Sheikh, Tanveer Ahmed Belal
For: This study aims to investigate how to detect fake news articles in low-resource languages such as Bengali.* Methods: The study proposes a methodology consisting of four distinct approaches, using five pre-trained language models, to classify Bengali fake news articles. The approach includes translating English news articles into Bengali and using augmentation techniques to curb the deficit of fake news articles. The study also experiments with news summarization to tackle the token length limitation of BERT-based models.* Results: Through extensive experimentation and rigorous evaluation, the study shows that summarization and augmentation are effective for Bengali fake news detection. Evaluation uses three separate test sets: the BanglaBERT Base model combined with augmentation reaches 96% accuracy on the first test set; a BanglaBERT model trained on summarized, augmented news articles reaches 97% accuracy on the second; and the mBERT Base model reaches 86% accuracy on the third test set, which is reserved for evaluating generalization. The test sets and implementation are available on GitHub.
Abstract
With the rise of social media and online news sources, fake news has become a significant issue globally. However, the detection of fake news in low resource languages like Bengali has received limited attention in research. In this paper, we propose a methodology consisting of four distinct approaches to classify fake news articles in Bengali using summarization and augmentation techniques with five pre-trained language models. Our approach includes translating English news articles and using augmentation techniques to curb the deficit of fake news articles. Our research also focused on summarizing the news to tackle the token length limitation of BERT based models. Through extensive experimentation and rigorous evaluation, we show the effectiveness of summarization and augmentation in the case of Bengali fake news detection. We evaluated our models using three separate test datasets. The BanglaBERT Base model, when combined with augmentation techniques, achieved an impressive accuracy of 96% on the first test dataset. On the second test dataset, the BanglaBERT model, trained with summarized augmented news articles, achieved 97% accuracy. Lastly, the mBERT Base model achieved an accuracy of 86% on the third test dataset which was reserved for generalization performance evaluation. The datasets and implementations are available at https://github.com/arman-sakif/Bengali-Fake-News-Detection
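As a rough sketch of the fine-tuning setup the abstract implies (a pre-trained Bengali encoder with a classification head), the snippet below uses Hugging Face Transformers. The checkpoint name, hyperparameters, and toy examples are assumptions; the paper's translation, augmentation, and summarization steps are not reproduced here.

```python
# Minimal sketch of fine-tuning a pre-trained encoder for Bengali fake-news
# classification with Hugging Face Transformers. The model name, hyperparameters,
# and the toy data below are assumptions for illustration only.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "csebuetnlp/banglabert"  # assumed checkpoint; swap in any suitable encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy examples standing in for (possibly summarized/augmented) news articles.
train = Dataset.from_dict({
    "text": ["<authentic Bengali news text>", "<fake Bengali news text>"],
    "label": [0, 1],
})
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

args = TrainingArguments(output_dir="bangla-fake-news", num_train_epochs=1,
                         per_device_train_batch_size=8, logging_steps=10)
Trainer(model=model, args=args, train_dataset=train).train()
```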
ChatGPT and Bard Responses to Polarizing Questions
results: The study finds a left-leaning bias for both ChatGPT and Bard, with Bard more likely to provide responses around polarizing topics. Bard also shows fewer guardrails around controversial topics and appears more willing to give comprehensive, somewhat human-like responses. These findings can help stakeholders mitigate misinformative and polarizing responses from LLMs.
Abstract
Recent developments in natural language processing have demonstrated the potential of large language models (LLMs) to improve a range of educational and learning outcomes. Of recent chatbots based on LLMs, ChatGPT and Bard have made it clear that artificial intelligence (AI) technology will have significant implications on the way we obtain and search for information. However, these tools sometimes produce text that is convincing, but often incorrect, known as hallucinations. As such, their use can distort scientific facts and spread misinformation. To counter polarizing responses on these tools, it is critical to provide an overview of such responses so stakeholders can determine which topics tend to produce more contentious responses -- key to developing targeted regulatory policy and interventions. In addition, there currently exists no annotated dataset of ChatGPT and Bard responses around possibly polarizing topics, central to the above aims. We address the indicated issues through the following contribution: Focusing on highly polarizing topics in the US, we created and described a dataset of ChatGPT and Bard responses. Broadly, our results indicated a left-leaning bias for both ChatGPT and Bard, with Bard more likely to provide responses around polarizing topics. Bard seemed to have fewer guardrails around controversial topics, and appeared more willing to provide comprehensive, and somewhat human-like responses. Bard may thus be more likely to be abused by malicious actors. Stakeholders may utilize our findings to mitigate misinformative and/or polarizing responses from LLMs.
Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative
results: Through experiments on the MultiWOZ dataset, the paper identifies the problems AL causes in DPL and proposes a method that estimates rewards and learns the dialog policy without AL, retaining AL's advantages while avoiding issues such as mode collapse.
Abstract
Dialog policies, which determine a system's action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based ones, to effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios with numerous state-action pair combinations. One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it is fraught with the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.
Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models
results: The results show that these methods outperform the un-adapted model across different numbers of training shots in the prompt.
Abstract
A wide variety of natural language tasks are currently being addressed with large-scale language models (LLMs). These models are usually trained with a very large amount of unsupervised text data and adapted to perform a downstream natural language task using methods like fine-tuning, calibration or in-context learning. In this work, we propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples and only few in-domain sample queries. The proposed approach treats the LLM as a black box, adding a stage where the model posteriors are calibrated to the task. Results show that these methods outperform the un-adapted model for different numbers of training shots in the prompt and a previous approach where calibration is performed without using any adaptation data.
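One common way to realize the prior adaptation the abstract describes is to reweight the model's posteriors by the ratio between an in-domain class prior (estimated from a few unlabeled queries) and the prior the model currently implies, then renormalize. The sketch below shows that generic recipe; the paper's exact estimator may differ.

```python
# Sketch of prior-adaptation calibration for a black-box classifier (illustrative;
# the paper's exact estimator may differ). Posteriors are reweighted by the ratio
# of an estimated in-domain class prior to the model's implied prior, then renormalized.
import numpy as np

def adapt_posteriors(posteriors, target_prior, source_prior):
    """posteriors: (n_samples, n_classes) from the un-adapted LLM.
    target_prior: class distribution estimated from unlabeled in-domain queries.
    source_prior: class distribution implied by the model/prompt (e.g., mean posterior)."""
    weights = np.asarray(target_prior) / np.asarray(source_prior)
    adapted = posteriors * weights                        # reweight each class
    return adapted / adapted.sum(axis=1, keepdims=True)   # renormalize each row

if __name__ == "__main__":
    posteriors = np.array([[0.7, 0.3], [0.6, 0.4]])
    source_prior = posteriors.mean(axis=0)                # prior the model currently implies
    target_prior = np.array([0.3, 0.7])                   # estimated from in-domain queries
    print(adapt_posteriors(posteriors, target_prior, source_prior))
```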
To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems?
results: The study finds that participants react differently under different framings of the privacy threat, and that the privacy budget (ε) influences their willingness to share. These results suggest that lay people's willingness to share sensitive textual data depends on how the risk and the privacy budget are framed, and that the choice of ε should not rest solely with researchers or system developers.
Abstract
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, privacy budget $\varepsilon$ that governs the strength of privacy protection, remains largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what $\varepsilon$ thresholds would lead lay people to be willing to share sensitive textual data - to our knowledge, the first study of its kind.
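As background for why ε = 10 is a meaningful number to put in front of lay participants: ε-differential privacy bounds the factor by which any output's probability can change when one person's data is added or removed, and that factor grows as e^ε. The quick check below is standard DP arithmetic, not part of the paper's experiment.

```python
# Quick illustration of how the privacy budget epsilon bounds an adversary's
# inference: under epsilon-DP, Pr[M(D) in S] <= exp(epsilon) * Pr[M(D') in S]
# for neighboring datasets D, D'. Standard DP background, not the paper's code.
import math

for eps in (0.1, 1.0, 10.0):
    print(f"epsilon = {eps:5.1f}  ->  worst-case likelihood ratio <= {math.exp(eps):,.1f}")
# epsilon = 10 allows a ratio of about 22,026, i.e. a formally very weak guarantee,
# which is why the authors ask whether lay people would accept sharing data at that level.
```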
Intent-calibrated Self-training for Answer Selection in Open-domain Dialogues
results: Experiments on two open-domain dialogue datasets show that ICAST consistently outperforms baselines with 1%, 5%, and 10% labeled data, improving the F1 score by 2.06% and 1.00% on the two datasets compared with the strongest baseline using only 5% labeled data.
Abstract
Answer selection in open-domain dialogues aims to select an accurate answer from candidates. Recent success of answer selection models hinges on training with large amounts of labeled data. However, collecting large-scale labeled data is labor-intensive and time-consuming. In this paper, we introduce the predicted intent labels to calibrate answer labels in a self-training paradigm. Specifically, we propose the intent-calibrated self-training (ICAST) to improve the quality of pseudo answer labels through the intent-calibrated answer selection paradigm, in which we employ pseudo intent labels to help improve pseudo answer labels. We carry out extensive experiments on two benchmark datasets with open-domain dialogues. The experimental results show that ICAST outperforms baselines consistently with 1%, 5% and 10% labeled data. Specifically, it improves 2.06% and 1.00% of F1 score on the two datasets, compared with the strongest baseline with only 5% labeled data.
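A schematic of the ICAST idea, keeping a pseudo answer label only when the answer selector is confident and an auxiliary intent prediction agrees with it, is sketched below on toy data. The models, threshold, and agreement rule are stand-ins, not the authors' implementation.

```python
# Schematic sketch of intent-calibrated self-training (ICAST-style), not the authors'
# code: pseudo answer labels are kept only when the answer selector is confident AND
# an auxiliary intent classifier agrees with the label it implies. Toy data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(40, 8)), rng.integers(0, 2, 40)   # labeled answer examples
X_unlab = rng.normal(size=(200, 8))                               # unlabeled pool
intent_model = LogisticRegression().fit(X_lab, y_lab)             # stand-in intent classifier

answer_model = LogisticRegression().fit(X_lab, y_lab)
for _ in range(3):                                                # self-training rounds
    probs = answer_model.predict_proba(X_unlab)
    pseudo = probs.argmax(axis=1)
    confident = probs.max(axis=1) > 0.8                           # confidence filter
    agrees = intent_model.predict(X_unlab) == pseudo              # intent calibration step
    keep = confident & agrees
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, pseudo[keep]])
    answer_model = LogisticRegression().fit(X_aug, y_aug)
print(f"kept {keep.sum()} intent-calibrated pseudo-labels in the last round")
```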
Parmesan: mathematical concept extraction for education
paper_authors: Jacob Collard, Valeria de Paiva, Eswaran Subrahmanian
For: This paper is written for researchers who are not experts in mathematics, but need to understand mathematical concepts in order to conduct multidisciplinary research.* Methods: The paper uses natural language processing techniques such as concept extraction, relation extraction, definition extraction, and entity linking to develop a prototype system for searching and defining mathematical concepts in context, specifically in the field of category theory.* Results: The authors show that existing natural language processing techniques cannot be applied directly to the category theory domain, and suggest hybrid techniques that perform well. They also provide two cleaned mathematical corpora that power the prototype system, which are based on journal articles and wiki pages, respectively.
Abstract
Mathematics is a highly specialized domain with its own unique set of challenges that has seen limited study in natural language processing. However, mathematics is used in a wide variety of fields and multidisciplinary research in many different domains often relies on an understanding of mathematical concepts. To aid researchers coming from other fields, we develop a prototype system for searching for and defining mathematical concepts in context, focusing on the field of category theory. This system, Parmesan, depends on natural language processing components including concept extraction, relation extraction, definition extraction, and entity linking. In developing this system, we show that existing techniques cannot be applied directly to the category theory domain, and suggest hybrid techniques that do perform well, though we expect the system to evolve over time. We also provide two cleaned mathematical corpora that power the prototype system, which are based on journal articles and wiki pages, respectively. The corpora have been annotated with dependency trees, lemmas, and part-of-speech tags.
Going Beyond Local: Global Graph-Enhanced Personalized News Recommendations
results: Evaluation on two public news datasets shows that the model outperforms existing methods while also offering more diverse recommendations.
Abstract
Precisely recommending candidate news articles to users has always been a core challenge for personalized news recommendation systems. Most recent works primarily focus on using advanced natural language processing techniques to extract semantic information from rich textual data, employing content-based methods derived from local historical news. However, this approach lacks a global perspective, failing to account for users' hidden motivations and behaviors beyond semantic information. To address this challenge, we propose a novel model called GLORY (Global-LOcal news Recommendation sYstem), which combines global representations learned from other users with local representations to enhance personalized recommendation systems. We accomplish this by constructing a Global-aware Historical News Encoder, which includes a global news graph and employs gated graph neural networks to enrich news representations, thereby fusing historical news representations by a historical news aggregator. Similarly, we extend this approach to a Global Candidate News Encoder, utilizing a global entity graph and a candidate news aggregator to enhance candidate news representation. Evaluation results on two public news datasets demonstrate that our method outperforms existing approaches. Furthermore, our model offers more diverse recommendations.
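For intuition about the gated graph neural network component the abstract mentions, the sketch below implements one generic gated message-passing step in PyTorch: aggregate neighbor messages, then update each node state through a GRU cell. GLORY's actual graphs, dimensions, and aggregators are not reproduced here.

```python
# Minimal sketch of a gated graph neural network step of the kind used to enrich
# news representations over a global news graph (illustrative only).
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim, bias=False)  # transform neighbor states into messages
        self.gru = nn.GRUCell(dim, dim)             # gated update of each node state

    def forward(self, h, adj):
        """h: (num_nodes, dim) node states; adj: (num_nodes, num_nodes) adjacency matrix."""
        messages = adj @ self.msg(h)                # aggregate messages from neighbors
        return self.gru(messages, h)                # gated fusion of message and previous state

if __name__ == "__main__":
    num_nodes, dim = 5, 16
    h = torch.randn(num_nodes, dim)                 # e.g., initial news embeddings
    adj = (torch.rand(num_nodes, num_nodes) > 0.5).float()
    layer = GatedGraphLayer(dim)
    print(layer(h, adj).shape)                      # torch.Size([5, 16])
```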
Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach
results: The model achieves a macro-average F1 score of approximately 0.73 on the test set, indicating balanced performance across positive, neutral, and negative sentiments. These results demonstrate the effectiveness of CNNs for sentiment analysis tasks, with practical applications in social media analysis, market research, and policy studies.
Abstract
This study addressed the complex task of sentiment analysis on a dataset of 119,988 original tweets from Weibo using a Convolutional Neural Network (CNN), offering a new approach to Natural Language Processing (NLP). The data, sourced from Baidu's PaddlePaddle AI platform, were meticulously preprocessed, tokenized, and categorized based on sentiment labels. A CNN-based model was utilized, leveraging word embeddings for feature extraction, and trained to perform sentiment classification. The model achieved a macro-average F1-score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiments. The findings underscore the effectiveness of CNNs for sentiment analysis tasks, with implications for practical applications in social media analysis, market research, and policy studies. The complete experimental content and code have been made publicly available on the Kaggle data platform for further research and development. Future work may involve exploring different architectures, such as Recurrent Neural Networks (RNN) or transformers, or using more complex pre-trained models like BERT, to further improve the model's ability to understand linguistic nuances and context.
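A minimal TextCNN of the kind the abstract describes (word embeddings, parallel convolutions, max pooling, and a classifier over positive/neutral/negative) might look like the sketch below. The vocabulary size, kernel sizes, and random batch are placeholder assumptions, not the paper's configuration.

```python
# Minimal TextCNN sketch: word embeddings + 1D convolutions + max pooling + classifier.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=50_000, embed_dim=128, num_classes=3,
                 kernel_sizes=(3, 4, 5), channels=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # -> (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1)) # logits: (batch, num_classes)

if __name__ == "__main__":
    model = TextCNN()
    batch = torch.randint(0, 50_000, (8, 64))            # 8 tokenized Weibo posts, length 64
    print(model(batch).shape)                            # torch.Size([8, 3])  pos/neu/neg
```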
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study
results: Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications.
Abstract
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLM's in-context learning for ASR applications. Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions is still a challenging task at the current stage.
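The core measurement behind the paper's negative result is a WER comparison between the raw ASR hypothesis and the LLM-corrected one. The sketch below shows that comparison with a standard word-level edit distance; the example sentences are invented and only illustrate how a fluent LLM rewrite can drift from the audio and raise WER.

```python
# Sketch of the evaluation step implied by the abstract: compare the WER of the raw
# ASR hypothesis against an LLM-"corrected" hypothesis. Standard word-level edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    reference = "turn on the kitchen lights"
    asr_output = "turn on the kitten lights"
    llm_corrected = "please turn on the kitchen light"        # fluent, but drifts from the audio
    print("ASR WER:", wer(reference, asr_output))             # 0.2
    print("LLM-corrected WER:", wer(reference, llm_corrected))  # 0.4, illustrating the paper's finding
```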
Agreement Tracking for Multi-Issue Negotiation Dialogues
results: Pre-training T5-small and T5-base on MultiWOZ 2.4's DST task improves agreement-tracking results by 21% and 9% respectively over training solely on GPT-Negochat, and smaller training-subset experiments validate the method's sample efficiency.
Abstract
Automated negotiation support systems aim to help human negotiators reach more favorable outcomes in multi-issue negotiations (e.g., an employer and a candidate negotiating over issues such as salary, hours, and promotions before a job offer). To be successful, these systems must accurately track agreements reached by participants in real-time. Existing approaches either focus on task-oriented dialogues or produce unstructured outputs, rendering them unsuitable for this objective. Our work introduces the novel task of agreement tracking for two-party multi-issue negotiations, which requires continuous monitoring of agreements within a structured state space. To address the scarcity of annotated corpora with realistic multi-issue negotiation dialogues, we use GPT-3 to build GPT-Negochat, a synthesized dataset that we make publicly available. We present a strong initial baseline for our task by transfer-learning a T5 model trained on the MultiWOZ 2.4 corpus. Pre-training T5-small and T5-base on MultiWOZ 2.4's DST task enhances results by 21% and 9% respectively over training solely on GPT-Negochat. We validate our method's sample-efficiency via smaller training subset experiments. By releasing GPT-Negochat and our baseline models, we aim to encourage further research in multi-issue negotiation dialogue agreement tracking.
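The abstract frames agreement tracking as a DST-style structured prediction task built on T5. A hypothetical sketch of that framing with Hugging Face Transformers is shown below; the task prefix and the serialized agreement-state format are assumptions, and the untuned checkpoint will not produce meaningful output without fine-tuning on MultiWOZ 2.4 and GPT-Negochat.

```python
# Illustrative sketch of agreement tracking as sequence-to-sequence prediction with T5
# (the general recipe the abstract describes, not the authors' released code).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # fine-tuning step omitted here

dialogue = ("Employer: We can offer $60k with a review after six months. "
            "Candidate: Deal, as long as I can work from home on Fridays. "
            "Employer: Agreed.")
prompt = "track agreements: " + dialogue                        # assumed task prefix

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=48)
# After fine-tuning, the target would be a structured state string such as
# "salary=60k; remote_fridays=yes" (assumed serialization, for illustration only).
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```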
National Origin Discrimination in Deep-learning-powered Automated Resume Screening
results: The study finds that relying on such deep-learning-powered automated resume screening tools may lead to decisions favoring or disfavoring certain demographic groups, raising ethical and even legal concerns. To address this, the authors develop a bias mitigation method and validate the study with extensive experiments on real candidate resumes.
Abstract
Many companies and organizations have started to use some form of AI-enabled automated tools to assist in their hiring process, e.g. screening resumes, interviewing candidates, performance evaluation. While those AI tools have greatly improved human resource operations efficiency and provided conveniences to job seekers as well, there are increasing concerns on unfair treatment to candidates, caused by underlying bias in AI systems. Laws around equal opportunity and fairness, like GDPR, CCPA, are introduced or under development, in an attempt to regulate AI. However, it is difficult to implement AI regulations in practice, as technologies are constantly advancing and the risk pertinent to their applications can fail to be recognized. This study examined deep learning methods, a recent technology breakthrough, with focus on their application to automated resume screening. One impressive capability of deep learning methods is the representation of individual words as low-dimensional numerical vectors, called word embeddings, which are learned from aggregated global word-word co-occurrence statistics from a corpus, like Wikipedia or Google news. The resulting word representations possess interesting linear substructures of the word vector space and have been widely used in downstream tasks, like resume screening. However, word embeddings inherit and reinforce the stereotyping from the training corpus, as deep learning models essentially learn a probability distribution of words and their relations from historical data. Our study finds that if we rely on such deep-learning-powered automated resume screening tools, it may lead to decisions favoring or disfavoring certain demographic groups and raise ethical, even legal, concerns. To address the issue, we developed a bias mitigation method. Extensive experiments on real candidate resumes are conducted to validate our study.
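As a rough illustration of how stereotyping in word embeddings can be probed, the sketch below computes a WEAT-style association difference between two groups of names and a set of hiring-related attribute words. The tiny random vectors and word groups are stand-ins; an actual audit (and the paper's bias mitigation method) would operate on trained embeddings and curated name lists.

```python
# Sketch of a WEAT-style check for national-origin associations in word embeddings,
# the kind of stereotype inheritance the abstract describes. Toy vectors only.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word_vec, attribute_vecs):
    return np.mean([cosine(word_vec, a) for a in attribute_vecs])

def bias_score(group_a, group_b, attribute_vecs, emb):
    """Mean association difference between two name groups and an attribute set
    (e.g., 'competent', 'hire'). Values far from 0 indicate a skewed association."""
    a = np.mean([association(emb[w], attribute_vecs) for w in group_a])
    b = np.mean([association(emb[w], attribute_vecs) for w in group_b])
    return a - b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    words = ["name_a1", "name_a2", "name_b1", "name_b2", "competent", "hire"]
    emb = {w: rng.normal(size=50) for w in words}        # stand-in for trained embeddings
    attrs = [emb["competent"], emb["hire"]]
    print(bias_score(["name_a1", "name_a2"], ["name_b1", "name_b2"], attrs, emb))
```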
Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews
for: This study aims to automate the screening phase of Systematic Reviews (SR), improving the efficiency and accuracy of SRs.
methods: The study uses ChatGPT, a conversational AI chatbot backed by a large language model, to automate the screening phase of SRs.
results: The results show that ChatGPT achieves levels of consistency and classification performance comparable to traditional classifiers used in SR automation, making it a viable option for automating SR processes; however, developers need to consider several factors carefully when integrating ChatGPT into SR tools to ensure its performance.
Abstract
By organizing knowledge within a research field, Systematic Reviews (SR) provide valuable leads to steer research. Evidence suggests that SRs have become first-class artifacts in software engineering. However, the tedious manual effort associated with the screening phase of SRs renders these studies a costly and error-prone endeavor. While screening has traditionally been considered not amenable to automation, the advent of generative AI-driven chatbots, backed with large language models is set to disrupt the field. In this report, we propose an approach to leverage these novel technological developments for automating the screening of SRs. We assess the consistency, classification performance, and generalizability of ChatGPT in screening articles for SRs and compare these figures with those of traditional classifiers used in SR automation. Our results indicate that ChatGPT is a viable option to automate the SR processes, but requires careful considerations from developers when integrating ChatGPT into their SR tools.
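A hypothetical sketch of the screening step, prompting a ChatGPT-style model to return an INCLUDE/EXCLUDE decision for one article against the review's criteria, is given below. The prompt wording, model name, and response protocol are assumptions, not the authors' exact setup; it requires the `openai` package (>=1.0) and an `OPENAI_API_KEY` in the environment.

```python
# Hypothetical sketch of using a ChatGPT-style model to screen one article against
# SR inclusion criteria, in the spirit of the study (not the authors' implementation).
from openai import OpenAI

client = OpenAI()

def screen_article(title: str, abstract: str, criteria: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "You are screening articles for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic output helps the consistency analysis
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    decision = screen_article(
        title="A study of test flakiness in CI pipelines",
        abstract="We analyze flaky tests across 100 open-source projects...",
        criteria="Empirical studies of automated software testing in industrial or open-source settings.",
    )
    print(decision)
```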
results: The paper provides a systematic survey of LLM-related concepts, with comprehensive summaries of individual models, datasets, and major findings, organized into a coherent overview of the research direction.
Abstract
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations of the underlying neural networks, context length improvements, model alignment, training datasets, benchmarking, efficiency and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides that overview to the research community. It not only focuses on a systematic treatment of the existing literature on a broad range of LLM related concept, but also pays special attention to providing comprehensive summaries with extensive details about the individual existing models, datasets and major insights. We also pay heed to aligning our overview with the emerging outlook of this research direction by accounting for the other recently materializing reviews of the broader research direction of LLMs. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of this research direction. This review article is intended to not only provide a systematic survey, but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research direction.
The Acquisition of Semantic Relationships between words
for: This paper investigates the relationship between a language's morphology and the semantic relationships between its words, in order to understand how linguistic structure shapes language comprehension.
methods: The paper uses linguistic methods, including morphological analysis and semantic analysis, to study the relationship between language morphology and semantic relationships.
results: The study finds a close connection between language morphology and semantic relationships, a connection that plays an important role in language comprehension and production.
Abstract
The study of semantic relationships has revealed a close connection between these relationships and the morphological characteristics of a language. Morphology, as a subfield of linguistics, investigates the internal structure and formation of words. By delving into the relationship between semantic relationships and language morphology, we can gain deeper insights into how the underlying structure of words contributes to the interpretation and comprehension of language. This paper explores the dynamic interplay between semantic relationships and the morphological aspects of different languages; by examining this intricate relationship, valuable insights can be gained regarding how the structure of words influences language comprehension.
MMBench: Is Your Multi-modal Model an All-around Player?
paper_authors: Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
For: MMBench is a novel multi-modality benchmark designed to evaluate the various abilities of vision-language models.* Methods: MMBench uses a meticulously curated dataset and a novel CircularEval strategy, which incorporates the use of ChatGPT to convert free-form predictions into pre-defined choices, facilitating a more robust evaluation of the model's predictions.* Results: MMBench provides a comprehensive evaluation pipeline for vision-language models, allowing for a more objective and robust assessment of their abilities. It is expected to assist the research community in better evaluating their models and encourage future advancements in this domain.
Abstract
Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.
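For concreteness, the CircularEval strategy can be sketched as follows: the same question is asked once per rotation of the answer options, and a sample counts as correct only if the model selects the right answer under every rotation. The `ask_model` stub below stands in for a real vision-language model plus the ChatGPT-based choice extraction described in the abstract.

```python
# Illustrative sketch of the CircularEval idea: rotate the options, query once per
# rotation, and accept the sample only if every rotation yields the correct answer.

def ask_model(question: str, options: list[str]) -> int:
    """Stub: return the index of the chosen option. Replace with a real VLM call."""
    return max(range(len(options)), key=lambda i: len(options[i]))  # toy heuristic

def circular_eval(question: str, options: list[str], correct: str) -> bool:
    n = len(options)
    for shift in range(n):                           # one pass per rotation of the choices
        rotated = options[shift:] + options[:shift]
        choice = ask_model(question, rotated)
        if rotated[choice] != correct:
            return False                             # a single miss fails the whole sample
    return True

if __name__ == "__main__":
    q = "What is the person in the image doing?"
    opts = ["riding a bicycle", "cooking", "reading", "swimming"]
    print(circular_eval(q, opts, correct="riding a bicycle"))
```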