results: The paper shows that the logistic model yields the best fit.
Abstract
The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the hapax rate. The derivation rests on two assumptions: The first one is the standard urn model which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second assumption posits that the rate of hapaxes is a simple function of the text size. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
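Summary
The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the hapax rate. The derivation rests on two assumptions. First, the standard urn model, under which the word frequency distribution of a shorter text looks as if its tokens were sampled blindly from a longer text. Second, the hapax rate is a simple function of text size. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. The logistic model yields the best fit.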
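As a rough illustration of the two assumptions, the sketch below draws random subsamples from a longer token list (the urn model) and reports the hapax rate at each sample size. It assumes the hapax rate is the share of distinct word types that occur exactly once. The function names `hapax_rate` and `urn_sample_hapax_rates` are hypothetical and not taken from the paper, and none of the four fitted models is implemented here.

```python
import random
from collections import Counter


def hapax_rate(tokens):
    """Share of distinct word types that occur exactly once (assumed definition)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


def urn_sample_hapax_rates(tokens, sizes, seed=0):
    """Hapax rate of random subsamples drawn without replacement (urn model)."""
    rng = random.Random(seed)
    rates = {}
    for n in sizes:
        sample = rng.sample(tokens, n)  # blind sampling of n tokens from the longer text
        rates[n] = hapax_rate(sample)
    return rates
```

Plotting the returned rates against the sample sizes gives the empirical curve that a constant, linear, or logistic function of text size would then be fitted to.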
Joint Dropout: Improving Generalizability in Low-Resource Neural Machine Translation through Phrase Pair Variables
paper_authors: Ali Araabi, Vlad Niculae, Christof Monz
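for: Improving neural machine translation for low-resource language pairs.
methods: Joint Dropout, which substitutes phrases with variables to improve the compositionality of the translation model.
results: Substantial gains in translation quality for low-resource language pairs, together with better robustness and adaptability across domains.
Abstract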
Despite the tremendous success of Neural Machine Translation (NMT), its performance on low-resource language pairs still remains subpar, partly due to the limited ability to handle previously unseen inputs, i.e., generalization. In this paper, we propose a method called Joint Dropout, that addresses the challenge of low-resource neural machine translation by substituting phrases with variables, resulting in significant enhancement of compositionality, which is a key aspect of generalization. We observe a substantial improvement in translation quality for language pairs with minimal resources, as seen in BLEU and Direct Assessment scores. Furthermore, we conduct an error analysis, and find Joint Dropout to also enhance generalizability of low-resource NMT in terms of robustness and adaptability across different domains
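Summary
Despite the success of Neural Machine Translation (NMT), its performance on low-resource language pairs remains subpar, partly because of its limited ability to handle previously unseen inputs, i.e., to generalize. The paper proposes Joint Dropout, which substitutes phrases with variables and thereby improves compositionality, a key aspect of generalization. Translation quality for language pairs with minimal resources improves substantially, as measured by BLEU and Direct Assessment scores. An error analysis further shows that Joint Dropout improves the generalizability of low-resource NMT in terms of robustness and adaptability across domains.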
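The core idea of substituting phrases with variables can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes aligned phrase pairs are already available as token spans, and the `X1`, `X2` variable tokens and the function name `joint_dropout` are hypothetical.

```python
def joint_dropout(src, tgt, phrase_pairs):
    """Replace each aligned phrase pair with a shared variable token.

    src, tgt: lists of tokens.
    phrase_pairs: list of ((src_start, src_end), (tgt_start, tgt_end)) with
    end-exclusive, non-overlapping spans (hypothetical input format).
    """
    src_out, tgt_out = list(src), list(tgt)
    # Number the variables in source order so the numbering is deterministic.
    numbered = list(enumerate(sorted(phrase_pairs), start=1))
    # Replace spans from right to left so earlier indices stay valid.
    for i, ((s0, s1), _) in sorted(numbered, key=lambda x: x[1][0], reverse=True):
        src_out[s0:s1] = [f"X{i}"]
    for i, (_, (t0, t1)) in sorted(numbered, key=lambda x: x[1][1], reverse=True):
        tgt_out[t0:t1] = [f"X{i}"]
    return src_out, tgt_out


src = "das rote Haus ist alt".split()
tgt = "the red house is old".split()
pairs = [((1, 3), (1, 3)), ((4, 5), (4, 5))]
print(joint_dropout(src, tgt, pairs))
# (['das', 'X1', 'ist', 'X2'], ['the', 'X1', 'is', 'X2'])
```

Because the same variable appears on both sides, the model sees the surrounding sentence frame independently of the phrase that fills it, which is the compositional behaviour the abstract describes.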
Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis
paper_authors: Jan Trienes, Paul Youssef, Jörg Schlötterer, Christin Seifert
for: automatization of radiology report summarization to reduce clinicians’ manual work and improve reporting consistency
methods: variable-length extractive summaries as a domain-agnostic guidance signal, competitive with domain-specific methods
results: improved summarization quality compared to unguided summarization, but still limited by content selection and corpus-level inconsistencies
Abstract
Automatically summarizing radiology reports into a concise impression can reduce the manual burden of clinicians and improve the consistency of reporting. Previous work aimed to enhance content selection and factuality through guided abstractive summarization. However, two key issues persist. First, current methods heavily rely on domain-specific resources to extract the guidance signal, limiting their transferability to domains and languages where those resources are unavailable. Second, while automatic metrics like ROUGE show progress, we lack a good understanding of the errors and failure modes in this task. To bridge these gaps, we first propose a domain-agnostic guidance signal in form of variable-length extractive summaries. Our empirical results on two English benchmarks demonstrate that this guidance signal improves upon unguided summarization while being competitive with domain-specific methods. Additionally, we run an expert evaluation of four systems according to a taxonomy of 11 fine-grained errors. We find that the most pressing differences between automatic summaries and those of radiologists relate to content selection including omissions (up to 52%) and additions (up to 57%). We hypothesize that latent reporting factors and corpus-level inconsistencies may limit models to reliably learn content selection from the available data, presenting promising directions for future work.
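Summary
Automatically summarizing radiology reports can reduce clinicians' manual work and improve the consistency of reporting. Previous work sought to improve content selection and factuality through guided abstractive summarization, but two key issues persist. First, current methods rely heavily on domain-specific resources to extract the guidance signal, which limits their transferability to other domains and languages. Second, although automatic metrics such as ROUGE show progress, the errors and failure modes in this task are poorly understood. To bridge these gaps, the authors first propose a domain-agnostic guidance signal in the form of variable-length extractive summaries. Experiments show that this signal improves upon unguided summarization and is competitive with domain-specific methods. An expert evaluation of four systems against a taxonomy of 11 fine-grained errors finds that the main differences between automatic summaries and those written by radiologists concern content selection, including omissions (up to 52%) and additions (up to 57%). The authors hypothesize that latent reporting factors and corpus-level inconsistencies may keep models from reliably learning content selection from the available data, which points to directions for future work.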
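results: RRAML reduces the need to train or retrain LLMs and avoids access to LLM gradients, which improves efficiency and scalability in applications. It also reduces hallucinations and irrelevant information in retrieved results, improving retrieval accuracy and usefulness.
Abstract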
The emergence of large language models (LLMs) has revolutionized machine learning and related fields, showcasing remarkable abilities in comprehending, generating, and manipulating human language. However, their conventional usage through API-based text prompt submissions imposes certain limitations in terms of context constraints and external source availability. To address these challenges, we propose a novel framework called Reinforced Retrieval Augmented Machine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs with supporting information retrieved by a purpose-built retriever from a vast user-provided database. By leveraging recent advancements in reinforcement learning, our method effectively addresses several critical challenges. Firstly, it circumvents the need for accessing LLM gradients. Secondly, our method alleviates the burden of retraining LLMs for specific tasks, as it is often impractical or impossible due to restricted access to the model and the computational intensity involved. Additionally we seamlessly link the retriever's task with the reasoner, mitigating hallucinations and reducing irrelevant, and potentially damaging retrieved documents. We believe that the research agenda outlined in this paper has the potential to profoundly impact the field of AI, democratizing access to and utilization of LLMs for a wide range of entities.
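Summary
Large language models (LLMs) have revolutionized machine learning and related fields, showing remarkable abilities in comprehending, generating, and manipulating human language. Using them through API-based text prompts, however, imposes limits on context and on the availability of external sources. To address these challenges, the authors propose Reinforced Retrieval Augmented Machine Learning (RRAML), a framework that combines the reasoning capabilities of LLMs with supporting information retrieved by a purpose-built retriever from a large user-provided database. Drawing on recent advances in reinforcement learning, the method addresses several critical challenges. First, it circumvents the need to access LLM gradients. Second, it alleviates the burden of retraining LLMs for specific tasks, which is often impractical or impossible because of restricted model access and computational cost. In addition, it links the retriever's task with the reasoner, mitigating hallucinations and reducing irrelevant and potentially damaging retrieved documents. The authors believe this research agenda could profoundly affect the field of AI by making LLMs more widely accessible and usable.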
Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN
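results: According to the paper, an ASR system built on a chain hybrid HMM and CNN-TDNN reaches a Word Error Rate (WER) of 5.2% in a call-center setting, in both clean and noisy environments, on isolated words as well as continuous spontaneous speech.
Abstract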
Call Centers have huge amount of audio data which can be used for achieving valuable business insights and transcription of phone calls is manually tedious task. An effective Automated Speech Recognition system can accurately transcribe these calls for easy search through call history for specific context and content allowing automatic call monitoring, improving QoS through keyword search and sentiment analysis. ASR for Call Center requires more robustness as telephonic environment are generally noisy. Moreover, there are many low-resourced languages that are on verge of extinction which can be preserved with help of Automatic Speech Recognition Technology. Urdu is the $10^{th}$ most widely spoken language in the world, with 231,295,440 worldwide still remains a resource constrained language in ASR. Regional call-center conversations operate in local language, with a mix of English numbers and technical terms generally causing a "code-switching" problem. Hence, this paper describes an implementation framework of a resource efficient Automatic Speech Recognition/ Speech to Text System in a noisy call-center environment using Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. Using Hybrid HMM-DNN approach allowed us to utilize the advantages of Neural Network with less labelled data. Adding CNN with TDNN has shown to work better in noisy environment due to CNN's additional frequency dimension which captures extra information from noisy speech, thus improving accuracy. We collected data from various open sources and labelled some of the unlabelled data after analysing its general context and content from Urdu language as well as from commonly used words from other languages, primarily English and were able to achieve WER of 5.2% with noisy as well as clean environment in isolated words or numbers as well as in continuous spontaneous speech.
Summary
Call centers possess vast amounts of audio data that can be leveraged for valuable business insights, and transcribing phone calls manually is tedious. An effective Automatic Speech Recognition (ASR) system can transcribe these calls accurately, enabling easy search through call history for specific context and content, and allowing automatic call monitoring and improved quality of service (QoS) through keyword search and sentiment analysis. ASR systems for call centers must, however, be more robust because telephonic environments are noisy. Moreover, many low-resource languages are on the verge of extinction, and ASR technology can help preserve them. Urdu, the 10th most widely spoken language in the world with 231,295,440 speakers, remains a resource-constrained language in ASR. Regional call-center conversations often take place in local languages with a mix of English numbers and technical terms, causing a "code-switching" problem.
To address these challenges, the paper proposes an implementation framework for a resource-efficient ASR/Speech-to-Text system in a noisy call-center environment, using Chain Hybrid HMM and CNN-TDNN for code-switched Urdu. The hybrid HMM-DNN approach exploits the advantages of neural networks with less labeled data. Adding a CNN to the TDNN works better in noisy environments because the CNN's additional frequency dimension captures extra information from noisy speech, improving accuracy.
The authors collected data from various open sources and labeled some of the unlabeled data after analyzing its general context and content, covering Urdu as well as commonly used words from other languages, primarily English. The system achieved a Word Error Rate (WER) of 5.2% in both noisy and clean environments, on isolated words or numbers as well as continuous spontaneous speech.
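For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below computes it with standard dynamic programming. The function name is illustrative, and the paper's own scoring pipeline is not described in the text above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so a reported 5.2% corresponds to roughly one word error per nineteen reference words.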
A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization
results: The results show that myQASR improves performance for specific genders, languages, and speakers, without requiring fine-tuning.
Abstract
Recent advancement in Automatic Speech Recognition (ASR) has produced large AI models, which become impractical for deployment in mobile devices. Model quantization is effective to produce compressed general-purpose models, however such models may only be deployed to a restricted sub-domain of interest. We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain. To this end, we propose myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning. myQASR automatically evaluates the quantization sensitivity of network layers by analysing the full-precision activation values. We are then able to generate a personalised mixed-precision quantization scheme for any pre-determined memory budget. Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
Summary
myQASR evaluates the quantization sensitivity of network layers by analyzing full-precision activation values, and generates a personalized mixed-precision quantization scheme for any pre-determined memory budget. Our results show that myQASR improves performance for specific genders, languages, and speakers.
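A budget-constrained mixed-precision scheme of the kind described can be sketched with a simple greedy rule. This is an assumption-laden illustration, not the myQASR algorithm: the per-layer sensitivity scores are taken as given, the bit-width choices `(8, 6, 4, 2)` are arbitrary, and `allocate_bits` is a hypothetical name.

```python
def allocate_bits(sensitivity, num_params, budget_bits, choices=(8, 6, 4, 2)):
    """Greedy mixed-precision allocation (illustrative, not the authors' method).

    sensitivity: per-layer scores, higher means more fragile under quantization.
    num_params:  per-layer parameter counts.
    budget_bits: total number of bits allowed for all weights.
    choices:     allowed bit-widths in descending order.
    """
    bits = [choices[0]] * len(sensitivity)

    def total():
        return sum(b * n for b, n in zip(bits, num_params))

    # Visit layers from least to most sensitive, lowering one step at a time.
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    level = 1
    while total() > budget_bits and level < len(choices):
        for i in order:
            if total() <= budget_bits:
                break
            bits[i] = choices[level]
        level += 1
    if total() > budget_bits:
        raise ValueError("budget cannot be met with the given bit-widths")
    return bits


print(allocate_bits([0.9, 0.1, 0.5], [100, 100, 100], budget_bits=2000))
# [8, 6, 6]
```

The most sensitive layer keeps the highest precision for as long as the budget allows, which is the general intent of sensitivity-guided quantization.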
Fake News Detection Through Graph-based Neural Networks: A Survey
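results: The survey finds that graph-based methods have yielded strong results in fake news detection, especially in modelling how news propagates on social media. Challenges and open problems remain, such as defining and identifying fake news, differences between social media platforms, and data reliability.
Abstract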
The popularity of online social networks has enabled rapid dissemination of information. People now can share and consume information much more rapidly than ever before. However, low-quality and/or accidentally/deliberately fake information can also spread rapidly. This can lead to considerable and negative impacts on society. Identifying, labelling and debunking online misinformation as early as possible has become an increasingly urgent problem. Many methods have been proposed to detect fake news including many deep learning and graph-based approaches. In recent years, graph-based methods have yielded strong results, as they can closely model the social context and propagation process of online news. In this paper, we present a systematic review of fake news detection studies based on graph-based and deep learning-based techniques. We classify existing graph-based methods into knowledge-driven methods, propagation-based methods, and heterogeneous social context-based methods, depending on how a graph structure is constructed to model news related information flows. We further discuss the challenges and open problems in graph-based fake news detection and identify future research directions.
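Summary
The popularity of online social networks has made the dissemination of information very fast: people can share and consume information more rapidly than ever. Low-quality and accidentally or deliberately fake information can spread just as fast, with considerable negative impact on society. Identifying, labelling, and debunking online misinformation as early as possible has become an urgent problem. Many detection methods have been proposed, including deep learning and graph-based approaches. In recent years, graph-based methods have yielded strong results because they can closely model the social context and propagation process of online news. The paper presents a systematic review of fake news detection studies based on graph-based and deep learning techniques. It classifies existing graph-based methods into knowledge-driven, propagation-based, and heterogeneous social context-based methods, depending on how the graph is constructed to model news-related information flows. It also discusses challenges and open problems in graph-based fake news detection and identifies future research directions.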
Tachikuma: Understading Complex Interactions with Multi-Character and Novel Objects by Large Language Models
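paper_authors: Yuanzhi Liang, Linchao Zhu, Yi Yang
for: Improving the complexity and flexibility of AI agents' interactions in virtual worlds, particularly in scenarios with multiple characters and novel objects.
methods: The paper proposes integrating a virtual Game Master (GM) into the agent's world model to oversee information, estimate players' intentions, provide environment descriptions, and give feedback, compensating for deficiencies in current world models.
results: The paper introduces the Tachikuma benchmark, comprising a Multiple character and novel Object based interaction Estimation (MOE) task and a supporting dataset. MOE challenges models to understand characters' intentions and determine their actions accurately in complex contexts. The dataset captures logs of real-time communication during gameplay, providing diverse, grounded, and complex interactions. A simple prompting baseline is presented and evaluated, showing that it improves interaction understanding.
Abstract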
Recent advancements in natural language and Large Language Models (LLMs) have enabled AI agents to simulate human-like interactions within virtual worlds. However, these interactions still face limitations in complexity and flexibility, particularly in scenarios involving multiple characters and novel objects. Pre-defining all interactable objects in the agent's world model presents challenges, and conveying implicit intentions to multiple characters through complex interactions remains difficult. To address these issues, we propose integrating virtual Game Masters (GMs) into the agent's world model, drawing inspiration from Tabletop Role-Playing Games (TRPGs). GMs play a crucial role in overseeing information, estimating players' intentions, providing environment descriptions, and offering feedback, compensating for current world model deficiencies. To facilitate future explorations for complex interactions, we introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation (MOE) task and a supporting dataset. MOE challenges models to understand characters' intentions and accurately determine their actions within intricate contexts involving multi-character and novel object interactions. Besides, the dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations. Finally, we present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding. We hope that our dataset and task will inspire further research in complex interactions with natural language, fostering the development of more advanced AI agents.
Summary
Recent advances in natural language and Large Language Models (LLMs) have enabled AI agents to simulate human-like interactions in virtual worlds. These interactions remain limited in complexity and flexibility, particularly in scenarios involving multiple characters and novel objects. Pre-defining every interactable object in the agent's world model is challenging, and conveying implicit intentions to multiple characters through complex interactions remains difficult. To address these issues, the authors propose integrating virtual Game Masters (GMs) into the agent's world model, drawing inspiration from Tabletop Role-Playing Games (TRPGs). GMs oversee information, estimate players' intentions, provide environment descriptions, and offer feedback, compensating for deficiencies in current world models. To facilitate future work on complex interactions, the authors introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation (MOE) task and a supporting dataset. MOE challenges models to understand characters' intentions and accurately determine their actions in intricate contexts involving multi-character and novel-object interactions. The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions. Finally, the authors present a simple prompting baseline and evaluate it, demonstrating its effectiveness in enhancing interaction understanding. They hope the dataset and task will inspire further research on complex interactions with natural language and foster more advanced AI agents.
Towards Generalising Neural Topical Representations
paper_authors: Xiaohao Yang, He Zhao, Dinh Phung, Lan Du
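for: Improving the generalisation ability of Neural Topic Models (NTMs) so that they produce quality topical representations across different corpora and tasks.
methods: Data augmentation during training to model similar documents, with the Hierarchical Topic Transport Distance (HOTT) measuring the semantic distance between documents.
results: Extensive experiments on multiple NTMs show that the framework significantly improves the generalisation of neural topical representations across corpora.
Abstract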
Topic models have evolved from conventional Bayesian probabilistic models to Neural Topic Models (NTMs) over the last two decades. Although NTMs have achieved promising performance when trained and tested on a specific corpus, their generalisation ability across corpora is rarely studied. In practice, we often expect that an NTM trained on a source corpus can still produce quality topical representation for documents in a different target corpus without retraining. In this work, we aim to improve NTMs further so that their benefits generalise reliably across corpora and tasks. To do so, we propose to model similar documents by minimising their semantic distance when training NTMs. Specifically, similar documents are created by data augmentation during training; the semantic distance between documents is measured by the Hierarchical Topic Transport Distance (HOTT), which computes the Optimal Transport (OT) distance between the topical representations. Our framework can be readily applied to most NTMs as a plug-and-play module. Extensive experiments show that our framework significantly improves the generalisation ability regarding neural topical representation across corpora.
Summary
To achieve this, we propose to model similar documents by minimizing their semantic distance during training. Specifically, we create similar documents by performing data augmentation during training, and we measure the semantic distance between documents using the Hierarchical Topic Transport Distance (HOTT), which computes the Optimal Transport (OT) distance between the topical representations. Our framework can be easily applied to most NTMs as a plug-and-play module.
Extensive experiments show that our framework significantly improves the generalization ability of neural topical representation across corpora.
Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation
for: This paper aims to study the robustness of current text adversarial attacks to round-trip translation and to introduce an intervention-based solution to improve the robustness of adversarial examples.
methods: The paper uses six state-of-the-art text-based adversarial attacks and integrates machine translation into the process of adversarial example generation to improve the robustness of adversarial examples.
results: The paper demonstrates that finding adversarial examples robust to translation can help identify the insufficiency of language models that is common across languages, and motivate further research into multilingual adversarial attacks.
Abstract
Language Models today provide a high accuracy across a large number of downstream tasks. However, they remain susceptible to adversarial attacks, particularly against those where the adversarial examples maintain considerable similarity to the original text. Given the multilingual nature of text, the effectiveness of adversarial examples across translations and how machine translations can improve the robustness of adversarial examples remain largely unexplored. In this paper, we present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation. We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation. Furthermore, we introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation and demonstrating increased robustness to round-trip translation. Our results indicate that finding adversarial examples robust to translation can help identify the insufficiency of language models that is common across languages, and motivate further research into multilingual adversarial attacks.
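Summary
Modern language models achieve high accuracy on many downstream tasks, yet they remain susceptible to adversarial attacks, particularly those whose adversarial examples stay similar to the original text. Given the multilingual nature of text, the effectiveness of adversarial examples across translations, and the ways machine translation can improve their robustness, remain largely unexplored. The paper presents a comprehensive study of the robustness of current text adversarial attacks to round-trip translation. It finds that six state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation. It also introduces an intervention-based solution that integrates machine translation into the generation of adversarial examples and demonstrates increased robustness to round-trip translation. The results indicate that finding adversarial examples robust to translation can help identify weaknesses that language models share across languages and motivate further research into multilingual adversarial attacks.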
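The round-trip check studied in the paper can be sketched as below. The classifier and the two translation functions are passed in as plain callables because the text above does not name specific models; all names here are hypothetical, and the toy usage relies on stand-in functions rather than real translation systems.

```python
from typing import Callable


def survives_round_trip(
    original: str,
    adversarial: str,
    classify: Callable[[str], str],
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
) -> bool:
    """True if the adversarial text still flips the label after round-trip translation."""
    true_label = classify(original)
    if classify(adversarial) == true_label:
        return False  # the example is not adversarial to begin with
    round_tripped = from_pivot(to_pivot(adversarial))
    return classify(round_tripped) != true_label


# Toy stand-ins: a keyword classifier and a "translator" that repairs a typo,
# mimicking how translation can undo a character-level perturbation.
toy_classify = lambda text: "pos" if "good" in text else "neg"
identity = lambda text: text
repairing = lambda text: text.replace("g00d", "good")

print(survives_round_trip("good film", "g00d film", toy_classify, identity, identity))   # True
print(survives_round_trip("good film", "g00d film", toy_classify, identity, repairing))  # False
```

The second call shows the failure mode the paper reports: a perturbation that fools the classifier directly can be neutralised once the text has passed through translation and back.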
Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training
results: Extensive experiments on the End-to-end Speech Challenge Benchmark (ESB) show that SpeechLM-wapat outperforms the original model by a 6.28% reduction in Word Error Rate (WER), achieving a new state of the art.
Abstract
Developing a practically robust automatic speech recognition (ASR) system is challenging since the model should not only maintain the original performance on clean samples, but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation and preserve the performance on clean samples. In addition, wapat utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of wapat on End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat outperforms the original model by 6.28% WER reduction on ESB, achieving the new state-of-the-art.
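Summary
Developing a practically robust automatic speech recognition (ASR) system is challenging: the model must maintain its original performance on clean samples while remaining consistently effective under small perturbations and large domain shifts. To address this, the authors propose WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses adversarial examples in phoneme space as augmentation, making the model invariant to minor fluctuations in phoneme representation while preserving performance on clean samples. It also uses the phoneme representation of augmented samples to guide the generation of adversaries, which helps find more stable and diverse gradient directions and improves generalization. Extensive experiments demonstrate the effectiveness of wapat on the End-to-end Speech Challenge Benchmark (ESB): SpeechLM-wapat outperforms the original model by a 6.28% WER reduction, achieving a new state of the art.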
On the Effectiveness of Offline RL for Dialogue Response Generation
paper_authors: Paloma Sodhi, Felix Wu, Ethan R. Elenberg, Kilian Q. Weinberger, Ryan McDonald
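for: Studying alternatives to teacher forcing to improve dialogue response generation.
methods: Several offline reinforcement learning (RL) methods that optimise sequence-level objectives for dialogue response generation.
results: Offline RL shows a clear performance improvement over teacher forcing without inducing training instability or sacrificing practical training budgets.
Abstract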
A common training technique for language models is teacher forcing (TF). TF attempts to match human language exactly, even though identical meanings can be expressed in different ways. This motivates use of sequence-level objectives for dialogue response generation. In this paper, we study the efficacy of various offline reinforcement learning (RL) methods to maximize such objectives. We present a comprehensive evaluation across multiple datasets, models, and metrics. Offline RL shows a clear performance improvement over teacher forcing while not inducing training instability or sacrificing practical training budgets.
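Summary
Teacher forcing (TF) is a common training technique for language models. TF attempts to match human language exactly, even though the same meaning can be expressed in different ways. This motivates the use of sequence-level objectives for dialogue response generation. The paper studies the efficacy of various offline reinforcement learning (RL) methods for maximizing such objectives, with a comprehensive evaluation across multiple datasets, models, and metrics. Offline RL shows a clear performance improvement over teacher forcing without inducing training instability or sacrificing practical training budgets.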
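results: The study finds that existing hate speech detection software has high failure rates for certain policies, and that automatically matching new examples to policies can improve the traceability of AI systems to requirements or policies.
Abstract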
In recent years, many software systems have adopted AI techniques, especially deep learning techniques. Due to their black-box nature, AI-based systems have brought challenges to traceability, because AI system behaviors are based on models and data, whereas the requirements or policies are rules in the form of natural or programming language. To the best of our knowledge, there is a limited amount of studies on how AI and deep neural network-based systems behave against rule-based requirements/policies. This experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. In particular, we focus on a case study to check AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases which match each moderation policy; we name this dataset HateModerate. Second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software, and we find that these models have high failure rates for certain policies. Finally, since manual labeling is costly, we further propose an automated approach to augment HateModerate by finetuning OpenAI's large language models to automatically match new examples to policies. The dataset and code of this work can be found on our anonymous website: https://sites.google.com/view/content-moderation-project.
Summary
Recently, many software systems have adopted AI techniques, especially deep learning techniques. Due to their black-box nature, AI-based systems have brought challenges to traceability, as their behaviors are based on models and data, whereas the requirements or policies are rules in the form of natural or programming language. To the best of our knowledge, there is a limited amount of studies on how AI and deep neural network-based systems behave against rule-based requirements/policies. This experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. In particular, we focus on a case study to check AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases that match each moderation policy, which we name HateModerate; second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software and find that these models have high failure rates for certain policies; finally, since manual labeling is costly, we further propose an automated approach to augment HateModerate by finetuning OpenAI's large language models to automatically match new examples to policies. The dataset and code of this work can be found on our anonymous website: https://sites.google.com/view/content-moderation-project.
CommonsenseVIS: Visualizing and Understanding Commonsense Reasoning Capabilities of Natural Language Models
results: A user study shows that CommonsenseVIS helps NLP experts conduct systematic and scalable visual analysis of models' relational reasoning over concepts in different situations.
Abstract
Recently, large pretrained language models have achieved compelling performance on commonsense benchmarks. Nevertheless, it is unclear what commonsense knowledge the models learn and whether they solely exploit spurious patterns. Feature attributions are popular explainability techniques that identify important input concepts for model outputs. However, commonsense knowledge tends to be implicit and rarely explicitly presented in inputs. These methods cannot infer models' implicit reasoning over mentioned concepts. We present CommonsenseVIS, a visual explanatory system that utilizes external commonsense knowledge bases to contextualize model behavior for commonsense question-answering. Specifically, we extract relevant commonsense knowledge in inputs as references to align model behavior with human knowledge. Our system features multi-level visualization and interactive model probing and editing for different concepts and their underlying relations. Through a user study, we show that CommonsenseVIS helps NLP experts conduct a systematic and scalable visual analysis of models' relational reasoning over concepts in different situations.
Summary
To address this challenge, we propose CommonsenseVIS, a visual explanatory system that leverages external commonsense knowledge bases to contextualize model behavior for commonsense question-answering. Specifically, we extract relevant commonsense knowledge from inputs and use it to align the model's behavior with human knowledge. Our system features multi-level visualization and interactive model probing and editing for different concepts and their underlying relations.
Through a user study, we demonstrate that CommonsenseVIS helps NLP experts conduct a systematic and scalable visual analysis of the models' relational reasoning over concepts in different situations. By providing a visual interface for exploring the models' behavior, CommonsenseVIS enables experts to gain a deeper understanding of how the models are using commonsense knowledge to make predictions. This can help improve the models' performance and ensure that they are making accurate and informed decisions.
Evaluating Emotional Nuances in Dialogue Summarization
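results: Existing summarization models do not preserve the emotional content of dialogues well. Restricting the training set to emotional dialogues preserves emotional content better while retaining the most salient factual information.
Abstract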
Automatic dialogue summarization is a well-established task that aims to identify the most important content from human conversations to create a short textual summary. Despite recent progress in the field, we show that most of the research has focused on summarizing the factual information, leaving aside the affective content, which can yet convey useful information to analyse, monitor, or support human interactions. In this paper, we propose and evaluate a set of measures $PEmo$, to quantify how much emotion is preserved in dialog summaries. Results show that, summarization models of the state-of-the-art do not preserve well the emotional content in the summaries. We also show that by reducing the training set to only emotional dialogues, the emotional content is better preserved in the generated summaries, while conserving the most salient factual information.
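Summary
Automatic dialogue summarization is a well-established task that aims to identify the most important content in human conversations and produce a short textual summary. Despite recent progress, most research has focused on summarizing factual information and has set aside affective content, which can carry useful information for analysing, monitoring, or supporting human interactions. The paper proposes and evaluates a set of measures, $PEmo$, that quantify how much emotion is preserved in dialogue summaries. Results show that state-of-the-art summarization models do not preserve emotional content well. Restricting the training set to emotional dialogues preserves emotional content better in the generated summaries while retaining the most salient factual information.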