cs.CL - 2023-10-14

Beyond Testers’ Biases: Guiding Model Testing with Knowledge Bases using LLMs

  • paper_url: http://arxiv.org/abs/2310.09668
  • repo_url: None
  • paper_authors: Chenyang Yang, Rishabh Rustogi, Rachel Brower-Sinning, Grace A. Lewis, Christian Kästner, Tongshuang Wu
  • for: This paper presents a tool for model testing that helps testers better identify which aspects of a model need testing.
  • methods: The paper uses large language models to generate knowledge bases and make interactive concept recommendations, helping testers systematically explore diverse concepts.
  • results: In a user study, testers using Weaver identified more of the aspects a model needs to be tested on and uncovered a large number of failing test cases. Weaver also helped practitioners test models in real-world application scenarios, such as code understanding and transcript summarization.
    Abstract Current model testing work has mostly focused on creating test cases. Identifying what to test is a step that is largely ignored and poorly supported. We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing. Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing. Weaver provides rich external knowledge to testers and encourages testers to systematically explore diverse concepts beyond their own biases. In a user study, we show that both NLP experts and non-experts identified more, as well as more diverse concepts worth testing when using Weaver. Collectively, they found more than 200 failing test cases for stance detection with zero-shot ChatGPT. Our case studies further show that Weaver can help practitioners test models in real-world settings, where developers define more nuanced application scenarios (e.g., code understanding and transcript summarization) using LLMs.
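As a toy illustration of interactive concept recommendation from a knowledge base (the `KNOWLEDGE_BASE` contents and the `recommend_concepts` helper are invented for this sketch; Weaver's LLM-generated knowledge bases and recommendation logic are far richer):

```python
# Illustrative sketch (not Weaver's actual implementation): a tiny knowledge
# base of concepts for a stance-detection task, with a greedy recommender
# that surfaces concepts the tester has not explored yet.

# Hypothetical knowledge base: topic -> related concepts an LLM might generate.
KNOWLEDGE_BASE = {
    "stance detection": ["sarcasm", "quoting an opponent", "hedged opinions",
                         "negation", "mixed stance"],
}

def recommend_concepts(topic, explored, k=3):
    """Recommend up to k concepts the tester has not explored yet."""
    candidates = [c for c in KNOWLEDGE_BASE.get(topic, []) if c not in explored]
    return candidates[:k]

recs = recommend_concepts("stance detection", explored={"negation"})
print(recs)  # the first three unexplored concepts
```

A real system would generate and rank these concepts with an LLM rather than a static dictionary.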

Legend at ArAIEval Shared Task: Persuasion Technique Detection using a Language-Agnostic Text Representation Model

  • paper_url: http://arxiv.org/abs/2310.09661
  • repo_url: None
  • paper_authors: Olumide E. Ojo, Olaronke O. Adebanji, Hiram Calvo, Damian O. Dieke, Olumuyiwa E. Ojo, Seye E. Akinsanya, Tolulope O. Abiola, Anna Feldman
  • for: This work describes the authors' participation in the Arabic AI Tasks Evaluation Challenge (ArAIEval) at ArabicNLP 2023, specifically Task 1: identifying persuasion techniques in excerpts from tweets and news articles.
  • methods: The study fine-tunes XLM-RoBERTa, a language-agnostic text representation model, in a multilingual training loop.
  • results: On the test set, the fine-tuned multilingual model achieved a micro F1 score of 0.64 on subtask A of Task 1.
    Abstract In this paper, we share our best performing submission to the Arabic AI Tasks Evaluation Challenge (ArAIEval) at ArabicNLP 2023. Our focus was on Task 1, which involves identifying persuasion techniques in excerpts from tweets and news articles. The persuasion technique in Arabic texts was detected using a training loop with XLM-RoBERTa, a language-agnostic text representation model. This approach proved to be potent, leveraging fine-tuning of a multilingual language model. In our evaluation of the test set, we achieved a micro F1 score of 0.64 for subtask A of the competition.
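The micro F1 score reported above can be sketched for multi-label predictions, where each example carries a set of technique labels (the label names and examples are invented; this is the standard metric definition, not the competition's official scorer):

```python
# Micro F1 aggregates true positives, false positives, and false negatives
# over all examples before computing precision and recall.

def micro_f1(gold, pred):
    """gold, pred: lists of label sets, one set per example."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"loaded_language"}, {"doubt", "slogans"}, set()]
pred = [{"loaded_language"}, {"doubt"}, {"slogans"}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```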

An End-to-End System for Reproducibility Assessment of Source Code Repositories via Their Readmes

  • paper_url: http://arxiv.org/abs/2310.09634
  • repo_url: https://github.com/kaanakdeniz/reproducibility_assessment
  • paper_authors: Eyüp Kaan Akdeniz, Selma Tekir, Malik Nizar Asad Al Hinnawi
  • for: Supporting reproducibility assessment of machine learning research.
  • methods: The system checks Readme files against a widely used template, combining section scores with a custom function, and also trains a hierarchical transformer model to classify Readme files.
  • results: The system assesses the reproducibility of Readme files and produces explainable scores; the section similarity-based system performs better than the hierarchical transformer.
    Abstract Increased reproducibility of machine learning research has been a driving force for dramatic improvements in learning performances. The scientific community further fosters this effort by including reproducibility ratings in reviewer forms and considering them as a crucial factor for the overall evaluation of papers. Accompanying source code is not sufficient to make a work reproducible. The shared codes should meet the ML reproducibility checklist as well. This work aims to support reproducibility evaluations of papers with source codes. We propose an end-to-end system that operates on the Readme file of the source code repositories. The system checks the compliance of a given Readme to a template proposed by a widely used platform for sharing source codes of research. Our system generates scores based on a custom function to combine section scores. We also train a hierarchical transformer model to assign a class label to a given Readme. The experimental results show that the section similarity-based system performs better than the hierarchical transformer. Moreover, it has an advantage regarding explainability since one can directly relate the score to the sections of Readme files.
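The section-based scoring idea can be sketched as follows (the section names, weights, and matching rule are assumptions for illustration; the paper's template and combining function differ):

```python
# Minimal sketch: score a Readme by which template sections it contains,
# combining per-section weights with a simple weighted sum.

TEMPLATE_SECTIONS = {           # hypothetical weights per template section
    "requirements": 0.25,
    "training": 0.25,
    "evaluation": 0.25,
    "results": 0.25,
}

def readme_score(readme_text):
    """Weighted sum over template sections found as markdown headers."""
    headers = {line.lstrip("# ").strip().lower()
               for line in readme_text.splitlines() if line.startswith("#")}
    return sum(w for name, w in TEMPLATE_SECTIONS.items() if name in headers)

readme = "# Requirements\npip install -r requirements.txt\n# Evaluation\nrun eval.py"
print(readme_score(readme))  # 0.5: two of four template sections present
```

Because the score decomposes over sections, one can point to exactly which missing sections lowered it, which is the explainability advantage noted above.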

A Digital Language Coherence Marker for Monitoring Dementia

  • paper_url: http://arxiv.org/abs/2310.09623
  • repo_url: None
  • paper_authors: Dimitris Gkoumas, Adam Tsakalidis, Maria Liakata
  • for: This paper proposes a novel, reliable, and non-intrusive method that uses spontaneous language to diagnose and monitor dementia.
  • methods: The paper introduces a new task of learning the temporal logical consistency of utterances in short transcribed narratives and investigates a range of neural approaches.
  • results: Compared with healthy controls, people with dementia show clear differences in language coherence, and the coherence marker is highly associated with clinical bio-markers. The marker also shows generalisability potential to other related conditions.
    Abstract The use of spontaneous language to derive appropriate digital markers has become an emergent, promising and non-intrusive method to diagnose and monitor dementia. Here we propose methods to capture language coherence as a cost-effective, human-interpretable digital marker for monitoring cognitive changes in people with dementia. We introduce a novel task to learn the temporal logical consistency of utterances in short transcribed narratives and investigate a range of neural approaches. We compare such language coherence patterns between people with dementia and healthy controls and conduct a longitudinal evaluation against three clinical bio-markers to investigate the reliability of our proposed digital coherence marker. The coherence marker shows a significant difference between people with mild cognitive impairment, those with Alzheimer's Disease and healthy controls. Moreover our analysis shows high association between the coherence marker and the clinical bio-markers as well as generalisability potential to other related conditions.
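The temporal-consistency idea can be illustrated with a simple ordering metric (an illustrative stand-in, not the paper's neural method): score a predicted ordering of utterances by the fraction of utterance pairs that appear in their original temporal order.

```python
# Fraction of concordant pairs between a predicted ordering and the true
# temporal order (a Kendall-tau-style consistency score).

from itertools import combinations

def order_consistency(predicted_positions):
    """predicted_positions[i] = predicted rank of the utterance whose true rank is i."""
    pairs = list(combinations(range(len(predicted_positions)), 2))
    concordant = sum(predicted_positions[i] < predicted_positions[j] for i, j in pairs)
    return concordant / len(pairs)

print(order_consistency([0, 1, 2, 3]))  # 1.0: perfectly ordered
print(order_consistency([1, 0, 3, 2]))  # two adjacent swaps lower the score
```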

An Expression Tree Decoding Strategy for Mathematical Equation Generation

  • paper_url: http://arxiv.org/abs/2310.09619
  • repo_url: None
  • paper_authors: Wenqi Zhang, Yongliang Shen, Qingpeng Nong, Zeqi Tan, Yanna Ma, Weiming Lu
  • for: This paper addresses generating mathematical equations from natural language.
  • methods: The paper proposes a tree-structured, expression-level generation method that generates equations via a layer-wise parallel decoding strategy and a bipartite matching algorithm.
  • results: Experiments show the method excels at generating equations with complex structures, outperforming baseline methods.
    Abstract Generating mathematical equations from natural language requires an accurate understanding of the relations among math expressions. Existing approaches can be broadly categorized into token-level and expression-level generation. The former treats equations as a mathematical language, sequentially generating math tokens. Expression-level methods generate each expression one by one. However, each expression represents a solving step, and there naturally exist parallel or dependent relations between these steps, which are ignored by current sequential methods. Therefore, we integrate tree structure into the expression-level generation and advocate an expression tree decoding strategy. To generate a tree with expression as its node, we employ a layer-wise parallel decoding strategy: we decode multiple independent expressions (leaf nodes) in parallel at each layer and repeat parallel decoding layer by layer to sequentially generate these parent node expressions that depend on others. Besides, a bipartite matching algorithm is adopted to align multiple predictions with annotations for each layer. Experiments show our method outperforms other baselines, especially for these equations with complex structures.
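The layer-wise structure can be sketched on the evaluation side (an illustrative toy, not the paper's decoder): expressions within a layer are independent, so they could be produced in parallel, while parent expressions consume results from earlier layers.

```python
# Each step: (name, operator, operand pair); operands are numbers or names of
# earlier steps. Layer 0 holds independent leaf expressions; layer 1 holds a
# parent expression that depends on them.
layers = [
    [("e1", "+", (3, 4)), ("e2", "*", (2, 5))],   # parallel leaf expressions
    [("e3", "*", ("e1", "e2"))],                  # parent expression
]

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def evaluate(layers):
    values = {}
    for layer in layers:                       # sequential across layers
        for name, op, (a, b) in layer:         # independent within a layer
            a = values.get(a, a)               # resolve references to earlier steps
            b = values.get(b, b)
            values[name] = OPS[op](a, b)
    return values

print(evaluate(layers)["e3"])  # (3 + 4) * (2 * 5) = 70
```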

Moral consensus and divergence in partisan language use

  • paper_url: http://arxiv.org/abs/2310.09618
  • repo_url: None
  • paper_authors: Nakwon Rim, Marc G. Berman, Yuan Chang Leong
  • for: Uncovering the psychological dimensions along which partisan language is divided.
  • methods: The paper analyzes large-scale, real-world language use in Reddit communities (294,476,146 comments) and news outlets (6,749,781 articles), using word embedding models that capture semantic associations from word co-occurrences, probing seven political topics (e.g., abortion, immigration) along dimensions of morality, threat, and valence.
  • results: Despite a shared moral understanding across the political spectrum, there are small but systematic differences in the moral associations of words between conservative and liberal text sources, sufficient to distinguish text sources with above 85% classification accuracy. These findings suggest that partisan language differences are widespread and may contribute to political polarization.
    Abstract Polarization has increased substantially in political discourse, contributing to a widening partisan divide. In this paper, we analyzed large-scale, real-world language use in Reddit communities (294,476,146 comments) and in news outlets (6,749,781 articles) to uncover psychological dimensions along which partisan language is divided. Using word embedding models that captured semantic associations based on co-occurrences of words in vast textual corpora, we identified patterns of affective polarization present in natural political discourse. We then probed the semantic associations of words related to seven political topics (e.g., abortion, immigration) along the dimensions of morality (moral-to-immoral), threat (threatening-to-safe), and valence (pleasant-to-unpleasant). Across both Reddit communities and news outlets, we identified a small but systematic divergence in the moral associations of words between text sources with different partisan leanings. Moral associations of words were highly correlated between conservative and liberal text sources (average $\rho$ = 0.96), but the differences remained reliable to enable us to distinguish text sources along partisan lines with above 85% classification accuracy. These findings underscore that despite a shared moral understanding across the political spectrum, there are consistent differences that shape partisan language and potentially exacerbate political polarization. Our results, drawn from both informal interactions on social media and curated narratives in news outlets, indicate that these trends are widespread. Leveraging advanced computational techniques, this research offers a fresh perspective that complements traditional methods in political attitudes.
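Probing a word's moral association with embeddings can be sketched via a semantic axis (the toy 3-d vectors below are invented; in practice the vectors come from embedding models trained on the full corpora): define the axis as the difference between seed-word vectors and project topic words onto it with cosine similarity.

```python
# Semantic-axis sketch: words closer to "moral" than "immoral" get a positive
# projection; words closer to "immoral" get a negative one.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings for illustration only.
emb = {
    "moral":   [1.0, 0.2, 0.0],
    "immoral": [-1.0, 0.1, 0.0],
    "charity": [0.8, 0.3, 0.1],
    "fraud":   [-0.7, 0.2, 0.3],
}

# The morality axis is the seed-vector difference: "moral" minus "immoral".
axis = [a - b for a, b in zip(emb["moral"], emb["immoral"])]

for word in ("charity", "fraud"):
    print(word, round(cosine(emb[word], axis), 2))
```

Comparing such projections between conservative- and liberal-leaning corpora is what exposes the small, systematic divergences described above.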

RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification

  • paper_url: http://arxiv.org/abs/2310.09596
  • repo_url: https://github.com/junjie-ye/rethinkingtmsc
  • paper_authors: Junjie Ye, Jie Zhou, Junfeng Tian, Rui Wang, Qi Zhang, Tao Gui, Xuanjing Huang
  • for: This study investigates the importance of each modality and the effectiveness of multimodal fusion modules in Target-oriented Multimodal Sentiment Classification (TMSC), and whether existing datasets adequately support the research.
  • methods: The study conducts extensive experiments and in-depth analysis to answer three questions. Q1: Are the modalities equally important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3: Do existing datasets adequately support the research?
  • results: The experiments and analyses show that current TMSC systems rely primarily on the textual modality to determine targets' sentiments, so the authors point out several directions for the TMSC task in terms of model design and dataset construction.
    Abstract Recently, Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars. However, current multimodal models have reached a performance bottleneck. To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets to answer the following questions: Q1: Are the modalities equally important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3: Do existing datasets adequately support the research? Our experiments and analyses reveal that the current TMSC systems primarily rely on the textual modality, as most of targets' sentiments can be determined solely by text. Consequently, we point out several directions to work on for the TMSC task in terms of model design and dataset construction. The code and data can be found in https://github.com/Junjie-Ye/RethinkingTMSC.

Self-Detoxifying Language Models via Toxification Reversal

  • paper_url: http://arxiv.org/abs/2310.09573
  • repo_url: https://github.com/cooperleong00/toxificationreversal
  • paper_authors: Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li
  • for: Reducing the risk that pretrained language models (PLMs) generate offensive or harmful content, enabling safer deployment.
  • methods: The paper proposes a lightweight approach that enables the PLM itself to achieve "self-detoxification". The method builds on the observation that prepending a negative steering prompt induces a PLM to generate toxic content, and draws on recent interpretability research that formulates the evolving contextualized representations within a PLM as an information stream facilitated by the attention layers. On this basis, the authors identify the toxification direction from the normal generation process to the negatively prompted one, then steer generation in the reversed direction by manipulating the information movement within the attention layers.
  • results: Without any fine-tuning or extra components, the method achieves performance comparable to state-of-the-art methods.
    Abstract Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs) for safer deployment. Existing methods can be roughly categorized as finetuning-based and decoding-based. However, the former is often resource-intensive, while the latter relies on additional components and potentially compromises the generation fluency. In this paper, we propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification". Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content. At the same time, we are inspired by the recent research in the interpretability field, which formulates the evolving contextualized representations within the PLM as an information stream facilitated by the attention layers. Drawing on this idea, we devise a method to identify the toxification direction from the normal generation process to the one prompted with the negative prefix, and then steer the generation to the reversed direction by manipulating the information movement within the attention layers. Experimental results show that our approach, without any fine-tuning or extra components, can achieve comparable performance with state-of-the-art methods.
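The steering arithmetic can be sketched on toy vectors (a schematic only; the actual method manipulates information movement inside the attention layers, not raw hidden states like this): estimate the toxification direction as the shift the negative prompt induces, then move the representation the opposite way.

```python
# Toxification reversal sketch: h' = h_normal - alpha * (h_prompted - h_normal).

def steer(h_normal, h_prompted, alpha=1.0):
    """Move the representation against the prompt-induced toxification direction."""
    direction = [p - n for p, n in zip(h_prompted, h_normal)]
    return [n - alpha * d for n, d in zip(h_normal, direction)]

h_normal = [0.2, -0.1, 0.5]       # hidden state under normal generation (toy)
h_prompted = [0.6, -0.1, 0.1]     # hidden state with the negative steering prefix (toy)
print([round(x, 2) for x in steer(h_normal, h_prompted)])  # [-0.2, -0.1, 0.9]
```

The sign convention matters: adding the direction would toxify, so subtracting it steers generation toward the detoxified side.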

Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE

  • paper_url: http://arxiv.org/abs/2310.09550
  • repo_url: None
  • paper_authors: Yixuan Zhang, Haonan Li
  • for: Assessing the capability of large language models (LLMs) to comprehend ancient Chinese.
  • methods: Using the ACLUE evaluation benchmark, the authors assess eight state-of-the-art language models and compare their performance on ancient versus modern Chinese.
  • results: The evaluation finds that ChatGLM2 performs best, with an average score of 37.4%.
    Abstract Large language models (LLMs) have showcased remarkable capabilities in understanding and generating language. However, their ability in comprehending ancient languages, particularly ancient Chinese, remains largely unexplored. To bridge this gap, we present ACLUE, an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese. ACLUE consists of 15 tasks covering a range of skills, spanning phonetic, lexical, syntactic, semantic, inference and knowledge. Through the evaluation of eight state-of-the-art LLMs, we observed a noticeable disparity in their performance between modern Chinese and ancient Chinese. Among the assessed models, ChatGLM2 demonstrates the most remarkable performance, achieving an average score of 37.4%. We have made our code and data publicly available.

CarExpert: Leveraging Large Language Models for In-Car Conversational Question Answering

  • paper_url: http://arxiv.org/abs/2310.09536
  • repo_url: None
  • paper_authors: Md Rashad Al Hasan Rony, Christian Suess, Sinchana Ramakanth Bhat, Viju Sudhi, Julia Schneider, Maximilian Vogel, Roman Teucher, Ken E. Friedl, Soumya Sahoo
  • for: This work aims to improve the performance of large language models (LLMs) on domain-specific question answering and to address their limitations in this setting.
  • methods: The paper proposes CarExpert, an in-car conversational question-answering system that uses LLMs to control the input, provide domain-specific documents to the extractive and generative answering components, and control the output to ensure safe, domain-specific answers.
  • results: Compared with state-of-the-art LLMs, CarExpert excels at generating natural, safe, and car-specific answers.
    Abstract Large language models (LLMs) have demonstrated remarkable performance by following natural language instructions without fine-tuning them on domain-specific tasks and data. However, leveraging LLMs for domain-specific question answering suffers from severe limitations. The generated answer tends to hallucinate due to the training data collection time (when using off-the-shelf), complex user utterance and wrong retrieval (in retrieval-augmented generation). Furthermore, due to the lack of awareness about the domain and expected output, such LLMs may generate unexpected and unsafe answers that are not tailored to the target domain. In this paper, we propose CarExpert, an in-car retrieval-augmented conversational question-answering system leveraging LLMs for different tasks. Specifically, CarExpert employs LLMs to control the input, provide domain-specific documents to the extractive and generative answering components, and controls the output to ensure safe and domain-specific answers. A comprehensive empirical evaluation exhibits that CarExpert outperforms state-of-the-art LLMs in generating natural, safe and car-specific answers.
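The control flow described above can be sketched as a skeleton (every component below is a stub invented for illustration; CarExpert's actual scope filter, retrieval, and generation differ and are LLM-based):

```python
# Skeleton of a guarded retrieval-augmented QA pipeline:
# input control -> domain-document retrieval -> answering -> output control.

DOCS = ["To activate cruise control, press the SET button on the steering wheel."]

def in_scope(question):                       # input control (stub)
    return "car" in question.lower() or "cruise" in question.lower()

def retrieve(question):                       # keyword-overlap retrieval (stub)
    words = set(question.lower().split())
    return max(DOCS, key=lambda d: len(words & set(d.lower().split())))

def generate(question, doc):                  # answer generation (stub)
    return doc                                # a real system would call an LLM here

def answer(question):
    if not in_scope(question):                # reject out-of-domain input
        return "Sorry, I can only answer questions about this car."
    doc = retrieve(question)                  # provide domain-specific documents
    return generate(question, doc)            # a real system would also filter output

print(answer("How do I turn on cruise control?"))
```

The point of the skeleton is the layering: the LLM never answers unguarded, which is how unexpected or unsafe answers are kept out of the target domain.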

Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model

  • paper_url: http://arxiv.org/abs/2310.09520
  • repo_url: https://github.com/haikangdeng/RAD
  • paper_authors: Haikang Deng, Colin Raffel
  • for: This paper aims to improve the quality of language model generations and steer them toward desired attributes.
  • methods: The paper introduces Reward-Augmented Decoding (RAD), which uses a small unidirectional reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. Because the reward model is unidirectional, activations from prior generation steps can be cached, reducing computational overhead.
  • results: In experiments on generating non-toxic and sentiment-controlled text, RAD performs best among methods that change only the generation procedure and matches state-of-the-art methods that re-train the language model, while remaining effective on very large language models with minimal computational overhead.
    Abstract While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.
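One RAD-style decoding step can be sketched as follows (a simplification: the toy distribution, token names, and the exact way rewards are combined with top-k sampling are assumptions, not the paper's exact formulation):

```python
# Rescale next-token probabilities with per-token reward scores: add
# beta * reward in log space, then renormalize to get a new distribution.

import math

def rad_step(logprobs, rewards, beta=1.0):
    """logprobs: token -> LM log-probability; rewards: token -> reward score."""
    scores = {t: lp + beta * rewards[t] for t, lp in logprobs.items()}
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {t: math.exp(s - z) for t, s in scores.items()}

logprobs = {"great": math.log(0.5), "awful": math.log(0.5)}   # toy LM distribution
rewards = {"great": 1.0, "awful": 0.0}                        # toy reward model scores
probs = rad_step(logprobs, rewards, beta=1.0)
print(round(probs["great"], 3))  # e / (e + 1) ≈ 0.731
```

Raising `beta` strengthens the attribute control at the cost of drifting further from the base LM distribution.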

Attentive Multi-Layer Perceptron for Non-autoregressive Generation

  • paper_url: http://arxiv.org/abs/2310.09512
  • repo_url: https://github.com/shark-nlp/attentivemlp
  • paper_authors: Shuyang Jiang, Jun Zhang, Jiangtao Feng, Lin Zheng, Lingpeng Kong
  • for: This paper proposes an efficient non-autoregressive generation model, addressing the quadratic-complexity efficiency bottleneck of non-autoregressive generation.
  • methods: The paper introduces a novel MLP variant, Attentive Multi-Layer Perceptron (AMLP), which replaces static learnable projection matrices with adaptive projections computed from the inputs in an attentive mode, modeling relations among tokens with linear time and space complexity.
  • results: The NAR-AMLP architecture surpasses competitive efficient NAR models by a significant margin on text-to-speech synthesis and machine translation. Ablations find AMLP's self- and cross-attention comparable or superior to other efficient models, and the efficiency analysis shows a large reduction in memory cost against vanilla non-autoregressive models for long sequences.
    Abstract Autoregressive~(AR) generation almost dominates sequence generation for its efficacy. Recently, non-autoregressive~(NAR) generation gains increasing popularity for its efficiency and growing efficacy. However, its efficiency is still bottlenecked by quadratic complexity in sequence lengths, which is prohibitive for scaling to long sequence generation and few works have been done to mitigate this problem. In this paper, we propose a novel MLP variant, \textbf{A}ttentive \textbf{M}ulti-\textbf{L}ayer \textbf{P}erceptron~(AMLP), to produce a generation model with linear time and space complexity. Different from classic MLP with static and learnable projection matrices, AMLP leverages adaptive projections computed from inputs in an attentive mode. The sample-aware adaptive projections enable communications among tokens in a sequence, and model the measurement between the query and key space. Furthermore, we marry AMLP with popular NAR models, deriving a highly efficient NAR-AMLP architecture with linear time and space complexity. Empirical results show that such marriage architecture surpasses competitive efficient NAR models, by a significant margin on text-to-speech synthesis and machine translation. We also test AMLP's self- and cross-attention ability separately with extensive ablation experiments, and find them comparable or even superior to the other efficient models. The efficiency analysis further shows that AMLP extremely reduces the memory cost against vanilla non-autoregressive models for long sequences.
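AMLP's exact formulation is in the paper; as an illustration of how attention-like token mixing can reach linear complexity in sequence length, here is a kernelized linear-attention sketch: with a feature map phi, the quadratic grouping (phi(Q) phi(K)^T) V can be regrouped as phi(Q) (phi(K)^T V), costing O(n d^2) instead of O(n^2 d).

```python
# Linear-attention sketch (illustrative; not AMLP itself): the two groupings
# are mathematically identical, but the second never materializes the n x n map.

import numpy as np

def phi(x):                      # a positive feature map (elu(x) + 1)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    qf, kf = phi(q), phi(k)                    # (n, d)
    kv = kf.T @ v                              # (d, d): O(n d^2)
    z = qf @ kf.sum(axis=0)                    # (n,): per-row normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))

# Same result as the quadratic grouping, up to floating-point error.
att = phi(q) @ phi(k).T                        # (n, n): the O(n^2) path
quad = (att @ v) / att.sum(axis=1, keepdims=True)
print(np.allclose(linear_attention(q, k, v), quad))  # True
```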

DepNeCTI: Dependency-based Nested Compound Type Identification for Sanskrit

  • paper_url: http://arxiv.org/abs/2310.09501
  • repo_url: https://github.com/yaswanth-iitkgp/depnecti
  • paper_authors: Jivnesh Sandhan, Yaswanth Narsupalli, Sreevatsa Muppirala, Sriram Krishnan, Pavankumar Satuluri, Amba Kulkarni, Pawan Goyal
  • for: This work proposes a new task, nested compound type identification (NeCTI), which aims to identify nested spans of a multi-component Sanskrit compound and decode the implicit semantic relations between them.
  • methods: The authors present two newly annotated datasets (including an out-of-domain dataset), benchmark them against standard formulations such as nested named entity recognition, constituency parsing, and seq2seq, and propose a dependency-based framework named DepNeCTI.
  • results: On the NeCTI task, DepNeCTI surpasses the best baseline with an average absolute improvement of 13.1 points F1-score in Labeled Span Score (LSS) and a 5-fold enhancement in inference efficiency. The study also finds that context benefits the NeCTI task.
    Abstract Multi-component compounding is a prevalent phenomenon in Sanskrit, and understanding the implicit structure of a compound's components is crucial for deciphering its meaning. Earlier approaches in Sanskrit have focused on binary compounds and neglected the multi-component compound setting. This work introduces the novel task of nested compound type identification (NeCTI), which aims to identify nested spans of a multi-component compound and decode the implicit semantic relations between them. To the best of our knowledge, this is the first attempt in the field of lexical semantics to propose this task. We present 2 newly annotated datasets including an out-of-domain dataset for this task. We also benchmark these datasets by exploring the efficacy of the standard problem formulations such as nested named entity recognition, constituency parsing and seq2seq, etc. We present a novel framework named DepNeCTI: Dependency-based Nested Compound Type Identifier that surpasses the performance of the best baseline with an average absolute improvement of 13.1 points F1-score in terms of Labeled Span Score (LSS) and a 5-fold enhancement in inference efficiency. In line with the previous findings in the binary Sanskrit compound identification task, context provides benefits for the NeCTI task. The codebase and datasets are publicly available at: https://github.com/yaswanth-iitkgp/DepNeCTI

Computational analyses of linguistic features with schizophrenic and autistic traits along with formal thought disorders

  • paper_url: http://arxiv.org/abs/2310.09494
  • repo_url: None
  • paper_authors: Takeshi Saga, Hiroki Tanaka, Satoshi Nakamura
  • for: This study investigates how Formal Thought Disorder (FTD), a group of cognitive symptoms affecting language and thought seen in conditions such as Autism Spectrum Disorder (ASD) and Schizophrenia, manifests in language.
  • methods: The study collected a Japanese audio-report dataset with score labels related to ASD and Schizotypal Personality Disorder (SPD) through a crowd-sourcing service. Language characteristics were measured with the 2nd edition of the Social Responsiveness Scale (SRS2) and the Schizotypal Personality Questionnaire (SPQ), including SPQ's odd speech subscale to quantify FTD symptoms.
  • results: The odd speech subscale was significantly correlated with both the total SPQ and SRS scores, although those two were not themselves significantly correlated. Longer speech about a negative memory elicited more FTD symptoms. An ablation study confirmed the importance of function words and of abstract and temporal features for odd speech estimation, whereas content words were effective only in the SRS predictions, implying differences between SPD-like and ASD-like symptoms. Data and programs: https://sites.google.com/view/sagatake/resource.
    Abstract [See full abstract in the pdf] Formal Thought Disorder (FTD), which is a group of symptoms in cognition that affects language and thought, can be observed through language. FTD is seen across such developmental or psychiatric disorders as Autism Spectrum Disorder (ASD) or Schizophrenia, and its related Schizotypal Personality Disorder (SPD). This paper collected a Japanese audio-report dataset with score labels related to ASD and SPD through a crowd-sourcing service from the general population. We measured language characteristics with the 2nd edition of the Social Responsiveness Scale (SRS2) and the Schizotypal Personality Questionnaire (SPQ), including an odd speech subscale from SPQ to quantify the FTD symptoms. We investigated the following four research questions through machine-learning-based score predictions: (RQ1) How are schizotypal and autistic measures correlated? (RQ2) What is the most suitable task to elicit FTD symptoms? (RQ3) Does the length of speech affect the elicitation of FTD symptoms? (RQ4) Which features are critical for capturing FTD symptoms? We confirmed that an FTD-related subscale, odd speech, was significantly correlated with both the total SPQ and SRS scores, although they themselves were not correlated significantly. Our regression analysis indicated that longer speech about a negative memory elicited more FTD symptoms. The ablation study confirmed the importance of function words and both the abstract and temporal features for FTD-related odd speech estimation. In contrast, content words were effective only in the SRS predictions, and content words were effective only in the SPQ predictions, a result that implies the differences between SPD-like and ASD-like symptoms. Data and programs used in this paper can be found here: https://sites.google.com/view/sagatake/resource.
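One feature family from the ablation above, the proportion of function words in a transcript, can be sketched in a few lines (the word list and sample text are invented for illustration; the paper's feature set is richer and for Japanese):

```python
# Function-word ratio: the share of tokens drawn from a closed-class word list.

FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "it", "that", "was"}

def function_word_ratio(text):
    words = text.lower().split()
    return sum(w in FUNCTION_WORDS for w in words) / len(words)

sample = "It was the day that everything changed"
print(round(function_word_ratio(sample), 2))  # 4 of 7 words are function words
```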