cs.CL - 2023-08-14

Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI

paper_url: http://arxiv.org/abs/2308.07213
repo_url: None
paper_authors: Houjiang Liu, Anubrata Das, Alexander Boltz, Didi Zhou, Daisy Pinaroc, Matthew Lease, Min Kyung Lee
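for: This study aims to make professional fact-checking more efficient and scalable by co-designing AI tools with fact-checkers, so that the tools fit fact-checkers' needs and values.
methods: The study uses Matchmaking for AI, a co-design method in which fact-checkers, designers, and NLP researchers jointly explore which fact-checker needs technology should address, and how.
results: Co-design sessions with 22 professional fact-checkers produced 11 novel design ideas. These support information searching, processing, and writing tasks, help fact-checkers prepare for future misinformation, and help them monitor their own potential biases.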

Abstract
A key challenge in professional fact-checking is its limited scalability in relation to the magnitude of false information. While many Natural Language Processing (NLP) tools have been proposed to enhance fact-checking efficiency and scalability, both academic research and fact-checking organizations report limited adoption of such tooling due to insufficient alignment with fact-checker practices, values, and needs. To address this gap, we investigate a co-design method, Matchmaking for AI, which facilitates fact-checkers, designers, and NLP researchers to collaboratively discover what fact-checker needs should be addressed by technology and how. Our co-design sessions with 22 professional fact-checkers yielded a set of 11 novel design ideas. They assist in information searching, processing, and writing tasks for efficient and personalized fact-checking; help fact-checkers proactively prepare for future misinformation; monitor their potential biases; and support internal organization collaboration. Our work offers implications for human-centered fact-checking research and practice and AI co-design research.

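Summary
A key challenge in professional fact-checking is its limited scalability relative to the volume of false information. Although many natural language processing (NLP) tools have been proposed to improve fact-checking efficiency and scalability, both academic research and fact-checking organizations report low adoption of these tools, mainly because the technology does not fit fact-checkers' practices, values, and needs. To address this gap, the authors investigate a co-design method, Matchmaking for AI, in which fact-checkers, designers, and NLP researchers jointly explore which fact-checker needs technology should address, and how. Co-design sessions with 22 professional fact-checkers yielded 11 novel design ideas. These ideas help fact-checkers search, process, and write up information more efficiently; prepare in advance for future misinformation; monitor their potential biases; and collaborate within their organizations. The work offers implications for human-centered fact-checking research and practice and for AI co-design research.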

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

paper_url: http://arxiv.org/abs/2308.07201
repo_url: https://github.com/chanchimin/chateval
paper_authors: Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu
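for: The paper proposes a multi-agent evaluation method as an alternative to human evaluation, in which several LLMs collaborate to improve evaluation quality.
methods: The paper builds a multi-agent referee team, ChatEval, in which multiple LLM agents discuss and then evaluate the quality of generated responses, moving beyond single-agent prompting strategies.
results: The analysis indicates that ChatEval goes beyond plain textual scoring and offers a human-mimicking evaluation process for reliable assessments.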

Abstract
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.

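Summary
Text evaluation has historically been difficult and costly. With the emergence of large language models (LLMs), researchers have begun to explore LLMs as alternatives to human evaluation. Although these single-agent approaches show promise, experimental results suggest that further advances are needed to close the gap between their current effectiveness and human-level evaluation quality. Because best practice in human evaluation often involves several annotators collaborating, the authors adopt a multi-agent debate framework that moves beyond single-agent prompting strategies. This multi-agent approach lets a group of LLMs work together, drawing on their distinct capabilities and expertise to handle complex tasks more efficiently and effectively. The paper constructs a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of responses generated by different models. The analysis shows that ChatEval does more than score text: it offers a human-mimicking evaluation process for reliable assessments. The code is available at https://github.com/chanchimin/ChatEval.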
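The debate idea described in the abstract above can be sketched in a few lines. This is a minimal illustration, not ChatEval's implementation: the `Referee` class, the `stub_agent` function (a stand-in for a real LLM call), the persona names, the round-robin speaking order, and the mean-score aggregation are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical agent interface: (persona, question, answer, transcript) -> (comment, score)
AgentFn = Callable[[str, str, str, List[str]], Tuple[str, int]]


@dataclass
class Referee:
    persona: str
    respond: AgentFn


def stub_agent(persona: str, question: str, answer: str, transcript: List[str]) -> Tuple[str, int]:
    """Stand-in for an LLM call: scores the answer by word count, capped at 10."""
    score = min(10, len(answer.split()))
    comment = f"after reading {len(transcript)} earlier turn(s), I rate this {score}/10"
    return comment, score


def debate(referees: List[Referee], question: str, answer: str, rounds: int = 2):
    """Run a round-robin debate and return (transcript, mean of each referee's last score)."""
    transcript: List[str] = []
    last_score = {}
    for _ in range(rounds):
        for ref in referees:
            # Each referee sees the full discussion so far before speaking.
            comment, score = ref.respond(ref.persona, question, answer, transcript)
            transcript.append(f"{ref.persona}: {comment}")
            last_score[ref.persona] = score
    return transcript, mean(list(last_score.values()))


team = [Referee(p, stub_agent) for p in ("critic", "domain expert", "general reader")]
transcript, verdict = debate(team, "What is the capital of France?", "Paris is the capital of France")
print(len(transcript), verdict)
```

Swapping `stub_agent` for a function that prompts an actual model with the persona and the running transcript would turn this skeleton into a working referee team; how the turns are ordered and how the final verdict is aggregated are design choices the abstract does not specify.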

Incorporating Annotator Uncertainty into Representations of Discourse Relations

paper_url: http://arxiv.org/abs/2308.07179
repo_url: None
paper_authors: S. Magalí López Cortez, Cassandra L. Jacobs
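for: This study examines novice annotators' uncertainty when annotating discourse relations in spoken conversational data.
methods: The study uses dialogue context (a single turn, a pair of turns within a speaker, and a pair of turns across speakers) to predict annotators' confidence scores. From co-occurrence statistics it computes distributed representations of discourse relations and analyzes them with hierarchical clustering.
results: Incorporating annotators' uncertainty into the distributed representations of discourse relations coherently models annotators' confidence in their discourse relation labels.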

Abstract
Annotation of discourse relations is a known difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators' uncertainty on the annotation of discourse relations on spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute distributed representations of discourse relations from co-occurrence statistics that incorporate information about confidence scores and dialogue context. We perform a hierarchical clustering analysis using these representations and show that weighting discourse relation representations with information about confidence and dialogue context coherently models our annotators' uncertainty about discourse relation labels.

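Summary
Annotating discourse relations is a known difficult task, especially for non-expert annotators. This paper studies non-expert annotators' uncertainty when annotating discourse relations in spoken conversational data. Dialogue context (a single turn, a pair of turns within a speaker, and a pair of turns across speakers) turns out to be a significant predictor of confidence scores. The authors compute distributed representations of discourse relations from co-occurrence statistics and use these representations in a hierarchical clustering analysis. They find that weighting the discourse relation representations with confidence and dialogue-context information coherently models the annotators' uncertainty.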

Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

paper_url: http://arxiv.org/abs/2308.07120
repo_url: None
paper_authors: Alexandra Sasha Luccioni, Anna Rogers
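for: This position paper proposes a definition of large language models (LLMs) and examines claims about their functionality and potential, together with the existing evidence for and against those claims.
methods: The paper defines LLMs, makes explicit the assumptions commonly made about their functionality, and reviews the existing evidence for and against them.
results: The paper contributes a definition of LLMs, an overview of the evidence for and against common assumptions, and suggestions for future research directions.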

Abstract
Much of the recent discourse within the NLP research community has been centered around Large Language Models (LLMs), their functionality and potential -- yet not only do we not have a working definition of LLMs, but much of this discourse relies on claims and assumptions that are worth re-examining. This position paper contributes a definition of LLMs, explicates some of the assumptions made regarding their functionality, and outlines the existing evidence for and against them. We conclude with suggestions for research directions and their framing in future work.

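Summary
Much recent discussion in the NLP research community has centered on large language models (LLMs), their functionality, and their potential. Yet there is no working definition of LLMs, and much of this discussion rests on claims and assumptions that deserve re-examination. This position paper contributes a definition of LLMs, examines the assumptions made about their functionality, and summarizes the existing evidence. It concludes with suggestions for research directions and for how to frame them in future work.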

Large Language Models for Information Retrieval: A Survey

paper_url: http://arxiv.org/abs/2308.07107
repo_url: https://github.com/ruc-nlpir/llm4ir-survey
paper_authors: Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, Ji-Rong Wen
for: This paper focuses on the integration of large language models (LLMs) with information retrieval (IR) systems to improve the accuracy and efficiency of IR systems.
methods: The paper discusses ways of combining LLMs with IR systems, covering query rewriters, retrievers, rerankers, and readers. These methods aim to leverage the language understanding and generation capabilities of LLMs to improve IR performance.
results: The paper provides a comprehensive overview of the current state of research in this field and highlights promising directions for future research. It also discusses the challenges facing neural IR models, such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses.

Abstract
As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions within this expanding field.

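Summary
As a primary means of information acquisition, information retrieval (IR) systems such as search engines have become an important part of daily life. These systems also serve as components of dialogue, question-answering, and recommender systems. IR has evolved from its origins in term-based methods to its integration with advanced neural models. Although neural models capture complex contextual signals and semantic nuances, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution calls for combining traditional methods (such as term-based sparse retrieval) with modern neural architectures (such as language models). Meanwhile, large language models (LLMs) such as ChatGPT and GPT-4 have revolutionized natural language processing thanks to their remarkable language understanding, generation, generalization, and reasoning abilities. Recent research has therefore sought to use LLMs to improve IR systems. Given how rapidly this line of research is evolving, existing methods need to be consolidated and examined in detail. The survey covers the confluence of LLMs and IR systems, including query rewriters, retrievers, rerankers, and readers, and explores promising directions in this expanding field.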
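The four components named in the abstract above (query rewriter, retriever, reranker, reader) form a pipeline, which the following toy sketch makes concrete. Every component here is a deliberately simple stand-in chosen for illustration: the `SYNONYMS` table, the term-overlap scoring, and the three-document corpus are assumptions, and none of it reflects a specific method from the survey. In an LLM-based system each function would wrap a model call.

```python
from typing import List, Set

# Tiny illustrative corpus (hypothetical).
DOCS = [
    "Sparse retrieval matches query terms against an inverted index.",
    "Neural rerankers rescore a short candidate list.",
    "Bananas are rich in potassium.",
]

# Hypothetical expansion table used by the query rewriter stand-in.
SYNONYMS = {"ir": "retrieval"}


def tokens(text: str) -> Set[str]:
    """Lowercase, strip full stops, and split on whitespace."""
    return set(text.lower().replace(".", "").split())


def rewrite(query: str) -> str:
    """Query rewriter stand-in: expand known abbreviations token by token."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in query.lower().split())


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Retriever stand-in: keep the k documents sharing the most terms with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]


def rerank(query: str, candidates: List[str]) -> List[str]:
    """Reranker stand-in: term overlap normalised by document length."""
    q = tokens(query)
    return sorted(candidates, key=lambda d: len(q & tokens(d)) / len(tokens(d)), reverse=True)


def read(query: str, ranked: List[str]) -> str:
    """Reader stand-in: return the top passage; an LLM reader would generate an answer from it."""
    return ranked[0]


def answer(query: str) -> str:
    q = rewrite(query)
    return read(q, rerank(q, retrieve(q, DOCS)))


print(answer("sparse IR"))
```

The point of the staged design is that each stage can be replaced independently, for example keeping a fast term-based retriever while swapping only the reranker or the reader for a model-based one.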

Temporal Sentence Grounding in Streaming Videos

paper_url: http://arxiv.org/abs/2308.07102
repo_url: https://github.com/sczwangxiao/tsgvs-mm2023
paper_authors: Tian Gan, Xiao Wang, Yan Sun, Jianlong Wu, Qingpei Guo, Liqiang Nie
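for: This study tackles a new task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
methods: The authors propose two new methods: a TwinNet structure that helps the model learn about upcoming events, and a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames relevant to the query.
results: Extensive experiments on the ActivityNet Captions, TACoS, and MAD datasets demonstrate the superiority of the proposed methods. A systematic ablation study also confirms their effectiveness.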

Abstract
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source, and are always desired to be processed on-the-fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and process long historical frames effectively, which is untouched in the early methods. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.

Summary
The paper proposes two methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. Extensive experiments on the ActivityNet Captions, TACoS, and MAD datasets demonstrate the superiority of the proposed methods, and a systematic ablation study confirms their effectiveness.

Aesthetics of Sanskrit Poetry from the Perspective of Computational Linguistics: A Case Study Analysis on Siksastaka

paper_url: http://arxiv.org/abs/2308.07081
repo_url: https://github.com/sanskritshala/shikshastakam
paper_authors: Jivnesh Sandhan, Amruta Barbadikar, Malay Maity, Pavankumar Satuluri, Tushar Sandhan, Ravi M. Gupta, Pawan Goyal, Laxmidhar Behera
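for: This paper explores computational-linguistic methods for analyzing and classifying Sanskrit poetry, combining machine processing with expert input to uncover the hidden beauty of classical Sanskrit poems.
methods: The paper proposes an interpretable framework with a human-in-the-loop approach, in which machines and human experts work together to analyze and classify the qualities and characteristics of poetry.
results: Through analysis and annotation of the Sanskrit poem Siksastaka, the paper provides an in-depth analysis and a web application that future researchers can build on.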

Abstract
Sanskrit poetry has played a significant role in shaping the literary and cultural landscape of the Indian subcontinent for centuries. However, not much attention has been devoted to uncovering the hidden beauty of Sanskrit poetry in computational linguistics. This article explores the intersection of Sanskrit poetry and computational linguistics by proposing a roadmap of an interpretable framework to analyze and classify the qualities and characteristics of fine Sanskrit poetry. We discuss the rich tradition of Sanskrit poetry and the significance of computational linguistics in automatically identifying the characteristics of fine poetry. The proposed framework involves a human-in-the-loop approach that combines deterministic aspects delegated to machines and deep semantics left to human experts. We provide a deep analysis of Siksastaka, a Sanskrit poem, from the perspective of 6 prominent kavyashastra schools, to illustrate the proposed framework. Additionally, we provide compound, dependency, anvaya (prose order linearised form), meter, rasa (mood), alankar (figure of speech), and riti (writing style) annotations for Siksastaka and a web application to illustrate the poem's analysis and annotations. Our key contributions include the proposed framework, the analysis of Siksastaka, the annotations and the web application for future research. Link for interactive analysis: https://sanskritshala.github.io/shikshastakam/

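Summary
Sanskrit poetry has played an important role in the literary and cultural life of the Indian subcontinent, yet it has received little attention in computational linguistics. This article explores the intersection of Sanskrit poetry and computational linguistics and proposes an interpretable framework for analyzing and classifying the qualities and characteristics of fine Sanskrit poetry. It discusses the rich tradition of Sanskrit poetry and the significance of computational linguistics in automatically identifying the characteristics of fine poetry. The proposed framework takes a human-in-the-loop approach that delegates deterministic aspects to machines and leaves deep semantics to human experts. To illustrate the framework, the authors analyze the Sanskrit poem Siksastaka in depth from the perspective of six prominent kavyashastra schools. They also provide compound, dependency, anvaya (prose-order linearised form), meter, rasa (mood), alankar (figure of speech), and riti (writing style) annotations for Siksastaka, together with a web application, to support future research. Link for interactive analysis: https://sanskritshala.github.io/shikshastakam/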

Can Knowledge Graphs Simplify Text?

paper_url: http://arxiv.org/abs/2308.06975
repo_url: https://github.com/subhasmalik/Microsoft-azure-cognitive-services
paper_authors: Anthony Colas, Haodi Ma, Xuanli He, Yang Bai, Daisy Zhe Wang
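for: This paper studies unsupervised text simplification, using knowledge graph (KG) techniques to generate concise text that preserves the meaning of the original.
methods: The paper proposes KGSimple, a new approach that takes an iterative, sampling-based, KG-first route and uses KG-to-text generation to produce concise text while preserving the original meaning.
results: Evaluated on existing KG-to-text datasets, KGSimple is shown to be effective compared with unsupervised text simplification models that start from a given complex text. Code available on GitHub.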

Abstract
Knowledge Graph (KG)-to-Text Generation has seen recent improvements in generating fluent and informative sentences which describe a given KG. As KGs are widespread across multiple domains and contain important entity-relation information, and as text simplification aims to reduce the complexity of a text while preserving the meaning of the original text, we propose KGSimple, a novel approach to unsupervised text simplification which infuses KG-established techniques in order to construct a simplified KG path and generate a concise text which preserves the original input's meaning. Through an iterative and sampling KG-first approach, our model is capable of simplifying text when starting from a KG by learning to keep important information while harnessing KG-to-text generation to output fluent and descriptive sentences. We evaluate various settings of the KGSimple model on currently-available KG-to-text datasets, demonstrating its effectiveness compared to unsupervised text simplification models which start with a given complex text. Our code is available on GitHub.

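Summary
Knowledge graph (KG)-to-text generation has recently improved at producing fluent, informative sentences that describe a given KG. KGs are widespread across many domains and contain important entity-relation information, while text simplification aims to reduce the complexity of a text and preserve its original meaning. The authors therefore propose KGSimple, a new approach to unsupervised text simplification that uses KG-established techniques to construct a simplified KG path and generate concise text that preserves the meaning of the original input. With an iterative, sampling-based, KG-first approach, the model simplifies text starting from a KG: it learns to keep important information while using KG-to-text generation to output fluent, descriptive sentences. The authors evaluate several settings of the KGSimple model on currently available KG-to-text datasets and show that it is effective compared with unsupervised text simplification models that start from a given complex text. The code is available on GitHub.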

EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce

paper_url: http://arxiv.org/abs/2308.06966
repo_url: None
paper_authors: Yangning Li, Shirong Ma, Xiaobin Wang, Shen Huang, Chengyue Jiang, Hai-Tao Zheng, Pengjun Xie, Fei Huang, Yong Jiang
for: The paper addresses the challenge of using general language models for e-commerce tasks, and proposes a new dataset and a tailored model (EcomGPT) to improve performance on these tasks.
methods: The paper introduces a new dataset, EcomInstruct, which consists of 2.5 million instruction data and is designed to scale up data size and task diversity for e-commerce tasks. EcomGPT is obtained by training the backbone model BLOOMZ on this dataset, and the authors use a chain-of-task approach to improve the model's generalization capabilities.
results: The paper reports that EcomGPT outperforms ChatGPT in cross-dataset/task generalization on e-commerce tasks, as demonstrated through extensive experiments and human evaluations. The authors also attribute this to the fundamental semantic understanding capabilities that EcomGPT acquires through the chain-of-task approach.

Abstract
Recently, instruction-following Large Language Models (LLMs), represented by ChatGPT, have exhibited exceptional performance in general Natural Language Processing (NLP) tasks. However, the unique characteristics of E-commerce data pose significant challenges to general LLMs. An LLM tailored specifically for E-commerce scenarios, possessing robust cross-dataset/task generalization capabilities, is a pressing necessity. To solve this issue, in this work, we propose the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data. EcomInstruct scales up the data size and task diversity by constructing atomic tasks with E-commerce basic data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks. We developed EcomGPT with different parameter scales by training the backbone model BLOOMZ with EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities. Extensive experiments and human evaluations demonstrate that EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks.


Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation

paper_url: http://arxiv.org/abs/2308.06953
repo_url: None
paper_authors: David Heineman, Yao Dou, Wei Xu
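for: This work provides a reliable way to run fine-grained human evaluation of text generation tasks such as summarization, simplification, machine translation, and news generation.
methods: The work introduces a platform called Thresh, which lets users quickly build and test task-specific annotation interfaces, with every step completed in a single web browser window.
results: With Thresh, annotation interfaces for a wide range of NLP tasks can be built and tested quickly, and the platform offers deployment options ranging from small manual inspections to large-scale evaluation.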

Abstract
Fine-grained, span-level human evaluation has emerged as a reliable and robust method for evaluating text generation tasks such as summarization, simplification, machine translation and news generation, and the derived annotations have been useful for training automatic metrics and improving language models. However, existing annotation tools implemented for these evaluation frameworks lack the adaptability to be extended to different domains or languages, or modify annotation settings according to user needs. And the absence of a unified annotated data format inhibits the research in multi-task learning. In this paper, we introduce Thresh, a unified, customizable and deployable platform for fine-grained evaluation. By simply creating a YAML configuration file, users can build and test an annotation interface for any framework within minutes -- all in one web browser window. To facilitate collaboration and sharing, Thresh provides a community hub that hosts a collection of fine-grained frameworks and corresponding annotations made and collected by the community, covering a wide range of NLP tasks. For deployment, Thresh offers multiple options for any scale of annotation projects from small manual inspections to large crowdsourcing ones. Additionally, we introduce a Python library to streamline the entire process from typology design and deployment to annotation processing. Thresh is publicly accessible at https://thresh.tools.

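Summary
Fine-grained, span-level human evaluation has become a reliable and robust way to evaluate text generation tasks such as summarization, simplification, translation, and news generation. However, existing annotation tools cannot easily be extended to different domains or languages, nor can their annotation settings be modified to suit user needs. In addition, the lack of a unified annotated data format limits research in multi-task learning. This paper introduces Thresh, a unified, customizable, and deployable platform for fine-grained evaluation. By creating a YAML configuration file, users can quickly build and test an annotation interface for any framework in a browser window, and they can share and collaborate through a community hub. Thresh offers multiple deployment options, including for large-scale annotation projects. The authors also provide a Python library that streamlines the whole process from typology design and deployment to annotation processing. Thresh is publicly accessible at https://thresh.tools.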

Automated Testing and Improvement of Named Entity Recognition Systems

paper_url: http://arxiv.org/abs/2308.07937
repo_url: None
paper_authors: Boxi Yu, Yiyan Hu, Qiuyang Mang, Wenhan Hu, Pinjia He
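for: Improve the reliability and accuracy of Named Entity Recognition (NER) systems so that they are more dependable across natural language processing applications.
methods: The paper proposes a novel, widely applicable approach for automatically testing and repairing various NER systems.
results: Tests on two state-of-the-art (SOTA) NER models and two commercial NER APIs show that automated testing and repairing can effectively improve the accuracy and reliability of NER systems.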

Abstract
Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, which successfully repairs 1,056 out of the 1,877 reported NER errors.

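Summary
The key idea behind TIN's automated testing is that the NER predictions for the same named entities under similar contexts should be identical. The core idea behind its automated repairing is that similar named entities should receive the same NER prediction under the same context. The authors use TIN to test two SOTA NER models and two commercial NER APIs, namely Azure NER and AWS NER. They manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, giving high precision (85.0%-93.4%) across four categories of NER errors. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems, successfully repairing 1,056 of the reported NER errors.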
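The testing idea stated above (the same entity in similar contexts should receive the same label) is a form of consistency check that can be sketched as follows. This is an illustration of the principle only, not TIN's implementation: `toy_ner` is a deliberately flawed stand-in for a real NER system, and the hand-written context templates are an assumption, since the abstract does not say how TIN generates similar contexts.

```python
from typing import Callable, Dict, List

NerFn = Callable[[str], Dict[str, str]]  # sentence -> {entity text: label}


def toy_ner(sentence: str) -> Dict[str, str]:
    """Deliberately flawed stand-in for a real NER system."""
    labels: Dict[str, str] = {}
    if "Alice" in sentence:
        # Injected bug: the label depends on the surrounding words, not on the entity.
        labels["Alice"] = "CHEMICAL" if "mixed" in sentence else "PERSON"
    if "Paris" in sentence:
        labels["Paris"] = "LOCATION"
    return labels


def find_inconsistencies(ner: NerFn, entity: str, contexts: List[str]) -> Dict[str, str]:
    """Return {sentence: label} if the entity receives more than one distinct label, else {}."""
    observed: Dict[str, str] = {}
    for template in contexts:
        sentence = template.format(entity=entity)
        # "O" marks an entity the system failed to label at all (an omission).
        observed[sentence] = ner(sentence).get(entity, "O")
    return observed if len(set(observed.values())) > 1 else {}


similar_contexts = [
    "{entity} visited the lab yesterday.",
    "{entity} mixed the samples yesterday.",
]
report = find_inconsistencies(toy_ner, "Alice", similar_contexts)
for sentence, label in report.items():
    print(f"{label:9s} {sentence}")
```

A non-empty report does not say which label is wrong, only that the system disagrees with itself; that is why the authors still verify reported issues manually when measuring precision.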

CausalLM is not optimal for in-context learning

paper_url: http://arxiv.org/abs/2308.06912
repo_url: None
paper_authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
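for: This study aims to explain the performance gap between prefix language models (prefixLM) and causal language models (causalLM) in in-context learning.
methods: The study takes a theoretical approach, analyzing prefixLM and causalLM under a certain parameter construction, and verifies its theoretical conclusions with experiments on synthetic and real tasks.
results: In linear regression problems, prefixLM converges to the optimal solution, whereas the convergence dynamics of causalLM resemble those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. Empirical experiments show that causalLM consistently underperforms prefixLM in all settings.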

Abstract
Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples to attend to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.

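Summary
Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which all in-context samples can attend to each other, than with a causal language model (causalLM), whose auto-regressive attention prevents in-context samples from attending to future samples. Although this result is intuitive, it is not understood from a theoretical perspective. This paper analyzes the convergence behavior of prefixLM and causalLM theoretically. The analysis shows that, under a certain parameter construction, both LM types converge to their stationary points at a linear rate. However, prefixLM converges to the optimal solution of linear regression, whereas the convergence dynamics of causalLM resemble those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. The authors supplement their theoretical claims with empirical experiments over different tasks and various types of transformers. The experiments show that causalLM consistently underperforms prefixLM in all settings.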
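The difference between the two attention patterns described above is easiest to see as masks. The sketch below builds both as boolean matrices in plain Python; the function names and the choice of three in-context samples plus one query position are assumptions made for the illustration, not notation from the paper.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when position i may attend to position j


def causal_mask(n: int) -> Mask:
    """Auto-regressive attention: each position sees itself and earlier positions only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n: int, prefix_len: int) -> Mask:
    """Prefix attention: positions inside the prefix all see each other;
    positions after the prefix remain auto-regressive."""
    return [
        [(j < prefix_len) if i < prefix_len else (j <= i) for j in range(n)]
        for i in range(n)
    ]


def show(mask: Mask) -> str:
    return "\n".join(" ".join("x" if cell else "." for cell in row) for row in mask)


n, prefix_len = 4, 3  # three in-context samples followed by one query position
print("causalLM:")
print(show(causal_mask(n)))
print("prefixLM:")
print(show(prefix_mask(n, prefix_len)))
```

Under the causal mask the first in-context sample never sees the later ones, so its representation is computed from partial information, which matches the abstract's comparison with an online algorithm that processes samples one at a time. Under the prefix mask every in-context sample is computed with access to the full set. The query row is identical in both masks.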

GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

paper_url: http://arxiv.org/abs/2308.06911
repo_url: None
paper_authors: Pengfei Liu, Yiming Ren, Zhixiang Ren
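for: This paper develops a multi-modal language model that captures the abundant and complex information in molecular data.
methods: The paper uses a new architecture, GIT-Former, which maps all modalities into a unified latent space.
results: The paper presents an any-to-language molecular translation strategy and reports a 10%-15% improvement in molecular captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost in molecule generation validity.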

Abstract
Large language models have made significant strides in natural language processing, paving the way for innovative applications including molecular representation and generation. However, most existing single-modality approaches cannot capture the abundant and complex information in molecular data. Here, we introduce GIT-Mol, a multi-modal large language model that integrates the structure Graph, Image, and Text information, including the Simplified Molecular Input Line Entry System (SMILES) and molecular captions. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of mapping all modalities into a unified latent space. Our study develops an innovative any-to-language molecular translation strategy and achieves a 10%-15% improvement in molecular captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost in molecule generation validity compared to baseline or single-modality models.

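Summary
Large language models have made significant progress in natural language processing, opening up new applications including molecular representation and generation. However, existing single-modality approaches generally cannot capture the abundant and complex information in molecular data. The authors introduce GIT-Mol, a multi-modal large language model that integrates structure graph, image, and text information, including the Simplified Molecular Input Line Entry System (SMILES) and molecular captions. To integrate multi-modal molecular data, they propose the GIT-Former architecture, which maps all modalities into a unified latent space. The study develops an any-to-language molecular translation strategy and achieves a 10%-15% improvement in molecular captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost in molecule generation validity compared to baseline or single-modality models.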

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

paper_url: http://arxiv.org/abs/2308.06873
repo_url: None
paper_authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
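for: This paper studies a speech generation model that supports high-quality zero-shot text-to-speech (TTS) and can also handle a variety of speech transformation tasks, including noise suppression, target speaker extraction, and speech editing.
methods: The paper proposes SpeechX, which combines neural codec language modeling with multi-task learning to provide unified and extensible modeling of speech transformation tasks.
results: Experiments show that SpeechX performs well across tasks, including zero-shot TTS, noise suppression, target speaker extraction, and speech editing, achieving performance comparable or superior to specialized models.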

Abstract
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

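Summary
Recent generative speech models based on audio-text prompts have enabled high-quality zero-shot text-to-speech. However, existing models still face limitations on diverse audio-text speech generation tasks, including processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, which yields a unified and extensible model and a consistent way of using textual input in speech enhancement and transformation tasks. Experimental results show that SpeechX performs well across tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing, matching or exceeding specialized models. See https://aka.ms/speechx for demo samples.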

2023-08-14

Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Incorporating Annotator Uncertainty into Representations of Discourse Relations

Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

Large Language Models for Information Retrieval: A Survey

Temporal Sentence Grounding in Streaming Videos

Aesthetics of Sanskrit Poetry from the Perspective of Computational Linguistics: A Case Study Analysis on Siksastaka

Can Knowledge Graphs Simplify Text?

EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce

Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation

Automated Testing and Improvement of Named Entity Recognition Systems

CausalLM is not optimal for in-context learning

GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer