cs.CL - 2023-10-09

GPT-who: An Information Density-based Machine-Generated Text Detector

  • paper_url: http://arxiv.org/abs/2310.06202
  • repo_url: None
  • paper_authors: Saranya Venkatraman, Adaku Uchendu, Dongwon Lee
  • for: This work examines whether the Uniform Information Density (UID) principle can capture differences between Large Language Models (LLMs) and human-generated text, and proposes GPT-who, a psycholinguistically-aware, multi-class, domain-agnostic statistical detector.
  • methods: The detector uses UID-based features to model the distinct statistical signature of each LLM and human author, enabling accurate authorship attribution.
  • results: Evaluated on 4 large-scale benchmark datasets, GPT-who outperforms state-of-the-art statistical and non-statistical detectors (GLTR, GPTZero, OpenAI detector, and ZeroGPT) by over 20% across domains, while being computationally inexpensive and using an interpretable representation of text.
    Abstract The Uniform Information Density principle posits that humans prefer to spread information evenly during language production. In this work, we examine if the UID principle can help capture differences between Large Language Models (LLMs) and human-generated text. We propose GPT-who, the first psycholinguistically-aware multi-class domain-agnostic statistical-based detector. This detector employs UID-based features to model the unique statistical signature of each LLM and human author for accurate authorship attribution. We evaluate our method using 4 large-scale benchmark datasets and find that GPT-who outperforms state-of-the-art detectors (both statistical- & non-statistical-based) such as GLTR, GPTZero, OpenAI detector, and ZeroGPT by over $20$% across domains. In addition to superior performance, it is computationally inexpensive and utilizes an interpretable representation of text articles. We present the largest analysis of the UID-based representations of human and machine-generated texts (over 400k articles) to demonstrate how authors distribute information differently, and in ways that enable their detection using an off-the-shelf LM without any fine-tuning. We find that GPT-who can distinguish texts generated by very sophisticated LLMs, even when the overlying text is indiscernible.
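    A minimal sketch of UID-style surprisal features using an off-the-shelf LM, in line with the abstract's claim that no fine-tuning is needed (GPT-2 and the exact statistics below are assumptions, not the paper's specification; the resulting features would feed a downstream authorship classifier):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def uid_features(text: str) -> dict:
    """Token surprisals plus simple UID statistics (hypothetical feature set)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of token t is -log p(token_t | tokens_<t).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    mean = surprisal.mean()
    return {
        "mean_surprisal": mean.item(),
        "uid_variance": ((surprisal - mean) ** 2).mean().item(),            # global evenness
        "uid_local": ((surprisal[1:] - surprisal[:-1]) ** 2).mean().item(),  # local jumps
    }
```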

Compressing Context to Enhance Inference Efficiency of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.06201
  • repo_url: https://github.com/liyucheng09/selective_context
  • paper_authors: Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin
  • for: Improve the inference efficiency of large language models (LLMs) by addressing the increased computation, memory use, and potential context truncation caused by long documents and extended conversations.
  • methods: Proposes a method called Selective Context that identifies and prunes redundancy in the input context to make the input more compact.
  • results: Experiments show Selective Context significantly reduces memory cost and generation latency while maintaining performance comparable to using the full context: a 50% reduction in context cost, 36% less inference memory, and 32% less inference time, with drops of only 0.023 in BERTscore and 0.038 in faithfulness.
    Abstract Large language models (LLMs) achieved remarkable performance across various tasks. However, they face challenges in managing long documents and extended conversations, due to significantly increased computational requirements, both in memory and inference time, and potential context truncation when the input exceeds the LLM's fixed context length. This paper proposes a method called Selective Context that enhances the inference efficiency of LLMs by identifying and pruning redundancy in the input context to make the input more compact. We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations, on tasks of summarisation, question answering, and response generation. Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency while maintaining comparable performance compared to that achieved when full context is used. Specifically, we achieve a 50\% reduction in context cost, resulting in a 36\% reduction in inference memory usage and a 32\% reduction in inference time, while observing only a minor drop of .023 in BERTscore and .038 in faithfulness on four downstream applications, indicating that our method strikes a good balance between efficiency and performance.
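    A toy sketch of self-information-based pruning in the spirit of Selective Context (the official implementation at the repo_url above operates over lexical units; this simplified version scores and drops individual tokens):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def prune_context(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the least informative tokens, keeping `keep_ratio` of them."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Self-information I(t) = -log p(t | preceding tokens); low values mark redundancy.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    info = -log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    k = max(1, int(keep_ratio * info.numel()))
    keep = torch.zeros_like(info, dtype=torch.bool)
    keep[info.topk(k).indices] = True
    kept = [ids[0, 0].item()] + [t.item() for t, m in zip(ids[0, 1:], keep) if m]
    return tokenizer.decode(kept)
```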

The Importance of Prompt Tuning for Automated Neuron Explanations

  • paper_url: http://arxiv.org/abs/2310.06200
  • repo_url: None
  • paper_authors: Justin Lee, Tuomas Oikarinen, Arjun Chatha, Keng-Chi Chang, Yilan Chen, Tsui-Wei Weng
  • for: Understand large language models (LLMs) more deeply by studying what their individual neurons do, with a view to model understanding and safety.
  • methods: Building on prior work that uses LLMs such as GPT-4 to explain each neuron's role, the authors analyze the effect of the prompt used to generate explanations and reformat it in a more natural way.
  • results: Three different evaluation approaches, combining automated and human evaluation, show that the new prompts substantially improve neuron explanation quality while greatly reducing computational cost.
    Abstract Recent advances have greatly increased the capabilities of large language models (LLMs), but our understanding of the models and their safety has not progressed as fast. In this paper we aim to understand LLMs deeper by studying their individual neurons. We build upon previous work showing large language models such as GPT-4 can be useful in explaining what each neuron in a language model does. Specifically, we analyze the effect of the prompt used to generate explanations and show that reformatting the explanation prompt in a more natural way can significantly improve neuron explanation quality and greatly reduce computational cost. We demonstrate the effects of our new prompts in three different ways, incorporating both automated and human evaluations.

BYOC: Personalized Few-Shot Classification with Co-Authored Class Descriptions

  • paper_url: http://arxiv.org/abs/2310.06111
  • repo_url: None
  • paper_authors: Arth Bohra, Govert Verkes, Artem Harutyunyan, Pascal Weinberger, Giovanni Campagna
  • for: This paper proposes an approach that lets end-users build personalized text classifiers tailored to their own needs.
  • methods: The method prompts a large language model (LLM) with descriptions of each class's salient features, co-authored interactively: while the user annotates each few-shot example, the LLM asks relevant questions that the user answers, and the examples, questions, and answers are summarized into the classification prompt.
  • results: Experiments show the approach reaches within 82% of the performance of models trained on significantly larger datasets while using only 1% of their training data; in a study with 30 participants, the personalized classifiers average 90% accuracy, 15% higher than the state-of-the-art approach.
    Abstract Text classification is a well-studied and versatile building block for many NLP applications. Yet, existing approaches require either large annotated corpora to train a model with or, when using large language models as a base, require carefully crafting the prompt as well as using a long context that can fit many examples. As a result, it is not possible for end-users to build classifiers for themselves. To address this issue, we propose a novel approach to few-shot text classification using an LLM. Rather than few-shot examples, the LLM is prompted with descriptions of the salient features of each class. These descriptions are coauthored by the user and the LLM interactively: while the user annotates each few-shot example, the LLM asks relevant questions that the user answers. Examples, questions, and answers are summarized to form the classification prompt. Our experiments show that our approach yields high accuracy classifiers, within 82% of the performance of models trained with significantly larger datasets while using only 1% of their training sets. Additionally, in a study with 30 participants, we show that end-users are able to build classifiers to suit their specific needs. The personalized classifiers show an average accuracy of 90%, which is 15% higher than the state-of-the-art approach.
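    A minimal sketch of how the co-authored class descriptions might be assembled into a classification prompt (field names and instruction wording are illustrative assumptions, not the paper's exact format):

```python
def build_classification_prompt(class_descriptions: dict, text: str) -> str:
    """class_descriptions maps class name -> salient-feature description
    summarized from the user/LLM interaction (examples, questions, answers)."""
    lines = ["Classify the text into exactly one of the following classes.", ""]
    for name, description in class_descriptions.items():
        lines.append(f"Class '{name}': {description}")
    lines += ["", f"Text: {text}", "Answer with the class name only."]
    return "\n".join(lines)

# Example usage with two hypothetical user-defined classes:
prompt = build_classification_prompt(
    {"urgent": "asks for immediate action, mentions deadlines or emergencies",
     "routine": "informational updates with no required action"},
    "The server room is flooding, please respond ASAP.",
)
```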

Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

  • paper_url: http://arxiv.org/abs/2310.06103
  • repo_url: https://github.com/digitalphonetics/multilingual-seq2seq-slu
  • paper_authors: Pavel Denisov, Ngoc Thang Vu
  • for: This paper proposes a unified method for end-to-end spoken language understanding (E2E-SLU) in a multilingual setting, covering tasks that require predicting lexical fillers such as slot filling.
  • methods: The method integrates multilingual pretrained speech and text models into a single generative model that performs E2E-SLU, further improved by pretraining on widely available speech recognition data with several training objectives.
  • results: Pretraining on 7000 hours of multilingual data lets the model outperform the state of the art on two SLU datasets and partly on two more; it also shows cross-lingual capability, improving the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.
    Abstract A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models, however their evaluation often lacks multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7000 hours of multilingual data allows us to outperform the state-of-the-art ultimately on two SLU datasets and partly on two more SLU datasets. Finally, we examine the cross-lingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.

Auditing Gender Analyzers on Text Data

  • paper_url: http://arxiv.org/abs/2310.06061
  • repo_url: None
  • paper_authors: Siddharth D Jaiswal, Ankit Kumar Verma, Animesh Mukherjee
  • for: This study aims to audit existing gender analyzers for biases against non-binary individuals.
  • methods: The study uses two datasets (Reddit comments and Tumblr posts) and fine-tunes a BERT multi-label classifier on these datasets to evaluate the accuracy of the gender analyzers.
  • results: The study finds that the existing gender analyzers are highly inaccurate, with an overall accuracy of ~50% on all platforms. The fine-tuned BERT model achieves an overall performance of ~77% on the most realistically deployable setting and a surprisingly higher performance of 90% for the non-binary class. Additionally, the study shows that ChatGPT, a highly advanced AI model, is also biased and needs better audits and moderation.
    Abstract AI models have become extremely popular and accessible to the general public. However, they are continuously under the scanner due to their demonstrable biases toward various sections of the society like people of color and non-binary people. In this study, we audit three existing gender analyzers -- uClassify, Readable and HackerFactor, for biases against non-binary individuals. These tools are designed to predict only the cisgender binary labels, which leads to discrimination against non-binary members of the society. We curate two datasets -- Reddit comments (660k) and, Tumblr posts (2.05M) and our experimental evaluation shows that the tools are highly inaccurate with the overall accuracy being ~50% on all platforms. Predictions for non-binary comments on all platforms are mostly female, thus propagating the societal bias that non-binary individuals are effeminate. To address this, we fine-tune a BERT multi-label classifier on the two datasets in multiple combinations, observe an overall performance of ~77% on the most realistically deployable setting and a surprisingly higher performance of 90% for the non-binary class. We also audit ChatGPT using zero-shot prompts on a small dataset (due to high pricing) and observe an average accuracy of 58% for Reddit and Tumblr combined (with overall better results for Reddit). Thus, we show that existing systems, including highly advanced ones like ChatGPT are biased, and need better audits and moderation and, that such societal biases can be addressed and alleviated through simple off-the-shelf models like BERT trained on more gender inclusive datasets.
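    A minimal sketch of the kind of BERT multi-label setup the study fine-tunes (the label inventory and hyperparameters here are illustrative assumptions):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["male", "female", "non-binary"]  # assumed label inventory
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE loss instead of softmax CE
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, label_vectors):
    """One step on a batch of (text, multi-hot label vector) pairs."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(label_vectors, dtype=torch.float))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```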

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

  • paper_url: http://arxiv.org/abs/2310.05919
  • repo_url: None
  • paper_authors: Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu
  • for: Address the persistent challenge of limited labeled data in spoken language understanding tasks.
  • methods: Use a speech-text model pre-trained with a shared representation space, and transfer models fine-tuned on text to speech test data.
  • results: With as little as 1 hour of labeled speech data, the approach matches the performance on sentiment analysis and named entity recognition of previous methods that fine-tune speech-only pre-trained models on 10 times more data.
    Abstract Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data. With as little as 1 hour of labeled speech data, our proposed approach achieves comparable performance on spoken language understanding tasks (specifically, sentiment analysis and named entity recognition) when compared to previous methods using speech-only pre-trained models fine-tuned on 10 times more data. Beyond the proof-of-concept study, we also analyze the latent representations. We find that the bottom layers of speech-text models are largely task-agnostic and align speech and text representations into a shared space, while the top layers are more task-specific.

NEFTune: Noisy Embeddings Improve Instruction Finetuning

  • paper_url: http://arxiv.org/abs/2310.05914
  • repo_url: https://github.com/neelsjain/neftune
  • paper_authors: Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
  • for: Improve the performance of language model finetuning.
  • methods: Add random noise to the embedding vectors during training.
  • results: 1. LLaMA-2-7B finetuned with noisy embeddings rises from 29.79% to 64.69% on AlpacaEval; 2. NEFTune also beats strong baselines on modern instruction datasets, improving by about 10% on Evol-Instruct and by 8% on both ShareGPT and OpenPlatypus.
    Abstract We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
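    A minimal sketch of the NEFTune augmentation (see the repo_url above for the official code): uniform noise scaled by alpha / sqrt(L * d) is added to the token embeddings during finetuning only, with alpha a tunable hyperparameter:

```python
import torch

def neftune_embed(embed_tokens, input_ids, alpha=5.0, training=True):
    """Wrap an embedding layer, returning noised embeddings while training."""
    embeds = embed_tokens(input_ids)  # (batch, L, d)
    if training:
        L, d = embeds.shape[1], embeds.shape[2]
        scale = alpha / (L * d) ** 0.5
        # Uniform noise in [-1, 1], scaled down with sequence length and
        # embedding width so its overall magnitude stays controlled.
        embeds = embeds + torch.zeros_like(embeds).uniform_(-1, 1) * scale
    return embeds
```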

Controllable Chest X-Ray Report Generation from Longitudinal Representations

  • paper_url: http://arxiv.org/abs/2310.05881
  • repo_url: None
  • paper_authors: Francesco Dalla Serra, Chaoyang Wang, Fani Deligianni, Jeffrey Dalton, Alison Q O’Neil
  • for: This paper aims to speed up radiology reporting and improve its accuracy while providing a controllable report generation model.
  • methods: Two novel components are proposed: (1) longitudinal representation learning, which takes the prior scan as an additional input and aligns, concatenates, and fuses the current and prior visual information into a joint longitudinal representation for the multimodal report generation model; (2) sentence-anatomy dropout, a training strategy for controllability in which the generator is trained to predict only the sentences of the original report that correspond to the subset of anatomical regions given as input (see the sketch below).
  • results: In-depth experiments on the MIMIC-CXR dataset show the approach achieves state-of-the-art results while enabling anatomy-wise controllable report generation.
    Abstract Radiology reports are detailed text descriptions of the content of medical scans. Each report describes the presence/absence and location of relevant clinical findings, commonly including comparison with prior exams of the same patient to describe how they evolved. Radiology reporting is a time-consuming process, and scan results are often subject to delays. One strategy to speed up reporting is to integrate automated reporting systems, however clinical deployment requires high accuracy and interpretability. Previous approaches to automated radiology reporting generally do not provide the prior study as input, precluding comparison which is required for clinical accuracy in some types of scans, and offer only unreliable methods of interpretability. Therefore, leveraging an existing visual input format of anatomical tokens, we introduce two novel aspects: (1) longitudinal representation learning -- we input the prior scan as an additional input, proposing a method to align, concatenate and fuse the current and prior visual information into a joint longitudinal representation which can be provided to the multimodal report generation model; (2) sentence-anatomy dropout -- a training strategy for controllability in which the report generator model is trained to predict only sentences from the original report which correspond to the subset of anatomical regions given as input. We show through in-depth experiments on the MIMIC-CXR dataset how the proposed approach achieves state-of-the-art results while enabling anatomy-wise controllable report generation.
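    A minimal sketch of sentence-anatomy dropout as described above: during training, a random subset of anatomical regions is kept and only the report sentences mapped to those regions remain as targets (the sentence-to-anatomy mapping is assumed given):

```python
import random

def sentence_anatomy_dropout(sentences, sentence_regions, keep_prob=0.7):
    """sentences[i] is a report sentence; sentence_regions[i] is the set of
    anatomical tokens it describes. Returns (kept regions, target sentences)."""
    all_regions = set().union(*sentence_regions)
    kept = {r for r in all_regions if random.random() < keep_prob} or all_regions
    targets = [s for s, regions in zip(sentences, sentence_regions) if regions <= kept]
    return kept, targets
```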

Are Large Language Models Geospatially Knowledgeable?

  • paper_url: http://arxiv.org/abs/2310.13002
  • repo_url: None
  • paper_authors: Prabin Bhandari, Antonios Anastasopoulos, Dieter Pfoser
  • for: Investigate the extent of geospatial knowledge and reasoning abilities encoded in pre-trained Large Language Models (LLMs).
  • methods: Probe LLMs for geo-coordinates, use geospatial and non-geospatial prepositions to gauge geospatial awareness, and run a multidimensional scaling (MDS) experiment to assess geospatial reasoning and locate cities based on prompting.
  • results: Larger and more sophisticated LLMs can synthesize geospatial knowledge from textual information, but their geospatial abilities remain limited.
    Abstract Despite the impressive performance of Large Language Models (LLM) for various natural language processing tasks, little is known about their comprehension of geographic data and related ability to facilitate informed geospatial decision-making. This paper investigates the extent of geospatial knowledge, awareness, and reasoning abilities encoded within such pretrained LLMs. With a focus on autoregressive language models, we devise experimental approaches related to (i) probing LLMs for geo-coordinates to assess geospatial knowledge, (ii) using geospatial and non-geospatial prepositions to gauge their geospatial awareness, and (iii) utilizing a multidimensional scaling (MDS) experiment to assess the models' geospatial reasoning capabilities and to determine locations of cities based on prompting. Our results confirm that it does not only take larger, but also more sophisticated LLMs to synthesize geospatial knowledge from textual information. As such, this research contributes to understanding the potential and limitations of LLMs in dealing with geospatial information.
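    A minimal sketch of the MDS-style probe described above: elicit pairwise city distances from an LLM and embed them in 2D (query_llm_distance is a hypothetical stand-in for whatever prompting interface is used):

```python
import numpy as np
from sklearn.manifold import MDS

def probe_geospatial(cities, query_llm_distance):
    """Recover 2D city locations from LLM-reported pairwise distances."""
    n = len(cities)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = query_llm_distance(cities[i], cities[j])
    # dissimilarity="precomputed": we pass a distance matrix, not raw features.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    return dict(zip(cities, coords))
```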

Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting

  • paper_url: http://arxiv.org/abs/2310.05824
  • repo_url: None
  • paper_authors: Nikolay Bogoychev, Pinzhen Chen
  • for: Terminology correctness is important in downstream applications of machine translation, and a prevalent way to ensure it is to inject terminology constraints into the translation system.
  • methods: For the WMT 2023 terminology translation task, the authors adopt a translate-then-refine approach that is domain-independent and requires minimal manual effort: pseudo-terminology translations obtained from word alignment are used to annotate random source words and train a terminology-aware model, and two post-processing methods are explored (re-decoding with violating words negatively constrained, and refining hypotheses with a large language model given the terminology constraints).
  • results: Experiments show the terminology-aware model learns to incorporate terminologies effectively, and the large language model refinement process further improves terminology recall.
    Abstract Terminology correctness is important in the downstream application of machine translation, and a prevalent way to ensure this is to inject terminology constraints into a translation system. In our submission to the WMT 2023 terminology translation task, we adopt a translate-then-refine approach which can be domain-independent and requires minimal manual efforts. We annotate random source words with pseudo-terminology translations obtained from word alignment to first train a terminology-aware model. Further, we explore two post-processing methods. First, we use an alignment process to discover whether a terminology constraint has been violated, and if so, we re-decode with the violating word negatively constrained. Alternatively, we leverage a large language model to refine a hypothesis by providing it with terminology constraints. Results show that our terminology-aware model learns to incorporate terminologies effectively, and the large language model refinement process can further improve terminology recall.

SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese

  • paper_url: http://arxiv.org/abs/2310.05818
  • repo_url: None
  • paper_authors: Liang Xu, Kangkang Zhao, Lei Zhu, Hang Xue
  • for: The paper aims to systematically assess the safety of Chinese large language models (LLMs) and provide a benchmark for creating safer and more trustworthy models.
  • methods: The paper introduces a multi-round adversarial benchmark called SuperCLUE-Safety (SC-Safety) with 4912 open-ended questions covering more than 20 safety sub-dimensions; adversarial human-model interactions and conversations significantly increase the challenge over existing methods.
  • results: Closed-source models perform better in terms of safety than open-source ones; models released from China demonstrate safety levels comparable to LLMs like GPT-3.5-turbo; and smaller models with 6B-13B parameters can compete effectively on safety. The findings provide guidance on model selection and promote collaborative efforts to create safer LLMs.
    Abstract Large language models (LLMs), like ChatGPT and GPT-4, have demonstrated remarkable abilities in natural language understanding and generation. However, alongside their positive impact on our daily tasks, they can also produce harmful content that negatively affects societal perceptions. To systematically assess the safety of Chinese LLMs, we introduce SuperCLUE-Safety (SC-Safety) - a multi-round adversarial benchmark with 4912 open-ended questions covering more than 20 safety sub-dimensions. Adversarial human-model interactions and conversations significantly increase the challenges compared to existing methods. Experiments on 13 major LLMs supporting Chinese yield the following insights: 1) Closed-source models outperform open-sourced ones in terms of safety; 2) Models released from China demonstrate comparable safety levels to LLMs like GPT-3.5-turbo; 3) Some smaller models with 6B-13B parameters can compete effectively in terms of safety. By introducing SC-Safety, we aim to promote collaborative efforts to create safer and more trustworthy LLMs. The benchmark and findings provide guidance on model selection. Our benchmark can be found at https://www.CLUEbenchmarks.com

DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.05793
  • repo_url: https://github.com/Shark-NLP/DiffuSeq
  • paper_authors: Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
  • for: Speed up the training and sampling of text diffusion models to bring them closer to practical application.
  • methods: Introduce a soft absorbing state that helps the diffusion model learn to reconstruct discrete mutations based on the underlying Gaussian space, and employ state-of-the-art ODE solvers in the continuous space to speed up sampling.
  • results: Experiments show the proposed method accelerates training convergence by 4x and generates samples of similar quality 800x faster.
    Abstract Diffusion models have gained prominence in generating high-quality sequences of text. Nevertheless, current approaches predominantly represent discrete text within a continuous diffusion space, which incurs substantial computational overhead during training and results in slower sampling speeds. In this paper, we introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space, thereby enhancing its capacity to recover conditional signals. During the sampling phase, we employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process. Comprehensive experimental evaluations reveal that our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster, rendering it significantly closer to practical application. \footnote{The code is released at \url{https://github.com/Shark-NLP/DiffuSeq}

Problem-Solving Guide: Predicting the Algorithm Tags and Difficulty for Competitive Programming Problems

  • paper_url: http://arxiv.org/abs/2310.05791
  • repo_url: https://github.com/sronger/psg_predicting_algorithm_tags_and_difficulty
  • paper_authors: Juntae Kim, Eunjung Cho, Dongwoo Kim, Dongbin Na
  • for: This paper aims to help engineers and developers solve algorithm problems more efficiently by predicting the algorithm tag and difficulty level of a problem.
  • methods: The authors propose a deep learning-based method for simultaneously predicting the algorithm tags and difficulty level of a given algorithm problem.
  • results: The authors present AMT, a real-world multi-task dataset of algorithm problems that is the largest to date for predicting algorithm tags, and show that their method can accurately predict algorithm tags and difficulty levels.
    Abstract The recent program development industries have required problem-solving abilities for engineers, especially application developers. However, AI-based education systems to help solve computer algorithm problems have not yet attracted attention, while most big tech companies require the ability to solve algorithm problems including Google, Meta, and Amazon. The most useful guide to solving algorithm problems might be guessing the category (tag) of the facing problems. Therefore, our study addresses the task of predicting the algorithm tag as a useful tool for engineers and developers. Moreover, we also consider predicting the difficulty levels of algorithm problems, which can be used as useful guidance to calculate the required time to solve that problem. In this paper, we present a real-world algorithm problem multi-task dataset, AMT, by mainly collecting problem samples from the most famous and large competitive programming website Codeforces. To the best of our knowledge, our proposed dataset is the most large-scale dataset for predicting algorithm tags compared to previous studies. Moreover, our work is the first to address predicting the difficulty levels of algorithm problems. We present a deep learning-based novel method for simultaneously predicting algorithm tags and the difficulty levels of an algorithm problem given. All datasets and source codes are available at https://github.com/sronger/PSG_Predicting_Algorithm_Tags_and_Difficulty.

Aligning Language Models with Human Preferences via a Bayesian Approach

  • paper_url: http://arxiv.org/abs/2310.05782
  • repo_url: https://github.com/wangjs9/aligned-dpm
  • paper_authors: Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li
  • for: This paper aims to advance human-centric natural language generation (NLG) by ensuring alignment between NLG models and human preferences.
  • methods: The proposed method uses a Bayesian framework to account for the distribution of disagreements among human preferences when training a preference model (d-PM), and trains the NLG model with the resulting preference scores via contrastive learning.
  • results: The method consistently exceeds previous state-of-the-art (SOTA) models in both automatic and human evaluations on two human-centric NLG tasks, emotional support conversation and integrity "Rule-of-Thumb" generation.
    Abstract In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial. For this alignment, current popular methods leverage a reinforcement learning (RL) approach with a reward model trained on feedback from humans. However, inherent disagreements due to the subjective nature of human preferences pose a significant challenge for training the reward model, resulting in a deterioration of the NLG performance. To tackle this issue, previous approaches typically rely on majority voting or averaging to consolidate multiple inconsistent preferences into a merged one. Although straightforward to understand and execute, such methods suffer from an inability to capture the nuanced degrees of disaggregation among humans and may only represent a specialized subset of individuals, thereby lacking the ability to quantitatively disclose the universality of human preferences. To address this challenge, this paper proposes a novel approach, which employs a Bayesian framework to account for the distribution of disagreements among human preferences as training a preference model, and names it as d-PM. Besides, considering the RL strategy's inefficient and complex training process over the training efficiency, we further propose utilizing the contrastive learning strategy to train the NLG model with the preference scores derived from the d-PM model. Extensive experiments on two human-centric NLG tasks, i.e., emotional support conversation and integrity "Rule-of-Thumb" generation, show that our method consistently exceeds previous SOTA models in both automatic and human evaluations.
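    The paper's d-PM model is more involved, but a one-line Beta-posterior mean already illustrates the underlying idea of preserving the disagreement distribution instead of collapsing it by majority vote (a sketch of the principle, not the paper's formulation):

```python
def soft_preference(votes_for_a: int, votes_for_b: int, alpha=1.0, beta=1.0) -> float:
    """Posterior mean of P(A preferred) under a Beta(alpha, beta) prior.
    Majority voting would round this to 0 or 1, losing the disagreement signal."""
    return (votes_for_a + alpha) / (votes_for_a + votes_for_b + alpha + beta)

# 7 annotators prefer A, 3 prefer B -> soft score of 0.67 rather than a hard 1.
print(soft_preference(7, 3))
```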

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05736
  • repo_url: https://github.com/microsoft/LLMLingua
  • paper_authors: Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu
  • for: This work aims to accelerate inference and reduce the cost of large language models (LLMs) so they can be applied more widely.
  • methods: It proposes LLMLingua, a coarse-to-fine prompt compression method comprising a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction-tuning-based method for distribution alignment between language models.
  • results: Experiments and analysis on four datasets from different scenarios (GSM8K, BBH, ShareGPT, and Arxiv-March23) show the approach achieves state-of-the-art performance and allows up to 20x compression with little performance loss.
    Abstract Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.

Towards Emotion-Based Synthetic Consciousness: Using LLMs to Estimate Emotion Probability Vectors

  • paper_url: http://arxiv.org/abs/2310.10673
  • repo_url: None
  • paper_authors: David Sinclair, Willem Pye
  • for: This paper explores how large language models (LLMs) can be used to estimate a summary of the emotional state associated with a piece of text.
  • methods: The summary is a dictionary of emotion-describing words together with the probability of each word appearing after a prompt composed of the original text and an emotion-eliciting tail.
  • results: Emotion analysis of Amazon product reviews shows that emotion descriptors can be mapped into a PCA-type space; however, eliciting text descriptions of actions that would improve the current state via a tail prompt proved not to be straightforward.
    Abstract This paper shows how LLMs (Large Language Models) may be used to estimate a summary of the emotional state associated with piece of text. The summary of emotional state is a dictionary of words used to describe emotion together with the probability of the word appearing after a prompt comprising the original text and an emotion eliciting tail. Through emotion analysis of Amazon product reviews we demonstrate emotion descriptors can be mapped into a PCA type space. It was hoped that text descriptions of actions to improve a current text described state could also be elicited through a tail prompt. Experiment seemed to indicate that this is not straightforward to make work. This failure put our hoped for selection of action via choosing the best predict ed outcome via comparing emotional responses out of reach for the moment.
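    A minimal sketch of the emotion-vector idea: append an emotion-eliciting tail to the text and read off the LM's probability for each emotion descriptor (the tail wording, the word list, and GPT-2 are assumptions, not the paper's choices):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
EMOTIONS = ["happy", "angry", "sad", "satisfied", "disappointed"]

def emotion_vector(text: str) -> dict:
    ids = tokenizer(text + " Reading this, I feel", return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    # First-subtoken probability per descriptor (the leading space matters for GPT-2 BPE).
    scores = {w: probs[tokenizer.encode(" " + w)[0]].item() for w in EMOTIONS}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}  # renormalize over the word list
```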

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

  • paper_url: http://arxiv.org/abs/2310.05694
  • repo_url: https://github.com/kaihe-better/llm-for-healthcare
  • paper_authors: Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria
  • for: This survey provides an overview of the capabilities of currently developed LLMs for healthcare and of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs.
  • methods: The paper first explores the potential of LLMs to enhance the efficiency and effectiveness of various healthcare applications, highlighting strengths and limitations; it then compares PLMs with the latest LLMs, and various LLMs with each other, and summarizes the related healthcare training data, training methods, optimization strategies, and usage.
  • results: Beyond the capability overview, the survey investigates the unique concerns of deploying LLMs in healthcare settings, particularly fairness, accountability, transparency, and ethics, and compiles open-source resources (datasets, methodologies, code implementations, and evaluation benchmarks) on GitHub.
    Abstract The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to datacentered methodologies.

Larth: Dataset and Machine Translation for Etruscan

  • paper_url: http://arxiv.org/abs/2310.05688
  • repo_url: https://github.com/gianlucavico/larth-etruscan-nlp
  • paper_authors: Gianluca Vico, Gerasimos Spanakis
  • for: This paper provides a machine translation dataset from Etruscan to English, enabling future NLP research on this low-resource language.
  • methods: The dataset contains 2891 translated examples drawn from existing academic sources, some extracted manually and others automatically; several machine translation models are benchmarked on it.
  • results: A small transformer model achieves a BLEU score of 10.1; releasing the dataset can help enable future research on Etruscan, similar languages, and other languages with scarce resources.
    Abstract Etruscan is an ancient language spoken in Italy from the 7th century BC to the 1st century AD. There are no native speakers of the language at the present day, and its resources are scarce, as there exist only around 12,000 known inscriptions. To the best of our knowledge, there are no publicly available Etruscan corpora for natural language processing. Therefore, we propose a dataset for machine translation from Etruscan to English, which contains 2891 translated examples from existing academic sources. Some examples are extracted manually, while others are acquired in an automatic way. Along with the dataset, we benchmark different machine translation models observing that it is possible to achieve a BLEU score of 10.1 with a small transformer model. Releasing the dataset can help enable future research on this language, similar languages or other languages with scarce resources.

A Closer Look into Automatic Evaluation Using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05657
  • repo_url: https://github.com/d223302/a-closer-look-to-llm-evaluation
  • paper_authors: Cheng-Han Chiang, Hung-yi Lee
  • for: This paper evaluates the effectiveness of using large language models (LLMs) for text quality evaluation and compares different evaluation methods.
  • methods: The paper analyzes two existing methods, LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), examining how details of the evaluation process affect how well the LLM's ratings correlate with human ratings.
  • results: The auto Chain-of-Thought (CoT) used in G-Eval does not always improve correlation with human ratings, and forcing the LLM to output only a numeric rating is suboptimal; asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT and human ratings, pushing state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
    Abstract Using large language models (LLMs) to evaluate text quality has recently gained popularity. Some prior works explore the idea of using LLMs for evaluation, while they differ in some details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between the ChatGPT and human ratings and pushes state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
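    A minimal sketch of the "explain, then rate" prompting that the paper finds improves correlation with human ratings (the instruction wording and parsing convention below are assumptions):

```python
EVAL_TEMPLATE = """Evaluate the coherence of the following summary on a 1-5 scale.

Source: {source}
Summary: {summary}

First explain your reasoning, then give your rating on the final line as
"Rating: <number>"."""

def parse_rating(response: str) -> int:
    """Pull the numeric rating off the model's final 'Rating: N' line."""
    for line in reversed(response.strip().splitlines()):
        if line.lower().startswith("rating:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no rating found in model response")
```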

RAUCG: Retrieval-Augmented Unsupervised Counter Narrative Generation for Hate Speech

  • paper_url: http://arxiv.org/abs/2310.05650
  • repo_url: None
  • paper_authors: Shuyu Jiang, Wenyi Tang, Xingshu Chen, Rui Tanga, Haizhou Wang, Wenxian Wang
  • for: This work proposes automatically generating Counter Narratives (CNs) to combat online hate speech (HS) without infringing on freedom of speech.
  • methods: It proposes Retrieval-Augmented Unsupervised Counter Narrative Generation (RAUCG), which retrieves counter-knowledge with an SSF method (stance consistency, semantic overlap rate, and fitness for HS) and maps it into CNs via an energy-based decoding mechanism that quantizes knowledge-injection, countering, and fluency constraints into differentiable functions, requiring no expert-authored CN data.
  • results: RAUCG outperforms strong baselines on all metrics (language quality, toxicity, persuasiveness, relevance, and success rate of countering HS), with improvements of +2.0% in relevance and +4.5% in countering success rate; it also enables GPT2 to outperform T0 on all metrics despite T0 being roughly eight times larger.
    Abstract The Counter Narrative (CN) is a promising approach to combat online hate speech (HS) without infringing on freedom of speech. In recent years, there has been a growing interest in automatically generating CNs using natural language generation techniques. However, current automatic CN generation methods mainly rely on expert-authored datasets for training, which are time-consuming and labor-intensive to acquire. Furthermore, these methods cannot directly obtain and extend counter-knowledge from external statistics, facts, or examples. To address these limitations, we propose Retrieval-Augmented Unsupervised Counter Narrative Generation (RAUCG) to automatically expand external counter-knowledge and map it into CNs in an unsupervised paradigm. Specifically, we first introduce an SSF retrieval method to retrieve counter-knowledge from the multiple perspectives of stance consistency, semantic overlap rate, and fitness for HS. Then we design an energy-based decoding mechanism by quantizing knowledge injection, countering and fluency constraints into differentiable functions, to enable the model to build mappings from counter-knowledge to CNs without expert-authored CN data. Lastly, we comprehensively evaluate model performance in terms of language quality, toxicity, persuasiveness, relevance, and success rate of countering HS, etc. Experimental results show that RAUCG outperforms strong baselines on all metrics and exhibits stronger generalization capabilities, achieving significant improvements of +2.0% in relevance and +4.5% in success rate of countering metrics. Moreover, RAUCG enabled GPT2 to outperform T0 in all metrics, despite the latter being approximately eight times larger than the former. Warning: This paper may contain offensive or upsetting content!

Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution

  • paper_url: http://arxiv.org/abs/2310.05634
  • repo_url: None
  • paper_authors: Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, Aixin Sun
  • for: This work improves the reliability and verifiability of large language models (LLMs) by addressing three core concerns of conventional attributed LMs.
  • methods: It extends the attribution source from unstructured texts to a Knowledge Graph (KG), proposes a "Conscious Incompetence" setting in which the model identifies the need for supporting knowledge beyond the provided KG, and introduces a comprehensive automatic evaluation metric covering text quality, citation quality, and text-citation alignment.
  • results: Using BioKaLMA, a biography-domain dataset built via an evolutionary question generation strategy that controls question complexity and the knowledge needed to answer, a baseline solution shows considerable room for improvement in LLMs' citation generation, underscoring the importance of the "Conscious Incompetence" setting and of retrieval accuracy.
    Abstract Although achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns on conventional attributed LMs. First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new ``Conscious Incompetence" setting considering the incomplete knowledge repository, where the model identifies the need for supporting knowledge beyond the provided KG. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment. To implement the above innovations, we build a dataset in biography domain BioKaLMA via a well-designed evolutionary question generation strategy, to control the question complexity and necessary knowledge to the answer. For evaluation, we develop a baseline solution and demonstrate the room for improvement in LLMs' citation generation, emphasizing the importance of incorporating the "Conscious Incompetence" setting, and the critical role of retrieval accuracy.

Glitter or Gold? Deriving Structured Insights from Sustainability Reports via Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05628
  • repo_url: None
  • paper_authors: Marco Bronzini, Carlo Nicolini, Bruno Lepri, Andrea Passerini, Jacopo Staiano
  • for: This paper aims to provide a framework for extracting and analyzing non-financial information from sustainability reports to support investors’ ESG-related decision-making.
  • methods: The authors use Large Language Models (LLMs), Retrieved Augmented Generation, and in-context learning to extract semantically structured information from sustainability reports. They also employ graph-based representations to analyze the obtained findings.
  • results: The authors generate meaningful statistical, similarity, and correlation analyses concerning the sustainability actions undertaken across industries, sectors, and regions. They also investigate the factors that impact companies’ ESG scores using their findings and other company information.
    Abstract Over the last decade, several regulatory bodies have started requiring the disclosure of non-financial information from publicly listed companies, in light of the investors' increasing attention to Environmental, Social, and Governance (ESG) issues. Such information is publicly released in a variety of non-structured and multi-modal documentation. Hence, it is not straightforward to aggregate and consolidate such data in a cohesive framework to further derive insights about sustainability practices across companies and markets. Thus, it is natural to resort to Information Extraction (IE) techniques to provide concise, informative and actionable data to the stakeholders. Moving beyond traditional text processing techniques, in this work we leverage Large Language Models (LLMs), along with prominent approaches such as Retrieved Augmented Generation and in-context learning, to extract semantically structured information from sustainability reports. We then adopt graph-based representations to generate meaningful statistical, similarity and correlation analyses concerning the obtained findings, highlighting the prominent sustainability actions undertaken across industries and discussing emerging similarity and disclosing patterns at company, sector and region levels. Lastly, we investigate which factual aspects impact the most on companies' ESG scores using our findings and other company information.

Integrating Stock Features and Global Information via Large Language Models for Enhanced Stock Return Prediction

  • paper_url: http://arxiv.org/abs/2310.05627
  • repo_url: None
  • paper_authors: Yujie Ding, Shuai Jia, Tianyi Ma, Bingcheng Mao, Xiuze Zhou, Liuliu Li, Dongming Han
  • for: This study integrates large language models (LLMs) such as ChatGPT and GPT-4 into existing quantitative investment models to improve the accuracy of stock return predictions.
  • methods: It proposes a novel framework with two components: (1) the Local-Global (LG) model, which introduces three distinct strategies for modeling global information, grounded respectively in stock features, the capabilities of LLMs, and a hybrid of the two; and (2) Self-Correlated Reinforcement Learning (SCRL), which aligns the embeddings of financial news generated by LLMs with stock features within the same semantic space.
  • results: The framework achieves superior Rank Information Coefficient and returns in the China A-share market, particularly compared to models relying only on stock features.
    Abstract The remarkable achievements and rapid advancements of Large Language Models (LLMs) such as ChatGPT and GPT-4 have showcased their immense potential in quantitative investment. Traders can effectively leverage these LLMs to analyze financial news and predict stock returns accurately. However, integrating LLMs into existing quantitative models presents two primary challenges: the insufficient utilization of semantic information embedded within LLMs and the difficulties in aligning the latent information within LLMs with pre-existing quantitative stock features. We propose a novel framework consisting of two components to surmount these challenges. The first component, the Local-Global (LG) model, introduces three distinct strategies for modeling global information. These approaches are grounded respectively on stock features, the capabilities of LLMs, and a hybrid method combining the two paradigms. The second component, Self-Correlated Reinforcement Learning (SCRL), focuses on aligning the embeddings of financial news generated by LLMs with stock features within the same semantic space. By implementing our framework, we have demonstrated superior performance in Rank Information Coefficient and returns, particularly compared to models relying only on stock features in the China A-share market.

LAiW: A Chinese Legal Large Language Models Benchmark

  • paper_url: http://arxiv.org/abs/2310.05620
  • repo_url: https://github.com/dai-shen/laiw
  • paper_authors: Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, Hao Wang
  • for: Evaluate the legal capabilities of the many recently released legal LLMs.
  • methods: Legal capabilities are divided into three levels: basic legal NLP capability, basic legal application capability, and complex legal application capability.
  • results: The first evaluation phase, focused on basic legal NLP capability, shows that although some legal LLMs perform better than their backbones, a gap remains compared to ChatGPT.
    Abstract With the emergence of numerous legal LLMs, there is currently a lack of a comprehensive benchmark for evaluating their legal abilities. In this paper, we propose the first Chinese Legal LLMs benchmark based on legal capabilities. Through the collaborative efforts of legal and artificial intelligence experts, we divide the legal capabilities of LLMs into three levels: basic legal NLP capability, basic legal application capability, and complex legal application capability. We have completed the first phase of evaluation, which mainly focuses on the capability of basic legal NLP. The evaluation results show that although some legal LLMs have better performance than their backbones, there is still a gap compared to ChatGPT. Our benchmark can be found at URL.

Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance

  • paper_url: http://arxiv.org/abs/2310.05597
  • repo_url: None
  • paper_authors: Molly R. Petersen, Lonneke van der Plas
  • for: Test whether models can learn basic analogical reasoning, evaluated with analogies more typical of those used to assess analogical reasoning in humans than those in common NLP benchmarks.
  • methods: Several training approaches are tested using small amounts of data, and model performance is compared against a dataset with a human baseline.
  • results: Models are able to learn analogical reasoning even with a small amount of data, and after training they approach human performance.
    Abstract While analogies are a common way to evaluate word embeddings in NLP, it is also of interest to investigate whether or not analogical reasoning is a task in itself that can be learned. In this paper, we test several ways to learn basic analogical reasoning, specifically focusing on analogies that are more typical of what is used to evaluate analogical reasoning in humans than those in commonly used NLP benchmarks. Our experiments find that models are able to learn analogical reasoning, even with a small amount of data. We additionally compare our models to a dataset with a human baseline, and find that after training, models approach human performance.

DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking

  • paper_url: http://arxiv.org/abs/2310.05589
  • repo_url: https://github.com/starreeze/drin
  • paper_authors: Shangyu Xing, Fei Zhao, Zhen Wu, Chunhui Li, Jianbing Zhang, Xinyu Dai
  • for: To address two limitations in multimodal entity linking (MEL): coarse fusion of text and image features before matching, and static alignment that handles complex and diverse inputs poorly.
  • methods: Proposes the Dynamic Relation Interactive Network (DRIN), which explicitly models four types of alignment between a mention and an entity and uses a dynamic Graph Convolutional Network (GCN) to select the appropriate alignment relations for each input sample.
  • results: Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin.
    Abstract Multimodal Entity Linking (MEL) is a task that aims to link ambiguous mentions within multimodal contexts to referential entities in a multimodal knowledge base. Recent methods for MEL adopt a common framework: they first interact and fuse the text and image to obtain representations of the mention and entity respectively, and then compute the similarity between them to predict the correct entity. However, these methods still suffer from two limitations: first, as they fuse the features of text and image before matching, they cannot fully exploit the fine-grained alignment relations between the mention and entity. Second, their alignment is static, leading to low performance when dealing with complex and diverse data. To address these issues, we propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
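As a rough illustration of sample-dependent alignment selection, the sketch below gates four precomputed alignment scores with weights predicted from the mention-entity pair. This is a deliberate simplification with hypothetical names: DRIN realizes the selection with a dynamic graph convolutional network over mention and entity nodes.

```python
# Soft gate over four alignment relations, weighted per input sample.
import torch
import torch.nn as nn

class DynamicAlignmentGate(nn.Module):
    def __init__(self, feat_dim=256, num_relations=4):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_relations)  # sample-dependent weights

    def forward(self, pair_feat, relation_scores):
        # pair_feat: (B, feat_dim) joint mention-entity features
        # relation_scores: (B, 4) similarity under each alignment type
        weights = torch.softmax(self.gate(pair_feat), dim=-1)
        return (weights * relation_scores).sum(dim=-1)  # final matching score

gate = DynamicAlignmentGate()
score = gate(torch.randn(2, 256), torch.randn(2, 4))
```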

Regulation and NLP (RegNLP): Taming Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05553
  • repo_url: None
  • paper_authors: Catalina Goanta, Nikolaos Aletras, Ilias Chalkidis, Sofia Ranchordas, Gerasimos Spanakis
  • for: To explore how NLP research can benefit from engaging with regulation studies in order to evaluate and manage risk more systematically.
  • methods: Reviews existing work in regulation studies and NLP, discusses basic tenets of regulation and of risk and uncertainty, and connects the two fields.
  • results: Identifies shortcomings in current NLP discussions of risk assessment and proposes a new multidisciplinary research space, RegNLP, to connect scientific knowledge to regulatory processes.
    Abstract The scientific innovation in Natural Language Processing (NLP) and more broadly in artificial intelligence (AI) is at its fastest pace to date. As large language models (LLMs) unleash a new era of automation, important debates emerge regarding the benefits and risks of their development, deployment and use. Currently, these debates have been dominated by often polarized narratives mainly led by the AI Safety and AI Ethics movements. This polarization, often amplified by social media, is swaying political agendas on AI regulation and governance and posing issues of regulatory capture. Capture occurs when the regulator advances the interests of the industry it is supposed to regulate, or of special interest groups rather than pursuing the general public interest. Meanwhile in NLP research, attention has been increasingly paid to the discussion of regulating risks and harms. This often happens without systematic methodologies or sufficient rooting in the disciplines that inspire an extended scope of NLP research, jeopardizing the scientific integrity of these endeavors. Regulation studies are a rich source of knowledge on how to systematically deal with risk and uncertainty, as well as with scientific evidence, to evaluate and compare regulatory options. This resource has largely remained untapped so far. In this paper, we argue how NLP research on these topics can benefit from proximity to regulatory studies and adjacent fields. We do so by discussing basic tenets of regulation, and risk and uncertainty, and by highlighting the shortcomings of current NLP discussions dealing with risk assessment. Finally, we advocate for the development of a new multidisciplinary research space on regulation and NLP (RegNLP), focused on connecting scientific knowledge to regulatory processes based on systematic methodologies.

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

  • paper_url: http://arxiv.org/abs/2310.05513
  • repo_url: None
  • paper_authors: Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe
  • for: To benchmark self-supervised models for multilingual speech recognition and language identification.
  • methods: Builds on the SUPERB framework with a research track on multilingual subjects, a challenge track for model submissions, and a new-language track where researchers contribute and evaluate low-resource language data.
  • results: Across 12 model submissions and 54 language corpora covering 154 languages, the findings show that merely scaling models does not solve multilingual speech tasks, and that diverse speech/voice types pose significant challenges for multilingual speech processing.
    Abstract The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track where language resource researchers can contribute and evaluate their low-resource language data in the context of the latest progress in multilingual speech recognition. The challenge garnered 12 model submissions and 54 language corpora, resulting in a comprehensive benchmark encompassing 154 languages. The findings indicate that merely scaling models is not the definitive solution for multilingual speech tasks, and a variety of speech/voice types present significant challenges in multilingual speech processing.

XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners

  • paper_url: http://arxiv.org/abs/2310.05502
  • repo_url: https://github.com/luoxiaoheics/xal
  • paper_authors: Yun Luo, Zhen Yang, Fandong Meng, Yingjie Li, Fang Guo, Qinglin Qi, Jie Zhou, Yue Zhang
  • for: Proposes a novel Explainable Active Learning (XAL) framework for low-resource text classification that encourages classifiers to justify their inferences and to delve into unlabeled data for which they cannot provide reasonable explanations.
  • methods: The paper uses a pre-trained bi-directional encoder for classification, and employs a pre-trained uni-directional decoder to generate and score the explanation. A ranking loss is proposed to enhance the decoder’s capability in scoring explanations. During the selection of unlabeled data, the paper combines the predictive uncertainty of the encoder and the explanation score of the decoder to acquire informative data for annotation.
  • results: The paper achieves substantial improvement on all six tasks over previous Active Learning (AL) methods, and ablation studies demonstrate the effectiveness of each component. Human evaluation shows that the model trained in XAL performs surprisingly well in explaining its prediction.
    Abstract Active learning aims to construct an effective training set by iteratively curating the most informative unlabeled data for annotation, which is practical in low-resource tasks. Most active learning techniques in classification rely on the model's uncertainty or disagreement to choose unlabeled data. However, previous work indicates that existing models are poor at quantifying predictive uncertainty, which can lead to over-confidence in superficial patterns and a lack of exploration. Inspired by the cognitive processes in which humans deduce and predict through causal information, we propose a novel Explainable Active Learning framework (XAL) for low-resource text classification, which aims to encourage classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Specifically, besides using a pre-trained bi-directional encoder for classification, we employ a pre-trained uni-directional decoder to generate and score the explanation. A ranking loss is proposed to enhance the decoder's capability in scoring explanations. During the selection of unlabeled data, we combine the predictive uncertainty of the encoder and the explanation score of the decoder to acquire informative data for annotation. As XAL is a general framework for text classification, we test our methods on six different classification tasks. Extensive experiments show that XAL achieves substantial improvement on all six tasks over previous AL methods. Ablation studies demonstrate the effectiveness of each component, and human evaluation shows that the model trained in XAL performs surprisingly well in explaining its prediction.
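The acquisition step can be sketched as a score that rises with the encoder's uncertainty and falls with the decoder's explanation score. The normalized-entropy uncertainty, the linear mix, and the weight `lam` below are assumptions; the paper combines the two signals, but its exact formulation may differ.

```python
# Sketch of XAL-style selection: prefer samples the classifier is unsure
# about and cannot explain well.
import numpy as np

def selection_scores(class_probs, explanation_scores, lam=0.5):
    # class_probs: (N, C) encoder probabilities; explanation_scores: (N,) in [0, 1]
    entropy = -(class_probs * np.log(class_probs + 1e-12)).sum(axis=1)
    entropy /= np.log(class_probs.shape[1])            # normalize to [0, 1]
    return lam * entropy + (1 - lam) * (1 - explanation_scores)

def select_for_annotation(class_probs, explanation_scores, k=16):
    scores = selection_scores(class_probs, explanation_scores)
    return np.argsort(-scores)[:k]   # indices of the k most informative samples
```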

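IDTraffickers: An Authorship Attribution Dataset to Link and Connect Potential Human-Trafficking Operations on Text Escort Advertisements
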
  • paper_url: http://arxiv.org/abs/2310.05484
  • repo_url: None
  • paper_authors: Vageesh Saxena, Benjamin Bashpole, Gijs Van Dijck, Gerasimos Spanakis
  • for: To help Law Enforcement Agencies (LEAs) identify and connect human trafficking (HT) vendors across online escort advertisements.
  • methods: Builds IDTraffickers, a dataset of 87,595 text ads and 5,244 vendor labels, and trains a DeCLUTR-small model that achieves a macro-F1 of 0.8656 in a closed-set classification setting.
  • results: Using style representations extracted from the trained classifier, authorship verification reaches a mean r-precision of 0.8852 in an open-set ranking setting, supporting the identification of potential HT indicators.
    Abstract Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators.

Empower Nested Boolean Logic via Self-Supervised Curriculum Learning

  • paper_url: http://arxiv.org/abs/2310.05450
  • repo_url: https://github.com/gingasan/boolkill
  • paper_authors: Hongqiu Wu, Linfeng Liu, Hai Zhao, Min Zhang
  • for: To scrutinize whether language models possess genuine logical reasoning capability rather than mere exposure to relevant training data.
  • methods: Proposes Curriculum Logical Reasoning (CLR), a self-supervised method that augments training data with nested boolean logic chains and schedules training from simpler logical patterns to progressively harder ones.
  • results: With CLR, language models effectively generalize to much harder and longer-hop logic, and boolean logic proves a solid foundation for subsequent general logical tasks.
    Abstract Beyond the great cognitive powers showcased by language models, it is crucial to scrutinize whether their reasoning capabilities stem from strong generalization or merely from exposure to relevant data. Rather than constructing increasingly complex logic, this paper probes boolean logic, the root capability of a logical reasoner. We find that pre-trained language models, including large language models, behave like random selectors in the face of multi-nested boolean logic, a task that humans can handle with ease. To empower language models with this fundamental capability, this paper proposes a new self-supervised learning method, Curriculum Logical Reasoning (CLR), where we augment the training data with nested boolean logic chains step by step, and program the training from simpler logical patterns gradually to harder ones. This new training paradigm allows language models to effectively generalize to much harder and longer-hop logic, which can hardly be learned through naive training. Furthermore, we show that boolean logic is a great foundation for improving the subsequent general logical tasks.
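The curriculum can be sketched with a tiny generator that wraps a base truth value in progressively deeper negations and schedules shallow samples before deep ones. The textual template is an illustrative assumption, not the paper's CLR data format.

```python
# Generate nested boolean statements of increasing depth for a curriculum.
import random

def nested_boolean_sample(depth):
    value = random.choice([True, False])
    text = "true" if value else "false"
    for _ in range(depth):
        text = f"it is not the case that {text}"
        value = not value                     # each negation flips the label
    return text, value

def curriculum(max_depth=5, per_level=3):
    # simpler (shallow) samples first, harder (deeply nested) ones later
    for depth in range(1, max_depth + 1):
        for _ in range(per_level):
            yield nested_boolean_sample(depth)

for statement, label in curriculum(max_depth=3, per_level=1):
    print(label, "<-", statement)
```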

Establishing Trustworthiness: Rethinking Tasks and Model Evaluation

  • paper_url: http://arxiv.org/abs/2310.05442
  • repo_url: None
  • paper_authors: Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber, Barbara Plank
  • for: To rethink what constitutes tasks and model evaluation in NLP as large language models are deployed in ever more real-world scenarios.
  • methods: Reviews traditional compartmentalized approaches for understanding the origins of a model's functional capacity and provides recommendations for more multi-faceted evaluation protocols.
  • results: Argues that trustworthy and reliable NLP systems require a more holistic view of language, with trustworthiness placed at the center of task design and model evaluation.
    Abstract Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model's functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding

  • paper_url: http://arxiv.org/abs/2310.05424
  • repo_url: https://github.com/raymin0223/fast_robust_early_exit
  • paper_authors: Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
  • for: To reduce the inference latency of autoregressive language models.
  • methods: Proposes the Fast and Robust Early-Exiting (FREE) framework, comprising a shallow-deep module and synchronized parallel decoding.
  • results: Substantially speeds up inference across a range of generation tasks, aided by an adaptive threshold estimator based on a Beta mixture model for setting suitable confidence thresholds.
    Abstract To tackle the high inference latency exhibited by autoregressive language models, previous studies have proposed an early-exiting framework that allocates adaptive computation paths for each token based on the complexity of generating the subsequent token. However, we observed several shortcomings, including performance degradation caused by a state copying mechanism or numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and a synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrated the superiority of our proposed framework on extensive generation tasks.
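One confidence-gated decoding step can be sketched as below, assuming a fixed max-probability threshold. FREE itself goes further: it estimates thresholds adaptively with a Beta mixture model and synchronizes the deep pass with previously early-exited tokens in parallel.

```python
# Early-exit sketch: emit the shallow module's token if it is confident,
# otherwise fall back to the full deep model for this step.
import torch

def decode_step(shallow_logits, deep_logits_fn, threshold=0.9):
    probs = torch.softmax(shallow_logits, dim=-1)
    conf, token = probs.max(dim=-1)
    if conf.item() >= threshold:
        return token.item(), True          # early exit at the shallow module
    deep_logits = deep_logits_fn()         # run the remaining deep layers
    return deep_logits.argmax(dim=-1).item(), False

vocab_size = 100
token, exited = decode_step(torch.randn(vocab_size), lambda: torch.randn(vocab_size))
```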

Automating Customer Service using LangChain: Building custom open-source GPT Chatbot for organizations

  • paper_url: http://arxiv.org/abs/2310.05421
  • repo_url: None
  • paper_authors: Keivalya Pandya, Mehfuza Holia
  • for: This research paper aims to automate customer service using LangChain, a custom Large Language Model (LLM) solution tailored for organizations, which can provide personalized, responsive, and context-aware support.
  • methods: The paper proposes an approach that combines open-source methodologies, web scraping, fine-tuning, and the integration of LangChain into customer service platforms. It uses data collection via web scraping, embeddings, and Google's Flan T5 XXL, Base, and Small language models for knowledge retrieval, with a chatbot integrated into customer service platforms.
  • results: The proposed approach provides real-time support and query resolution, scales across industries and organizations, and elevates customer retention, value extraction, and brand image.
    Abstract In the digital age, the dynamics of customer service are evolving, driven by technological advancements and the integration of Large Language Models (LLMs). This research paper introduces a groundbreaking approach to automating customer service using LangChain, a custom LLM tailored for organizations. The paper explores the obsolescence of traditional customer support techniques, particularly Frequently Asked Questions (FAQs), and proposes a paradigm shift towards responsive, context-aware, and personalized customer interactions. The heart of this innovation lies in the fusion of open-source methodologies, web scraping, fine-tuning, and the seamless integration of LangChain into customer service platforms. This open-source state-of-the-art framework, presented as "Sahaay," demonstrates the ability to scale across industries and organizations, offering real-time support and query resolution. Key elements of this research encompass data collection via web scraping, the role of embeddings, the utilization of Google's Flan T5 XXL, Base and Small language models for knowledge retrieval, and the integration of the chatbot into customer service platforms. The results section provides insights into their performance and use cases, here particularly within an educational institution. This research heralds a new era in customer service, where technology is harnessed to create efficient, personalized, and responsive interactions. Sahaay, powered by LangChain, redefines the customer-company relationship, elevating customer retention, value extraction, and brand image. As organizations embrace LLMs, customer service becomes a dynamic and customer-centric ecosystem.
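A minimal retrieve-then-generate sketch in the spirit of the described pipeline (scrape, embed, retrieve, answer). The embedding model, the Flan-T5 checkpoint size, and the toy documents are illustrative choices, not the paper's configuration, though the paper does report using Google's Flan-T5 family.

```python
# Retrieval-augmented support-bot sketch: embed FAQ-style documents,
# retrieve by cosine similarity, and answer with a seq2seq LLM.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

docs = [
    "Refunds are processed within 5 business days.",
    "Support hours are 9am to 5pm, Monday through Friday.",
]  # in practice, scraped from the organization's website

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
generator = pipeline("text2text-generation", model="google/flan-t5-small")

def answer(question, top_k=1):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    idx = np.argsort(-(doc_vecs @ q))[:top_k]            # cosine retrieval
    context = " ".join(docs[i] for i in idx)
    prompt = f"Answer using the context.\nContext: {context}\nQuestion: {question}"
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(answer("When will I get my refund?"))
```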

mBBC: Exploring the Multilingual Maze

  • paper_url: http://arxiv.org/abs/2310.05404
  • repo_url: https://github.com/PortNLP/mBBC
  • paper_authors: Sina Bagheri Nezhad, Ameeta Agrawal
  • for: To evaluate three prominent multilingual language models (mBERT, XLM-R, and GPT-3) and better understand how they perform across languages and linguistic contexts.
  • methods: Uses the self-supervised task of next token prediction to assess performance across a diverse set of languages, focusing on the impact of resource availability, word order, language family, and script type on model accuracy.
  • results: Resource availability plays a crucial role, with higher resource levels yielding higher accuracy; the relationship between resource availability, language family, and script type is complex, calling for further investigation into language-specific characteristics and structural variations.
    Abstract Multilingual language models have gained significant attention in recent years, enabling the development of applications that cater to diverse linguistic contexts. In this paper, we present a comprehensive evaluation of three prominent multilingual language models: mBERT, XLM-R, and GPT-3. Using the self-supervised task of next token prediction, we assess their performance across a diverse set of languages, with a focus on understanding the impact of resource availability, word order, language family, and script type on model accuracy. Our findings reveal that resource availability plays a crucial role in model performance, with higher resource levels leading to improved accuracy. We also identify the complex relationship between resource availability, language families, and script types, highlighting the need for further investigation into language-specific characteristics and structural variations. Additionally, our statistical inference analysis identifies significant features contributing to model performance, providing insights for model selection and deployment. Our study contributes to a deeper understanding of multilingual language models and informs future research and development to enhance their performance and generalizability across languages and linguistic contexts.
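A small probe in the spirit of the paper's token-prediction evaluation can be run with an off-the-shelf multilingual masked language model (mBERT predicts masked tokens rather than strictly the next token). The two test items are illustrative, not the paper's benchmark.

```python
# Masked-token probe across two languages with multilingual BERT.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

probes = [
    ("The capital of France is [MASK].", "Paris"),
    ("La capital de Francia es [MASK].", "París"),
]

correct = 0
for text, gold in probes:
    top = fill(text, top_k=1)[0]["token_str"]   # best single prediction
    correct += int(top.strip() == gold)
print(f"accuracy: {correct / len(probes):.2f}")
```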

GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence

  • paper_url: http://arxiv.org/abs/2310.05388
  • repo_url: None
  • paper_authors: Zhihua Wen, Zhiliang Tian, Wei Wu, Yuxin Yang, Yanqi Shi, Zhen Huang, Dongsheng Li
  • for: This paper aims to enhance the complexity and credibility of story generation by leveraging information from human-written stories and using a retrieval-augmented story generation framework.
  • methods: The proposed method uses a retrieval repository of target conditions to produce few-shot examples that serve as prompts for a large language model (LLM). It also employs an “asking-why” prompting scheme to extract a forest of evidence, which is used to compensate for ambiguities in the generated story.
  • results: The experimental results and numerous examples demonstrate the effectiveness of the proposed method in generating stories with complex and credible plots.
    Abstract Conditional story generation is significant in human-machine interaction, particularly in producing stories with complex plots. While Large language models (LLMs) perform well on multiple NLP tasks, including story generation, it is challenging to generate stories with both complex and creative plots. Existing methods often rely on detailed prompts to guide LLMs to meet target conditions, which inadvertently restrict the creative potential of the generated stories. We argue that leveraging information from exemplary human-written stories facilitates generating more diverse plotlines. Delving deeper into story details helps build complex and credible plots. In this paper, we propose a retrieval-auGmented stoRy generation framework with a fOrest of eVidEnce (GROVE) to enhance stories' complexity. We build a retrieval repository for target conditions to produce few-shot examples to prompt LLMs. Additionally, we design an "asking-why" prompting scheme that extracts a forest of evidence, providing compensation for the ambiguities that may occur in the generated story. This iterative process uncovers underlying story backgrounds. Finally, we select the most fitting chains of evidence from the evidence forest and integrate them into the generated story, thereby enhancing the narrative's complexity and credibility. Experimental results and numerous examples verify the effectiveness of our method.
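The asking-why scheme can be sketched as an iterative prompting loop that grows one chain of the evidence forest; `llm` is a hypothetical text-in/text-out callable, and the prompt wording is an assumption rather than the paper's template.

```python
# Iteratively ask "why" about an ambiguous point in a draft story,
# collecting a chain of supporting evidence.
def ask_why(llm, story, ambiguity, depth=3):
    evidence = []
    question = f"In this story, why {ambiguity}?\nStory: {story}"
    for _ in range(depth):
        reason = llm(question)          # one step deeper into the why-chain
        evidence.append(reason)
        question = f"And why is it that {reason}?"
    return evidence                     # one chain of the evidence forest

# Usage: chains from several ambiguities form the forest, and the most
# fitting chains are folded back into the generated story.
```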

Transcending the Attention Paradigm: Representation Learning from Geospatial Social Media Data

  • paper_url: http://arxiv.org/abs/2310.05378
  • repo_url: https://github.com/NickDiSanto/Twitter2030/tree/main/Beta
  • paper_authors: Nick DiSanto, Anthony Corso, Benjamin Sanders, Gavin Harding
  • for: investigate social media data to uncover abstract relationships and challenge the reliance on complex models
  • methods: employ Bag-of-Words models specific to each city to analyze Twitter data and evaluate representation
  • results: discover hidden insights and demonstrate the considerable influence of geographic location on online communication, challenging the notion that intricate models are necessary for pattern recognition
    Abstract While transformers have pioneered attention-driven architectures as a cornerstone of research, their dependence on explicitly contextual information underscores limitations in their abilities to tacitly learn overarching textual themes. This study investigates social media data as a source of distributed patterns, challenging the heuristic paradigm of performance benchmarking. In stark contrast to networks that rely on capturing complex long-term dependencies, models of online data inherently lack structure and are forced to learn underlying patterns in the aggregate. To properly represent these abstract relationships, this research dissects empirical social media corpora into their elemental components and analyzes over two billion tweets across population-dense locations. Exploring the relationship between location and vernacular in Twitter data, we employ Bag-of-Words models specific to each city and evaluate their respective representation. This demonstrates that hidden insights can be uncovered without the crutch of advanced algorithms and demonstrates that even amidst noisy data, geographic location has a considerable influence on online communication. This evidence presents tangible insights regarding geospatial communication patterns and their implications in social science. It also challenges the notion that intricate models are prerequisites for pattern recognition in natural language, aligning with the evolving landscape that questions the embrace of absolute interpretability over abstract understanding. This study bridges the divide between sophisticated frameworks and intangible relationships, paving the way for systems that blend structured models with conjectural reasoning.
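The per-city representation can be sketched with one Bag-of-Words model per city using scikit-learn; the toy tweets stand in for the paper's corpus of over two billion tweets.

```python
# Fit a separate CountVectorizer per city and inspect its vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

tweets_by_city = {
    "nyc": ["the subway is delayed again", "bodega coffee run"],
    "la": ["traffic on the 405 all morning", "beach day after work"],
}

city_models = {}
for city, tweets in tweets_by_city.items():
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(tweets)        # (n_tweets, vocab) term counts
    city_models[city] = (vec, counts)
    print(city, sorted(vec.vocabulary_))      # city-specific vernacular terms
```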

Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis

  • paper_url: http://arxiv.org/abs/2310.05374
  • repo_url: None
  • paper_authors: Jianqiao Lu, Wenyong Huang, Nianzu Zheng, Xingshan Zeng, Yu Ting Yeung, Xiao Chen
  • for: To improve end-to-end (E2E) speech processing models, whose training demands large amounts of labeled speech that is scarcer and more expensive to collect than text.
  • methods: Proposes LaSyn, an efficient text-data utilization framework that trains a latent synthesizer to convert text into intermediate latent representations of a pre-trained speech model; these pseudo acoustic representations augment acoustic data for training.
  • results: On ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100 with relative word error rate reductions of over 22.3%; on SLU, it improves intent classification accuracy and slot-filling SLU-F1 on SLURP and exact-match accuracies on STOP, with fewer parameters and results competitive with published state-of-the-art work.
    Abstract Training a high performance end-to-end speech (E2E) processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive for collection, compared to textual data. We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. We train a latent synthesizer to convert textual data into an intermediate latent representation of a pre-trained speech model. These pseudo acoustic representations of textual data augment acoustic data for model training. We evaluate LaSyn on low-resource automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100, with relative word error rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM) and EM-Tree accuracies on STOP respectively. With fewer parameters, the results of LaSyn are competitive to published state-of-the-art works. The results demonstrate the quality of the augmented training data.
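The latent-synthesis idea can be sketched as a small network that maps text embeddings into the latent space of a frozen pre-trained speech model, so text can stand in for scarce audio during training. The dimensions, architecture, and MSE objective below are assumptions, not the paper's design.

```python
# Latent synthesizer sketch: text embeddings -> pseudo acoustic latents.
import torch
import torch.nn as nn

class LatentSynthesizer(nn.Module):
    def __init__(self, text_dim=512, speech_latent_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, speech_latent_dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)   # pseudo acoustic representation

synth = LatentSynthesizer()
text_emb = torch.randn(4, 50, 512)        # token embeddings of paired text
speech_latent = torch.randn(4, 50, 768)   # latents from the frozen speech model
loss = nn.functional.mse_loss(synth(text_emb), speech_latent)
loss.backward()
```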

A Glance is Enough: Extract Target Sentence By Looking at A keyword

  • paper_url: http://arxiv.org/abs/2310.05352
  • repo_url: None
  • paper_authors: Ying Shi, Dong Wang, Lantian Li, Jiqing Han
  • for: To extract a target sentence from multi-talker speech given only a keyword as input; for example, in social security applications the keyword might be "help", and the goal is to extract the sentence of the person calling for help while ignoring other speakers.
  • methods: Uses a Transformer architecture to embed both the keyword and the speech utterance, then relies on a cross-attention mechanism to select the correct content from concatenated or overlapping speech.
  • results: On Librispeech, the method extracts target sentences from very noisy and mixed speech (SNR = -3 dB) with a phone error rate (PER) of 26%, versus 96% for the baseline system.
    Abstract This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input. For example, in social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers. To address this problem, we propose using the Transformer architecture to embed both the keyword and the speech utterance and then rely on the cross-attention mechanism to select the correct content from the concatenated or overlapping speech. Experimental results on Librispeech demonstrate that our proposed method can effectively extract target sentences from very noisy and mixed speech (SNR=-3dB), achieving a phone error rate (PER) of 26\%, compared to the baseline system's PER of 96%.
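The keyword-queries-speech idea maps naturally onto standard cross-attention: the keyword embedding acts as the query over the mixed-speech frames. The sizes and the phone-logit head below are illustrative assumptions, not the paper's exact architecture.

```python
# Keyword-conditioned extraction via cross-attention.
import torch
import torch.nn as nn

class KeywordExtractor(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_phones=64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, n_phones)  # phone logits from the summary

    def forward(self, keyword_emb, speech_frames):
        # keyword_emb: (B, 1, d) query; speech_frames: (B, T, d) keys/values
        attended, _ = self.cross_attn(keyword_emb, speech_frames, speech_frames)
        return self.out(attended)

model = KeywordExtractor()
logits = model(torch.randn(2, 1, 256), torch.randn(2, 100, 256))
```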

Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

  • paper_url: http://arxiv.org/abs/2310.05338
  • repo_url: None
  • paper_authors: Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, Pascale Fung
  • for: To measure object hallucination in vision-language (VL) models and thereby improve their reliability and trustworthiness.
  • methods: Introduces NOPE, a benchmark built by using large language models to generate 29.5k high-quality synthetic negative pronoun (NegP) instances for visual question answering.
  • results: No state-of-the-art VL model is immune to object hallucination: all 10 evaluated models score below 10% accuracy on NegP; lexically diverse visual questions, question types with large scopes, and scene-relevant objects raise the risk of object hallucination.
    Abstract Object hallucination poses a significant challenge in vision-language (VL) models, often leading to the generation of nonsensical or unfaithful responses with non-existent objects. However, the absence of a general measurement for evaluating object hallucination in VL models has hindered our understanding and ability to mitigate this issue. In this work, we present NOPE (Negative Object Presence Evaluation), a novel benchmark designed to assess object hallucination in VL models through visual question answering (VQA). We propose a cost-effective and scalable approach utilizing large language models to generate 29.5k synthetic negative pronoun (NegP) data of high quality for NOPE. We extensively investigate the performance of 10 state-of-the-art VL models in discerning the non-existence of objects in visual questions, where the ground truth answers are denoted as NegP (e.g., "none"). Additionally, we evaluate their standard performance on visual questions on 9 other VQA datasets. Through our experiments, we demonstrate that no VL model is immune to the vulnerability of object hallucination, as all models achieve accuracy below 10\% on NegP. Furthermore, we uncover that lexically diverse visual questions, question types with large scopes, and scene-relevant objects capitalize the risk of object hallucination in VL models.
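Scoring on NegP items reduces to checking whether a model's answer also asserts absence. A toy version follows; the set of negative answers and the example rows are illustrative.

```python
# Count a prediction as correct on a NegP item only if it asserts absence.
NEG_ANSWERS = {"none", "nothing", "no one", "zero"}

def negp_accuracy(predictions, golds):
    hits = sum(
        1 for p, g in zip(predictions, golds)
        if g in NEG_ANSWERS and p.strip().lower() in NEG_ANSWERS
    )
    total = sum(1 for g in golds if g in NEG_ANSWERS)
    return hits / max(total, 1)

preds = ["a red ball", "none"]
golds = ["none", "none"]
print(negp_accuracy(preds, golds))   # 0.5: the first answer hallucinates an object
```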

Resolving the Imbalance Issue in Hierarchical Disciplinary Topic Inference via LLM-based Data Augmentation

  • paper_url: http://arxiv.org/abs/2310.05318
  • repo_url: None
  • paper_authors: Xunxin Cai, Meng Xiao, Zhiyuan Ning, Yuanchun Zhou
  • for: This paper aims to address the issue of data imbalance in Natural Language Processing, specifically in the context of research proposals submitted for funding.
  • methods: The paper uses large language models (Llama V1) as data generators to augment research proposals categorized within intricate disciplinary hierarchies. The authors design prompts for keyword-based research proposal generation to rectify data imbalances and enhance the equity of expert assignments.
  • results: The experiments conducted in the paper demonstrate the efficacy of the generated data, showing that the research proposals produced using the prompts can effectively address the issue of data imbalance and generate high-quality scientific text data.
    Abstract In addressing the imbalance issue of data within the realm of Natural Language Processing, text data augmentation methods have emerged as pivotal solutions. This data imbalance is prevalent in the research proposals submitted during the funding application process. Such imbalances, resulting from the varying popularity of disciplines or the emergence of interdisciplinary studies, significantly impede the precision of downstream topic models that deduce the affiliated disciplines of these proposals. At the data level, proposals penned by experts and scientists are inherently complex technological texts, replete with intricate terminologies, so augmenting such specialized text data poses unique challenges. At the system level, this in turn compromises the fairness of AI-assisted reviewer assignment systems, putting a spotlight on solving this issue. This study leverages large language models (Llama V1) as data generators to augment research proposals categorized within intricate disciplinary hierarchies, aiming to rectify data imbalances and enhance the equity of expert assignments. We first sample within the hierarchical structure to find the under-represented classes. Then we design a prompt for keyword-based research proposal generation. Our experiments attest to the efficacy of the generated data, demonstrating that research proposals produced using the prompts effectively address the aforementioned issues and yield high-quality scientific text data, helping the model overcome the imbalance.
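The augmentation loop has two sketchable steps: find under-represented topics in the hierarchy, then build keyword-based generation prompts for the LLM. The count threshold and prompt template below are assumptions, not the paper's settings.

```python
# Find thin classes and draft keyword-based proposal-generation prompts.
from collections import Counter

def underrepresented(labels, threshold=100):
    counts = Counter(labels)
    return [topic for topic, c in counts.items() if c < threshold]

def make_prompt(topic, keywords):
    return (
        f"Write a short research proposal in the discipline '{topic}' "
        f"that uses the keywords: {', '.join(keywords)}."
    )

labels = ["cs.CL"] * 150 + ["q-bio.GN"] * 12   # toy label distribution
for topic in underrepresented(labels):
    print(make_prompt(topic, ["genome assembly", "long reads"]))
```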