cs.CL - 2023-09-16

The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated

  • paper_url: http://arxiv.org/abs/2309.09092
  • repo_url: None
  • paper_authors: Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki
  • for: Investigates the social biases learned by pre-trained language models and how the impact of debiasing on downstream tasks is evaluated.
  • methods: Applies debiasing methods to pre-trained models and compares the impact of debiasing on performance across multiple downstream tasks, using a wide range of benchmark datasets containing female, male, and stereotypical words.
  • results: Experiments show that the effects of debiasing are consistently underestimated across all tasks; to evaluate the impact of debiasing reliably, instances containing female, male, and stereotypical words should be considered separately rather than pooled with the rest of the benchmark.
    Abstract Pre-trained language models trained on large-scale data have learned serious levels of social biases. Consequently, various methods have been proposed to debias pre-trained models. Debiasing methods need to mitigate only discriminatory bias information from the pre-trained models, while retaining information that is useful for the downstream tasks. In previous research, whether useful information is retained has been confirmed by the performance of downstream tasks in debiased pre-trained models. On the other hand, it is not clear whether these benchmarks consist of data pertaining to social biases and are appropriate for investigating the impact of debiasing. For example, in gender-related social biases, data containing female words (e.g. ``she, female, woman''), male words (e.g. ``he, male, man''), and stereotypical words (e.g. ``nurse, doctor, professor'') are considered to be the most affected by debiasing. If there is not much data containing these words in a benchmark dataset for a target task, there is the possibility of erroneously evaluating the effects of debiasing. In this study, we compare the impact of debiasing on performance across multiple downstream tasks using a wide range of benchmark datasets containing female, male, and stereotypical words. Experiments show that the effects of debiasing are consistently \emph{underestimated} across all tasks. Moreover, the effects of debiasing could be more reliably evaluated by separately considering instances containing female, male, and stereotypical words than by considering all of the instances in a benchmark dataset.
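The paper's evaluation idea — score debiased models separately on instances containing female, male, and stereotypical words rather than on the benchmark as a whole — is easy to picture in code. The sketch below is an illustration, not the authors' implementation; the word lists are placeholders drawn from the abstract's examples:

```python
# Minimal sketch of per-subset evaluation. Word lists are illustrative
# placeholders taken from the abstract, not the paper's full lexicons.
FEMALE = {"she", "female", "woman"}
MALE = {"he", "male", "man"}
STEREOTYPICAL = {"nurse", "doctor", "professor"}

def subset_scores(examples, predict):
    """examples: list of (text, gold_label); predict: text -> label."""
    buckets = {"female": [], "male": [], "stereotypical": [], "other": []}
    for text, gold in examples:
        tokens = set(text.lower().split())
        if tokens & FEMALE:
            buckets["female"].append((text, gold))
        elif tokens & MALE:
            buckets["male"].append((text, gold))
        elif tokens & STEREOTYPICAL:
            buckets["stereotypical"].append((text, gold))
        else:
            buckets["other"].append((text, gold))
    # Accuracy per subset: a debiasing effect diluted in the full benchmark
    # shows up clearly in the female/male/stereotypical buckets.
    return {name: sum(predict(t) == y for t, y in sub) / len(sub)
            for name, sub in buckets.items() if sub}
```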

Improving Speech Recognition for African American English With Audio Classification

  • paper_url: http://arxiv.org/abs/2309.09996
  • repo_url: None
  • paper_authors: Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara Sainath, Françoise Beaufays, Pedro Moreno Mengibar
  • for: Reducing quality disparities of a short-form speech recognition system across different language varieties.
  • methods: Uses a small amount of out-of-domain (long-form) African American English (AAE) data to improve the robustness of a US English short-form speech recognizer.
  • results: An audio classifier trained on CORAAL, YouTube, and Mozilla Common Voice can approximately determine whether an utterance is AAE or another variety, including Mainstream American English (MAE). Combining the classifier output with coarse geographic information selects a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale; fine-tuning on this data yields a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
    Abstract Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
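The core mechanism — combining classifier scores with coarse geographic information to pick untranscribed utterances for semi-supervised learning — can be sketched as below. All names and the threshold are hypothetical stand-ins; the paper does not publish this code:

```python
# Illustrative sketch of the data-selection step, not the released system.
# `classify_aae` (audio -> P(AAE)) and `in_aae_dense_region` (a coarse
# geographic prior) are hypothetical stand-ins; the 0.8 threshold is invented.
def select_for_semisupervised(utterances, classify_aae, in_aae_dense_region,
                              threshold=0.8):
    """utterances: iterable of dicts with 'audio' and 'region' keys."""
    selected = []
    for utt in utterances:
        if (classify_aae(utt["audio"]) >= threshold
                and in_aae_dense_region(utt["region"])):
            selected.append(utt)  # goes into the semi-supervised pool
    return selected
```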

  • paper_url: http://arxiv.org/abs/2309.09069
  • repo_url: None
  • paper_authors: Thi-Hai-Yen Vuong, Minh-Quan Hoang, Tan-Minh Nguyen, Hoang-Trung Nguyen, Ha-Thanh Nguyen
  • for: Proposes a knowledge graph construction method for legal case documents and related laws, aiming to organize legal information efficiently and enhance downstream tasks.
  • methods: The method consists of three main steps: data crawling, information extraction, and knowledge graph deployment. A crawler first collects a large corpus of legal case documents and related laws from various sources, providing a rich database for further processing. Information extraction then applies natural language processing techniques to extract entities such as courts, cases, domains, and laws, together with their relationships, from the unstructured text. Finally, the knowledge graph connects these entities based on the extracted relationships, creating a heterogeneous graph that effectively represents legal information for users such as lawyers, judges, and scholars (see the sketch after this entry).
  • results: A baseline model built on unsupervised learning methods and supported by the knowledge graph can identify relevant laws for a given legal case, opening up applications such as legal case analysis, legal recommendation, and decision support.
    Abstract This paper presents a knowledge graph construction method for legal case documents and related laws, aiming to organize legal information efficiently and enhance various downstream tasks. Our approach consists of three main steps: data crawling, information extraction, and knowledge graph deployment. First, the data crawler collects a large corpus of legal case documents and related laws from various sources, providing a rich database for further processing. Next, the information extraction step employs natural language processing techniques to extract entities such as courts, cases, domains, and laws, as well as their relationships from the unstructured text. Finally, the knowledge graph is deployed, connecting these entities based on their extracted relationships, creating a heterogeneous graph that effectively represents legal information and caters to users such as lawyers, judges, and scholars. The established baseline model leverages unsupervised learning methods, and by incorporating the knowledge graph, it demonstrates the ability to identify relevant laws for a given legal case. This approach opens up opportunities for various applications in the legal domain, such as legal case analysis, legal recommendation, and decision support.
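To make the deployment step concrete, here is a toy heterogeneous graph in networkx. Every identifier and relation name below is invented for illustration; the point is only that, once entities and relations are loaded, "find relevant laws for a case" reduces to a graph query:

```python
# Toy sketch of the knowledge-graph deployment step using networkx.
# Node IDs and relation names are invented; the paper's schema covers
# courts, cases, domains, and laws with extracted relations.
import networkx as nx

G = nx.MultiDiGraph()
for node, kind in [("case:123", "case"), ("court:supreme", "court"),
                   ("law:civil_code_art_5", "law"), ("domain:contracts", "domain")]:
    G.add_node(node, kind=kind)

G.add_edge("case:123", "court:supreme", relation="tried_by")
G.add_edge("case:123", "law:civil_code_art_5", relation="cites")
G.add_edge("case:123", "domain:contracts", relation="belongs_to")

# The baseline task -- identify relevant laws for a given case -- becomes a
# graph query once the heterogeneous graph is in place.
relevant_laws = [v for _, v in G.out_edges("case:123")
                 if G.nodes[v]["kind"] == "law"]
print(relevant_laws)  # ['law:civil_code_art_5']
```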

Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF

  • paper_url: http://arxiv.org/abs/2309.09055
  • repo_url: https://github.com/simengsun/alpaca_farm_lora
  • paper_authors: Simeng Sun, Dhawal Gupta, Mohit Iyyer
  • for: Empirically investigates an efficient implementation of RLHF using low-rank adaptation (LoRA), which enables RLHF with two A100 GPUs instead of the eight required for full-model fine-tuning.
  • methods: Implements RLHF with LoRA and analyzes several configurations of the LoRA-based PPO implementation, including removing the KL regularization term, substituting other regularizers such as Jensen-Shannon divergence, and measuring the effect of training with LoRA.
  • results: (1) Under the LoRA setup, removing the KL regularization term does not harm performance on the AlpacaFarm evaluation set; (2) other regularizers, such as Jensen-Shannon divergence, improve performance; (3) PPO training negatively impacts the factuality of model-generated responses, but training with LoRA largely mitigates this effect.
    Abstract During the last stage of RLHF, a large language model is aligned to human intents via PPO training, a process that generally requires large-scale computational resources. In this technical report, we empirically investigate an efficient implementation of RLHF using low-rank adaptation (LoRA), which allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two A100 GPUs instead of the eight required for full model fine-tuning. Despite tuning only 0.2% of LLaMA 7B's parameters, our implementation achieves better performance than the publicly-released AlpacaFarm checkpoint with full model fine-tuning. Next, we analyze several configurations of our LoRA-based PPO implementation, varying the form of the KL regularization term in the training objective. We find that (1) removing this penalty term does not harm performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other regularizers, such as Jensen-Shannon divergence, lead to improved performance; and (3) while PPO training negatively impacts the factuality of model-generated responses, training with LoRA largely mitigates this effect. We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
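For readers who want the compared regularizers in concrete form, here is a minimal sketch of the KL and Jensen-Shannon penalties over policy and frozen-reference outputs. It illustrates the math under standard RLHF conventions; it is not the repository's training loop:

```python
# Sketch of the two regularizers the report compares. Illustration only.
import torch
import torch.nn.functional as F

def kl_penalty(logp_policy, logp_ref):
    # Sampled-token estimate of KL(policy || ref): E[log pi(a) - log ref(a)].
    # Setting its coefficient to zero corresponds to "removing the KL term".
    return (logp_policy - logp_ref).mean()

def js_penalty(logits_policy, logits_ref, eps=1e-12):
    # Jensen-Shannon divergence over full next-token distributions:
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    p = F.softmax(logits_policy, dim=-1)
    q = F.softmax(logits_ref, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(eps).log() - m.clamp_min(eps).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(eps).log() - m.clamp_min(eps).log())).sum(-1)
    return (0.5 * kl_pm + 0.5 * kl_qm).mean()

# In PPO: reward_total = task_reward - beta * penalty(...)
```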

Context-aware Adversarial Attack on Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2309.08999
  • repo_url: None
  • paper_authors: Shuguang Chen, Leonardo Neves, Thamar Solorio
  • for: Studying model robustness on the named entity recognition task via context-aware adversarial attacks.
  • methods: Proposes perturbing the words most informative for recognizing entities to create adversarial examples, and investigates different candidate replacement methods to generate natural and plausible adversarial examples.
  • results: Experiments and analyses show that the proposed methods are more effective at deceiving the model into making wrong predictions than strong baselines.
    Abstract In recent years, large pre-trained language models (PLMs) have achieved remarkable performance on many natural language processing benchmarks. Despite their success, prior studies have shown that PLMs are vulnerable to attacks from adversarial examples. In this work, we focus on the named entity recognition task and study context-aware adversarial attack methods to examine the model's robustness. Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples and investigate different candidate replacement methods to generate natural and plausible adversarial examples. Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.
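One way to read "perturbing the most informative words" in code is a masking probe: a context word is informative for an entity if hiding it hurts the model's confidence in that entity. The sketch below is one plausible instantiation, not the paper's exact scoring function; `score_entities` is a hypothetical stand-in for the victim model's confidence in its original entity labels:

```python
# Hedged sketch: rank context words by leave-one-out (masking) importance.
def rank_informative_words(tokens, entity_positions, score_entities,
                           mask="[MASK]"):
    base = score_entities(tokens)
    drops = []
    for i in range(len(tokens)):
        if i in entity_positions:
            continue  # perturb the context, not the entity mentions themselves
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        drops.append((base - score_entities(masked), i))
    # Larger confidence drop when masked => more informative context word,
    # hence a better target for candidate replacement.
    return [i for _, i in sorted(drops, reverse=True)]
```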

Rethinking STS and NLI in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.08969
  • repo_url: None
  • paper_authors: Yuxia Wang, Minghan Wang, Preslav Nakov
  • for: Rethinking semantic textual similarity (STS) and natural language inference (NLI) in the era of large language models (LLMs).
  • methods: First evaluates the accuracy of clinical/biomedical STS and NLI over five datasets, then assesses LLM predictive confidence and the models' capability of capturing collective human opinions.
  • results: LLMs can provide personalized descriptions for a specific topic, or generate semantically similar content in different tones, but current LLMs struggle to make personalized judgements or decisions. Zero-shot ChatGPT achieves accuracy on clinical and biomedical STS/NLI competitive with fine-tuned BERT-base, but there is large variation across samples, and ensembled results perform best.
    Abstract In this study, we aim to rethink STS and NLI in the era of large language models (LLMs). We first evaluate the accuracy of clinical/biomedical STS and NLI over five datasets, and then we assess LLM predictive confidence and their capability of capturing collective human opinions. We find that LLMs may be able to provide personalised descriptions for a specific topic, or to generate semantically similar content in different tones, but that it is hard for current LLMs to make personalised judgements or decisions. We further find that zero-shot ChatGPT achieves accuracy on clinical and biomedical STS/NLI competitive with the fine-tuned BERT-base. However, there is large variation across samples, and ensembled results perform the best.

Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)

  • paper_url: http://arxiv.org/abs/2309.08968
  • repo_url: None
  • paper_authors: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
  • for: Explores how the SortedNet training technique can enable dynamic inference for large language models, without any pretraining and at the same cost as standard fine-tuning.
  • methods: Extends SortedNet to generative NLP tasks: network modularity is leveraged to create sub-models with varying computational loads, sorted by their computation/accuracy characteristics in a nested manner, and standard supervised fine-tuning (SFT) is replaced with Sorted Fine-Tuning (SoFT); see the sketch after this entry.
  • results: Sorted Fine-Tuning (SoFT) makes large language models dynamic at inference time and improves efficiency, eliminating the need for multiple models for different computational/latency budgets; applied to LLaMa 2 13B tuned on the Stanford Alpaca dataset and evaluated via the PandaLM benchmark, it delivers models twice as fast as the original while maintaining or exceeding performance.
    Abstract The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). While these models excel at understanding and generating human-like text, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference for deep neural networks. It leverages network modularity to create sub-models with varying computational loads, sorting them based on computation/accuracy characteristics in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pretraining and by only replacing standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT) at the same costs. Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that using this approach, we are able to unlock the potential of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. By applying this approach on LLaMa 2 13B for tuning on the Stanford Alpaca dataset and comparing it to normal tuning and early exit via PandaLM benchmark, we show that Sorted Fine-Tuning can deliver models twice as fast as the original model while maintaining or exceeding performance.
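The SoFT idea can be summarized in a few lines: compute the language-modeling loss at several nested exit depths so that each sub-network is trained directly. The following is a conceptual sketch under stated assumptions, not the authors' recipe — `backbone` returning per-layer hidden states, the exit depths, and the shared LM head are all illustrative:

```python
# Conceptual sketch of Sorted Fine-Tuning with shared-head multi-exit loss.
import torch
import torch.nn.functional as F

def soft_loss(model, input_ids, labels, exit_layers=(8, 16, 24, 40)):
    hidden = model.backbone(input_ids)  # assumed: list of per-layer states
    total = 0.0
    for depth in exit_layers:
        logits = model.lm_head(hidden[depth - 1])  # assumed shared head
        total = total + F.cross_entropy(logits.view(-1, logits.size(-1)),
                                        labels.view(-1))
    # Every prefix of the network is trained as a usable sub-model; at
    # inference, any exit depth can be chosen to fit the latency budget.
    return total / len(exit_layers)
```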

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

  • paper_url: http://arxiv.org/abs/2309.08963
  • repo_url: https://github.com/gersteinlab/struc-bench
  • paper_authors: Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein
  • for: Assesses the capability of current large language models (LLMs) to generate complex structured data and proposes a structure-aware fine-tuning approach to improve this ability.
  • methods: Introduces Struc-Bench, which evaluates representative LLMs (GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna) on carefully constructed datasets spanning raw text, HTML, and LaTeX tables.
  • results: Identifies specific common formatting errors and areas of potential improvement in current models; FormatCoT (Chain-of-Thought) is used to generate format instructions from target outputs. With structure-aware fine-tuning applied to LLaMA-7B, adherence to natural language constraints improves significantly, outperforming the other evaluated LLMs. An ability map over six dimensions (coverage, formatting, reasoning, comprehension, pragmatics, and hallucination) highlights remaining weaknesses.
    Abstract Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, including five representative LLMs (i.e., GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna), and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities from six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.

ODSum: New Benchmarks for Open Domain Multi-Document Summarization

  • paper_url: http://arxiv.org/abs/2309.08960
  • repo_url: None
  • paper_authors: Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
  • for: Proposes a rule-based method for processing query-based document summarization datasets into open-domain multi-document summarization (ODMDS) datasets.
  • methods: Introduces ODSum, a novel dataset whose document indices are interdependent and often interrelated, and tackles ODMDS with the retrieve-then-summarize method, investigating the performance of a range of retrievers and summarizers.
  • results: Extensive experiments identify variances in evaluation metrics and provide insights into their reliability; LLMs suffer large performance losses from retrieval errors. The authors further experiment with methods to improve performance and investigate robustness against imperfect retrieval.
    Abstract Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure the retrieving performance. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets. Based on this method, we introduce a novel dataset, ODSum, a sophisticated case with its document index interdependent and often interrelated. We tackle ODMDS with the \textit{retrieve-then-summarize} method, and the performance of a list of retrievers and summarizers is investigated. Through extensive experiments, we identify variances in evaluation metrics and provide insights into their reliability. We also found that LLMs suffer great performance loss from retrieving errors. We further experimented methods to improve the performance as well as investigate their robustness against imperfect retrieval. We will release our data and code at https://github.com/yale-nlp/ODSum.
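The evaluated pipeline composes two pluggable stages, which is worth seeing explicitly because retrieval errors propagate straight into the summarizer — the performance loss the paper measures for LLMs. A minimal skeleton, with both stages as placeholders for the specific systems benchmarked:

```python
# Skeleton of the retrieve-then-summarize pipeline. `retriever` scores a
# (query, doc) pair; `summarizer` conditions on the query and top-k docs.
def retrieve_then_summarize(query, corpus, retriever, summarizer, k=5):
    ranked = sorted(corpus, key=lambda doc: retriever(query, doc), reverse=True)
    return summarizer(query, ranked[:k])  # summary conditioned on top-k docs
```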

Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals

  • paper_url: http://arxiv.org/abs/2309.08949
  • repo_url: None
  • paper_authors: Zhiyuan Hu, Yue Feng, Yang Deng, Zekun Li, See-Kiong Ng, Anh Tuan Luu, Bryan Hooi
  • for: Improving the efficiency, success rate, and user satisfaction of task-oriented dialogue systems.
  • methods: Introduces ProToD (Proactively Goal-Driven LLM-Induced ToD), which anticipates future dialogue actions and incorporates a goal-oriented reward signal; also presents a novel evaluation method based on goal-driven dialogue simulations.
  • results: On the MultiWoZ 2.1 dataset, achieves superior performance using only 10% of the data compared to previous end-to-end fully supervised models, accompanied by enhanced user satisfaction and efficiency.
    Abstract Recently, the development of large language models (LLMs) has significantly enhanced question answering and dialogue generation, making them increasingly popular in current practical scenarios. Unlike general dialogue systems, which emphasize semantic performance, task-oriented dialogue (ToD) systems aim to achieve the dialogue goal efficiently and successfully in multiple turns. Unfortunately, existing LLM-induced ToD systems lack a direct reward toward the final goal and do not take account of the dialogue proactivity that can strengthen dialogue efficiency. To fill these gaps, we introduce the ProToD (Proactively Goal-Driven LLM-Induced ToD) approach, which anticipates future dialogue actions and incorporates the goal-oriented reward signal to enhance ToD systems. Additionally, we present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations. This method allows us to gauge user satisfaction, system efficiency, and success rate while overcoming the limitations of current Information and Success metrics. Empirical experiments conducted on the MultiWoZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data compared to previous end-to-end fully supervised models. This improvement is accompanied by enhanced user satisfaction and efficiency.

Contextual Label Projection for Cross-Lingual Structure Extraction

  • paper_url: http://arxiv.org/abs/2309.08943
  • repo_url: None
  • paper_authors: Tanmay Parekh, I-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, Nanyun Peng
  • for: Creating pseudo-training data in target languages for structure extraction tasks, specifically event argument extraction.
  • methods: Proposes CLAP, which translates text to the target language and performs contextual translation on the labels using the translated text as the context, with instruction-tuned language models with multilingual capabilities serving as the contextual translator.
  • results: CLAP improves the F1-score by 2-2.5 points over other label projection techniques on the Chinese and Arabic ACE05 datasets.
    Abstract Translating training data into target languages has proven beneficial for cross-lingual transfer. However, for structure extraction tasks, translating data requires a label projection step, which translates input text and obtains translated labels in the translated text jointly. Previous research in label projection mostly compromises translation quality by either facilitating easy identification of translated labels from translated text or using word-level alignment between translation pairs to assemble translated phrase-level labels from the aligned words. In this paper, we introduce CLAP, which first translates text to the target language and performs contextual translation on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We compare CLAP with other label projection techniques for creating pseudo-training data in target languages on event argument extraction, a representative structure extraction task. Results show that CLAP improves by 2-2.5 F1-score over other methods on the Chinese and Arabic ACE05 datasets.
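Since CLAP's contextual translator is an instruction-tuned LM, the mechanism reduces to a constrained translation prompt. The template below is a plausible rendering of that instruction, not the paper's exact prompt; the wording and example are invented:

```python
# Hypothetical prompt template for contextual label projection: the
# already-translated sentence is supplied as context, and the label's
# translation is constrained to appear in it.
def clap_prompt(source_label, translated_sentence, target_lang):
    return (
        f"Translate the phrase '{source_label}' into {target_lang} such that "
        f"your translation appears verbatim in this sentence:\n"
        f"{translated_sentence}\n"
        f"Answer with the translated phrase only."
    )

print(clap_prompt("the White House", "他们上周参观了白宫。", "Chinese"))
```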

Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding

  • paper_url: http://arxiv.org/abs/2309.08929
  • repo_url: None
  • paper_authors: Kaiyan Zhao, Qiyu Wu, Xin-Qiang Cai, Yoshimasa Tsuruoka
  • for: Learning multi-lingual sentence embeddings, a fundamental and significant task in natural language processing.
  • methods: Proposes MPCL, a novel approach that effectively utilizes multiple positive instances in contrastive learning to improve multi-lingual sentence embeddings.
  • results: Compared with conventional contrastive learning, MPCL yields better retrieval, semantic similarity, and classification performance across various backbone models and downstream tasks; on unseen languages, sentence embedding models trained on multiple positives also show better cross-lingual transfer than models trained on a single positive instance.
    Abstract Learning multi-lingual sentence embeddings is a fundamental and significant task in natural language processing. Recent trends of learning both mono-lingual and multi-lingual sentence embeddings are mainly based on contrastive learning (CL) with an anchor, one positive, and multiple negative instances. In this work, we argue that leveraging multiple positives should be considered for multi-lingual sentence embeddings because (1) positives in a diverse set of languages can benefit cross-lingual learning, and (2) transitive similarity across multiple positives can provide reliable structural information to learn. In order to investigate the impact of CL with multiple positives, we propose a novel approach MPCL to effectively utilize multiple positive instances to improve learning multi-lingual sentence embeddings. Our experimental results on various backbone models and downstream tasks support that compared with conventional CL, MPCL leads to better retrieval, semantic similarity, and classification performances. We also observe that on unseen languages, sentence embedding models trained on multiple positives have better cross-lingual transferring performance than models trained on a single positive instance.
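The "multiple positives" idea maps naturally onto a small change to the standard InfoNCE loss: average the per-positive terms instead of using a single positive. Here is one hedged sketch of such an objective — the paper's exact formulation may differ:

```python
# Sketch of a multi-positive contrastive objective in the spirit of MPCL.
import torch
import torch.nn.functional as F

def multi_positive_nce(anchor, positives, negatives, tau=0.05):
    """anchor: (d,); positives: (P, d); negatives: (N, d) embeddings."""
    pos = F.cosine_similarity(anchor.unsqueeze(0), positives) / tau  # (P,)
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives) / tau  # (N,)
    log_denom = torch.logsumexp(torch.cat([pos, neg]), dim=0)
    # -log p(positive) averaged over the P positives (e.g., the same sentence
    # in several languages), injecting transitive similarity structure.
    return (log_denom - pos).mean()
```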

Multimodal Multi-Hop Question Answering Through a Conversation Between Tools and Efficiently Finetuned Large Language Models

  • paper_url: http://arxiv.org/abs/2309.08922
  • repo_url: None
  • paper_authors: Hossein Rajabzadeh, Suyuchen Wang, Hyock Ju Kwon, Bang Liu
  • for: Answering complex multimodal multi-hop questions
  • methods: Uses a large language model (LLM) to divide a given multimodal multi-hop question into unimodal single-hop sub-questions, each answered by the appropriate tool from a predefined set; chatGPT is prompted to generate a tool-interacting divide-and-conquer dataset that is then used to efficiently fine-tune the LLM.
  • results: Evaluation on two recently introduced complex question-answering datasets shows substantial improvements over existing state-of-the-art solutions, indicating the effectiveness and generality of the proposed strategy.
    Abstract We employ a tool-interacting divide-and-conquer strategy enabling large language models (LLMs) to answer complex multimodal multi-hop questions. In particular, we harness the power of large language models to divide a given multimodal multi-hop question into unimodal single-hop sub-questions to be answered by the appropriate tool from a predefined set of tools. After all corresponding tools provide the LLM with their answers, the LLM generates the next relevant unimodal single-hop question. To increase the reasoning ability of LLMs, we prompt chatGPT to generate a tool-interacting divide-and-conquer dataset. This dataset is then used to efficiently finetune the corresponding LLM. To assess the effectiveness of this approach, we conduct an evaluation on two recently introduced complex question-answering datasets. The experimental analysis demonstrate substantial improvements over existing state-of-the-art solutions, indicating the efficacy and generality of our strategy
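The conversation between the LLM and its tools is easiest to see as a loop. The sketch below is a hedged rendering of the strategy with invented interfaces (`llm_next_step`, the action tuples), not the authors' code:

```python
# High-level sketch of the tool-interacting divide-and-conquer loop: the LLM
# alternates between proposing a unimodal single-hop sub-question routed to a
# tool and consuming that tool's answer.
def answer_multihop(question, llm_next_step, tools, max_hops=5):
    """llm_next_step: (question, history) -> ("ask", tool_name, sub_question)
    or ("final", answer); tools: dict mapping tool_name -> callable."""
    history = []
    for _ in range(max_hops):
        action = llm_next_step(question, history)
        if action[0] == "final":
            return action[1]
        _, tool_name, sub_question = action
        history.append((sub_question, tools[tool_name](sub_question)))
    raise RuntimeError("no final answer within max_hops")
```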

Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models

  • paper_url: http://arxiv.org/abs/2309.08902
  • repo_url: None
  • paper_authors: Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim
  • for: Investigates whether LLMs hold biases against social groups along less-studied but still consequential dimensions, which risks affecting decisions such as job hiring, human performance evaluation, and criminal sentencing.
  • methods: Introduces a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group, plus the reversed task of selecting the social group based on an attribute.
  • results: Reports correlations between social groups and unrelated positive and negative attributes for multiple cutting-edge LLMs along dimensions such as age and beauty, similar to the "what is beautiful is good" bias found in people in experimental psychology.
    Abstract LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks the introduction of LLM biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less studied, but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs (especially autoregressive language models) make between social groups and unrelated positive and negative attributes. We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the ``what is beautiful is good'' bias found in people in experimental psychology. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. Finally, we report the correlations that we find for multiple cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.

Semantic Information Extraction for Text Data with Probability Graph

  • paper_url: http://arxiv.org/abs/2309.08879
  • repo_url: None
  • paper_authors: Zhouxiang Zhao, Zhaohui Yang, Ye Hu, Licheng Lin, Zhaoyang Zhang
  • for: Studies semantic information extraction for text data transmission under limited communication resources, to improve transmission efficiency.
  • methods: Extracts semantic information from the original text with natural language processing techniques and captures it in a knowledge graph, with an additional probability dimension capturing the importance of each piece of information; the extraction problem is posed as an optimization framework whose goal is to extract the most important semantic information for transmission, solved by a Floyd's-algorithm-based method coupled with an efficient sorting mechanism (see the sketch after this entry).
  • results: Numerical results demonstrate the effectiveness of the proposed algorithm with respect to two novel performance metrics: semantic uncertainty and semantic similarity.
    Abstract In this paper, the problem of semantic information extraction for resource constrained text data transmission is studied. In the considered model, a sequence of text data need to be transmitted within a communication resource-constrained network, which only allows limited data transmission. Thus, at the transmitter, the original text data is extracted with natural language processing techniques. Then, the extracted semantic information is captured in a knowledge graph. An additional probability dimension is introduced in this graph to capture the importance of each information. This semantic information extraction problem is posed as an optimization framework whose goal is to extract most important semantic information for transmission. To find an optimal solution for this problem, a Floyd's algorithm based solution coupled with an efficient sorting mechanism is proposed. Numerical results testify the effectiveness of the proposed algorithm with regards to two novel performance metrics including semantic uncertainty and semantic similarity.
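As a reference point for the solution method, here is the textbook Floyd-Warshall routine that a Floyd's-algorithm-based solver builds on, applied to a weighted graph. How edge weights are derived from the importance probabilities is the paper's contribution and is deliberately left abstract here:

```python
# Classic Floyd-Warshall all-pairs shortest paths over a weighted graph.
INF = float("inf")

def floyd_warshall(weights):
    """weights: n x n matrix; weights[i][j] = edge cost, INF if no edge."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist  # dist[i][j]: cheapest path between semantic nodes i and j
```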

X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs

  • paper_url: http://arxiv.org/abs/2309.08873
  • repo_url: None
  • paper_authors: Juan Diego Rodriguez, Katrin Erk, Greg Durrett
  • for: Identifying when two paragraphs in different languages convey the same information, and where they diverge.
  • methods: Introduces X-PARADE, the first cross-lingual dataset of paragraph-level information divergences, with aligned paragraphs sourced from Wikipedia pages in different languages; investigates approaches including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting of large language models.
  • results: These methods vary in their capability to handle inferable information, but all fall short of human performance.
    Abstract Understanding when two pieces of text convey the same information is a goal touching many subproblems in NLP, including textual entailment and fact-checking. This problem becomes more complex when those two pieces of text are in different languages. Here, we introduce X-PARADE (Cross-lingual Paragraph-level Analysis of Divergences and Entailments), the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language, indicating whether a given piece of information is the same, new, or new but can be inferred. This last notion establishes a link with cross-language NLI. Aligned paragraphs are sourced from Wikipedia pages in different languages, reflecting real information divergences observed in the wild. Armed with our dataset, we investigate a diverse set of approaches for this problem, including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting of large language models. Our results show that these methods vary in their capability to handle inferable information, but they all fall short of human performance.

Has Sentiment Returned to the Pre-pandemic Level? A Sentiment Analysis Using U.S. College Subreddit Data from 2019 to 2022

  • paper_url: http://arxiv.org/abs/2309.08845
  • repo_url: https://github.com/alvayan/postcovidsentianalysis
  • paper_authors: Tian Yan, Fang Liu
  • for: Explores how people's emotions changed from before the pandemic, through the pandemic, to the post-emergency period, and whether sentiment has returned to the pre-pandemic level.
  • methods: Collects Reddit data from 2019, 2020, 2021, and 2022 from the subreddits of 128 U.S. universities/colleges; predicts sentiment with a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) and a graph attention network (GAT), combined via logistic stacking, and then fits a generalized linear mixed-effects model to estimate the temporal trend.
  • results: Compared to 2019, the odds of negative sentiment in 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, all statistically significant (adjusted $p$<0.05), suggesting a partial recovery in sentiment composition in the post-pandemic-emergency era (see the worked arithmetic after this entry).
    Abstract As impact of COVID-19 pandemic winds down, both individuals and society gradually return to pre-pandemic activities. This study aims to explore how people's emotions have changed from the pre-pandemic during the pandemic to post-emergency period and whether it has returned to pre-pandemic level. We collected Reddit data in 2019 (pre-pandemic), 2020 (peak pandemic), 2021, and 2022 (late stages of pandemic, transitioning period to post-emergency period) from subreddits in 128 universities/colleges in the U.S., and a set of school-level characteristics. We predicted two sets of sentiments from a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) and graph attention network (GAT) that leverages both rich semantic and relational information among posted messages and then applied a logistic stacking method to obtain the final sentiment classification. After obtaining sentiment label for each message, we used a generalized linear mixed-effects model to estimate temporal trend in sentiment from 2019 to 2022 and how school-level factors may affect sentiment. Compared to the year 2019, the odds of negative sentiment in years 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, which are all statistically significant(adjusted $p$<0.05). Our study findings suggest a partial recovery in the sentiment composition in the post-pandemic-emergency era. The results align with common expectations and provide a detailed quantification of how sentiments have evolved from 2019 to 2022.
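To connect the reported percentages to the generalized linear mixed-effects model: each year effect is an odds ratio, i.e. the exponential of that year's coefficient. The back-of-the-envelope below reverses that mapping from the reported numbers, purely for illustration:

```python
# In a logistic mixed-effects model, a year coefficient beta corresponds to
# an odds ratio exp(beta). Betas are back-computed from the reported odds
# increases versus 2019 (24%, 4.3%, 10.3%).
import math

for year, increase in [(2020, 0.24), (2021, 0.043), (2022, 0.103)]:
    beta = math.log(1.0 + increase)
    print(f"{year}: beta = {beta:.3f}, odds ratio = {math.exp(beta):.3f}")
# 2020: beta = 0.215, odds ratio = 1.240  (24% higher odds than 2019)
```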

EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning

  • paper_url: http://arxiv.org/abs/2309.10687
  • repo_url: None
  • paper_authors: Rajasekhar Reddy Mekala, Yasaman Razeghi, Sameer Singh
  • for: Improving the in-context learning performance of large language models.
  • methods: Introduces EchoPrompt, a simple yet effective strategy that prompts the model to rephrase its query before answering it, inspired by self-questioning, the cognitive strategy humans use to vocalize queries before providing answers.
  • results: EchoPrompt yields substantial improvements in both zero-shot and few-shot in-context learning with standard and chain-of-thought prompting across numerical reasoning (GSM8K, SVAMP, MultiArith, SingleOp), reading comprehension (DROP, SQuAD), and logical reasoning (Shuffled Objects, Date Understanding, Coin Flipping) tasks; on average, it improves the Zero-shot-CoT performance of code-davinci-002 by 5% on numerical tasks and 13% on reading comprehension tasks.
    Abstract Large language models primarily rely on incontext learning to execute tasks. We introduce EchoPrompt, a simple yet effective approach to prompt the model to rephrase its queries before answering them. EchoPrompt is inspired by self-questioning, a cognitive strategy humans use to vocalize queries before providing answers, thereby reducing misconceptions. Experimental results demonstrate that EchoPrompt leads to substantial improvements in both zero-shot and few-shot in-context learning with standard and chain-of-thought prompting on four families of causal language models. These improvements are observed across various numerical reasoning (GSM8K, SVAMP, MultiArith, SingleOp), reading comprehension (DROP, SQuAD), and logical reasoning (Shuffled Objects, Date Understanding, Coin Flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. We investigate the effectiveness of EchoPrompt through ablation studies, which reveal the significance of both original and rephrased queries for EchoPrompt's efficacy. Our empirical results show that EchoPrompt is an effective technique that can easily augment in-context learning for better performance.
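Concretely, EchoPrompt amounts to one extra instruction in the prompt. A hedged sketch of such a template follows; the paper's exact wording may differ:

```python
# Hypothetical EchoPrompt-style template: instruct the model to restate the
# query in its own words before answering.
def echo_prompt(query):
    return (
        f"Q: {query}\n"
        "First, repeat the question in your own words. "
        "Then answer it step by step.\n"
        "Rephrased question:"
    )

print(echo_prompt("A train travels 60 miles in 1.5 hours. "
                  "What is its average speed?"))
```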