cs.CL - 2023-07-16

Automatic Identification of Alzheimer’s Disease using Lexical Features extracted from Language Samples

  • paper_url: http://arxiv.org/abs/2307.08070
  • repo_url: None
  • paper_authors: M. Zakaria Kurdi
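  • for: This study aims to improve the understanding of how Alzheimer's disease (AD) affects different aspects of the lexicon, and to show that these lexical aspects, used as features of a machine learning classifier, can achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD.
  • methods: The study uses the ADDreSS challenge dataset, which is part of the DementiaBank corpus. It consists of transcripts of Cookie Theft picture descriptions produced by 54 subjects in the training part and 24 subjects in the test part, giving 108 narrative samples in the training set and 48 in the test set. First, the impact of AD on 99 selected lexical features is analyzed using both the training and test parts. Next, machine learning experiments compare different areas of lexical complexity and identify the feature subset that gives optimal performance. Finally, the effect of input size on classification is studied.
  • results: Using lexical features alone, state-of-the-art classification (F1 and accuracy above 91%) is achieved in distinguishing language samples produced by individuals with AD from those produced by healthy control subjects. This confirms that AD has a substantial impact on lexical processing.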
    Abstract Objective: this study has a twofold goal. First, it aims to improve the understanding of the impact of Dementia of type Alzheimer's Disease (AD) on different aspects of the lexicon. Second, it aims to demonstrate that such aspects of the lexicon, when used as features of a machine learning classifier, can help achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD. Methods: data is derived from the ADDreSS challenge, which is a part of the DementiaBank corpus. The used dataset consists of transcripts of Cookie Theft picture descriptions, produced by 54 subjects in the training part and 24 subjects in the test part. The number of narrative samples is 108 in the training set and 48 in the test set. First, the impact of AD on 99 selected lexical features is studied using both the training and testing parts of the dataset. Then some machine learning experiments were conducted on the task of classifying transcribed speech samples with text samples that were produced by people with AD from those produced by normal subjects. Several experiments were conducted to compare the different areas of lexical complexity, identify the subset of features that help achieve optimal performance, and study the impact of the size of the input on the classification. To evaluate the generalization of the models built on narrative speech, two generalization tests were conducted using written data from two British authors, Iris Murdoch and Agatha Christie, and the transcription of some speeches by former President Ronald Reagan. Results: using lexical features only, state-of-the-art classification, F1 and accuracies, of over 91% were achieved in categorizing language samples produced by individuals with AD from the ones produced by healthy control subjects. This confirms the substantial impact of AD on lexicon processing.
    Summary Methods: The study uses data from the ADDreSS challenge, part of the DementiaBank corpus. The dataset consists of transcripts of Cookie Theft picture descriptions produced by 54 subjects in the training set and 24 subjects in the test set, with 108 narrative samples in the training set and 48 in the test set. The study first examines the impact of AD on 99 selected lexical features using both the training and test parts of the dataset. Machine learning experiments are then conducted to distinguish transcribed speech samples produced by people with AD from those produced by normal subjects. The experiments compare different areas of lexical complexity, identify the subset of features that achieves optimal performance, and study the impact of input size on classification. To evaluate how well models built on narrative speech generalize, the study conducts two generalization tests using written data from Iris Murdoch and Agatha Christie and transcriptions of some speeches by former President Ronald Reagan. Results: Using lexical features only, the study achieves state-of-the-art classification, with F1 and accuracy above 91%, in distinguishing language samples produced by individuals with AD from those produced by healthy control subjects. This confirms the substantial impact of AD on lexicon processing.

Facilitating Multi-turn Emotional Support Conversation with Positive Emotion Elicitation: A Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2307.07994
  • repo_url: https://github.com/jfzhouyoo/supporter
  • paper_authors: Jinfeng Zhou, Zhuang Chen, Bo Wang, Minlie Huang
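  • for: Providing emotional support (ES) to improve a person's mental state.
  • methods: A mixture-of-experts-based reinforcement learning model, with emotional support and dialogue coherence rewards that guide policy learning.
  • results: While responding, the Supporter model elicits positive emotion and maintains dialogue coherence.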
    Abstract Emotional support conversation (ESC) aims to provide emotional support (ES) to improve one's mental state. Existing works stay at fitting grounded responses and responding strategies (e.g., question), which ignore the effect on ES and lack explicit goals to guide emotional positive transition. To this end, we introduce a new paradigm to formalize multi-turn ESC as a process of positive emotion elicitation. Addressing this task requires finely adjusting the elicitation intensity in ES as the conversation progresses while maintaining conversational goals like coherence. In this paper, we propose Supporter, a mixture-of-expert-based reinforcement learning model, and well design ES and dialogue coherence rewards to guide policy's learning for responding. Experiments verify the superiority of Supporter in achieving positive emotion elicitation during responding while maintaining conversational goals including coherence.
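    Summary Emotional support conversation (ESC) aims to provide emotional support (ES) to improve a person's mental state. Existing work stops at fitting grounded responses and response strategies (e.g., asking questions), ignoring the effect on ES and lacking explicit goals to guide a positive emotional transition. To address this, we introduce a new paradigm that formalizes multi-turn ESC as a process of positive emotion elicitation. The task requires finely adjusting the elicitation intensity as the conversation progresses while maintaining conversational goals such as coherence. We propose Supporter, a mixture-of-experts-based reinforcement learning model, and design ES and dialogue coherence rewards to guide policy learning. Experiments show that Supporter elicits positive emotion while responding and maintains conversational goals, including coherence.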

A Survey of Techniques for Optimizing Transformer Inference

  • paper_url: http://arxiv.org/abs/2307.07982
  • repo_url: None
  • paper_authors: Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani
  • for: The paper aims to survey optimization techniques for the inference phase of transformer networks, including knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design.
  • methods: The paper surveys a series of algorithm-level optimization techniques, including knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design.
  • results: The paper summarizes the quantitative results of several models and techniques, including the tradeoff between accuracy and computational complexity.
    Abstract Recent years have seen a phenomenal rise in performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of common men. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.
    Summary Recent years have seen a remarkable rise in the performance and applications of transformer neural networks. The transformer family, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT), and Vision Transformer (ViT), has proven effective across Natural Language Processing (NLP) and Computer Vision (CV). However, the pursuit of high predictive performance has led to an exponential increase in transformers' memory and compute footprint. This paper provides a comprehensive survey of techniques for optimizing transformer inference, including knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design. It also reviews hardware-level optimization techniques and the design of novel hardware accelerators for transformers, summarizes quantitative results for several models and techniques to show the tradeoffs they exercise, and outlines future directions. The survey aims to educate both novice and seasoned researchers in this rapidly evolving field, to spark further research, and to serve as a resource for researchers and practitioners interested in optimizing transformer inference.

Model Adaptation for ASR in low-resource Indian Languages

  • paper_url: http://arxiv.org/abs/2307.07948
  • repo_url: None
  • paper_authors: Abhayjeet Singh, Arjun Singh Mehta, Ashish Khuraishi K S, Deekshitha G, Gauri Date, Jai Nanavati, Jesuraja Bandekar, Karnalius Basumatary, Karthika P, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Savitha, Prasanta Kumar Ghosh, Prashanthi V, Priyanka Pai, Raoul Nanavati, Rohan Saxena, Sai Praneeth Reddy Mora, Srinivasa Raghavan
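  • for: This work explores how self-supervised learning and large-scale multilingual training can improve speech recognition performance, particularly for low-resource languages with limited audio and text data.
  • methods: The approaches considered include self-supervised wav2vec2 acoustic models and large-scale multilingual training, together with adaptation and fine-tuning on text and speech data.
  • results: The proposal argues that adaptation and fine-tuning from well-resourced similar languages can offset data scarcity in low-resource languages, and asks how far abundant acoustic data and large text-only corpora can substitute for each other.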
    Abstract Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple dialects like in Indian languages. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure. This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages. In such scenarios, it is important to understand the extent to which each modality, like acoustics and text, is important in building a reliable ASR. It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora. Or, due to the availability of various pretrained acoustic models, the vice-versa could also be true. In this proposed special session, we encourage the community to explore these ideas with the data in two low-resource Indian languages of Bengali and Bhojpuri. These approaches are not limited to Indian languages, the solutions are potentially applicable to various languages spoken around the world.
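    Summary Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly thanks to self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multilingual training such as Whisper. Low-resource languages, where both audio and text are limited, still pose a huge challenge, and the presence of multiple dialects, as in Indian languages, makes training ASR models harder still. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure, so adaptation and fine-tuning techniques that draw on well-resourced similar languages can be applied to offset the lack of data. In this special session, we invite the community to examine how important each modality, acoustics and text, is in building a reliable ASR system. An abundance of acoustic data in a language may reduce the need for large text-only corpora, or, given the availability of various pretrained acoustic models, the reverse may hold. We encourage the community to explore these ideas with data in two low-resource Indian languages, Bengali and Bhojpuri. The approaches are not limited to Indian languages and are potentially applicable to other languages around the world.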

Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling

  • paper_url: http://arxiv.org/abs/2307.07946
  • repo_url: https://github.com/zifengcheng/cdap
  • paper_authors: Zifeng Cheng, Qingyu Zhou, Zhiwei Jiang, Xuemin Zhao, Yunbo Cao, Qing Gu
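  • for: The paper proposes a method for identifying novel classes from only a few labeled samples.
  • methods: The paper proposes the Consistent Dual Adaptive Prototypical (CDAP) network, which contains token-level and span-level networks that are jointly trained at different granularities. It also proposes a consistent loss between the two networks so that they can learn from each other.
  • results: In experiments, the model achieves new state-of-the-art results on three benchmark datasets.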
    Abstract Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples. Existing methods solve the data scarcity problem mainly by designing token-level or span-level labeling models based on metric learning. However, these methods are only trained at a single granularity (i.e., either token level or span level) and have some weaknesses of the corresponding granularity. In this paper, we first unify token and span level supervisions and propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains the token-level and span-level networks, jointly trained at different granularities. To align the outputs of two networks, we further propose a consistent loss to enable them to learn from each other. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probability and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets.
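    Summary Few-shot sequence labeling aims to identify novel classes from only a few labeled samples. Existing methods address data scarcity mainly by designing token-level or span-level labeling models, but they are trained at a single granularity (either token level or span level) and inherit the weaknesses of that granularity. In this paper, we first unify token-level and span-level supervision and propose the Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains token-level and span-level networks that are jointly trained at different granularities. To align the outputs of the two networks, we further propose a consistent loss that lets them learn from each other. At inference time, we propose a consistent greedy inference algorithm that first adjusts the predicted probability and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets.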
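    The greedy selection of non-overlapping spans mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Span` tuple layout and the `greedy_select` function are hypothetical names introduced here, and the probability-adjustment step that precedes selection in the paper is omitted because the abstract does not describe it.

```python
from typing import List, Tuple

# Hypothetical span layout: (start, end_inclusive, label, probability)
Span = Tuple[int, int, str, float]

def greedy_select(spans: List[Span]) -> List[Span]:
    """Visit spans in descending probability and keep each one that
    does not overlap a span that has already been kept."""
    chosen: List[Span] = []
    for span in sorted(spans, key=lambda s: s[3], reverse=True):
        start, end = span[0], span[1]
        # Two spans are disjoint when one ends before the other starts.
        if all(end < c[0] or start > c[1] for c in chosen):
            chosen.append(span)
    # Return the kept spans in reading order.
    return sorted(chosen, key=lambda s: s[0])

candidates = [
    (0, 1, "PER", 0.9),
    (1, 2, "ORG", 0.8),  # overlaps the PER span at token 1
    (3, 3, "LOC", 0.7),
    (2, 3, "ORG", 0.6),  # overlaps the LOC span at token 3
]
selected = greedy_select(candidates)
```

    Sorting once and checking each candidate against the kept spans is quadratic in the worst case, which is usually acceptable for sentence-length inputs.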

Deduplicating and Ranking Solution Programs for Suggesting Reference Solutions

  • paper_url: http://arxiv.org/abs/2307.07940
  • repo_url: None
  • paper_authors: Atsushi Shirafuji, Yutaka Watanobe
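  • for: Helping learners on online judge systems refer to a variety of solution approaches.
  • methods: Near-duplicate solution programs are removed and the remaining unique programs are ranked by duplicate count, which reduces the number of programs a learner needs to consult.
  • results: Experiments show that deduplication reduces the number of programs by 60.20%, whereas the baseline reduces it by only 29.59%, so users need to refer to only 39.80% of programs on average. The top-10 ranked programs cover 29.95% of programs on average.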
    Abstract Referring to the solution programs written by the other users is helpful for learners in programming education. However, current online judge systems just list all solution programs submitted by users for references, and the programs are sorted based on the submission date and time, execution time, or user rating, ignoring to what extent the program can be a reference. In addition, users struggle to refer to a variety of solution approaches since there are too many duplicated and near-duplicated programs. To motivate the learners to refer to various solutions to learn the better solution approaches, in this paper, we propose an approach to deduplicate and rank common solution programs in each programming problem. Based on the hypothesis that the more duplicated programs adopt a more common approach and can be a reference, we remove the near-duplicated solution programs and rank the unique programs based on the duplicate count. The experiments on the solution programs submitted to a real-world online judge system demonstrate that the number of programs is reduced by 60.20%, whereas the baseline only reduces by 29.59% after the deduplication, meaning that the users only need to refer to 39.80% of programs on average. Furthermore, our analysis shows that top-10 ranked programs cover 29.95% of programs on average, indicating that the users can grasp 29.95% of solution approaches by referring to only 10 programs. The proposed approach shows the potential of reducing the learners' burden of referring to too many solutions and motivating them to learn a variety of better approaches.
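    Summary Referring to solution programs submitted by other users is helpful for learners in programming education. However, current online judge systems simply list all submitted solution programs as references, sorted by submission date and time, execution time, or user rating, without considering how useful each program is as a reference. In addition, users struggle to find a variety of solution approaches because there are too many duplicated and near-duplicated programs. To encourage learners to consult diverse solutions and learn better approaches, we propose a method that removes near-duplicate solution programs for each programming problem and ranks the unique programs by duplicate count. Experiments on a real-world online judge system show that the number of solution programs is reduced by 60.20%, whereas the baseline reduces it by only 29.59%, meaning that users need to refer to only 39.80% of programs on average. Our analysis further shows that the top-10 ranked programs cover 29.95% of programs on average, so users can grasp 29.95% of solution approaches by referring to only 10 programs. The approach shows potential for reducing the burden of consulting too many solutions and for motivating learners to study a variety of better approaches.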
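    The deduplicate-then-rank idea can be sketched in a few lines. This is a rough illustration under stated assumptions: the abstract does not say how near-duplicates are detected, so the `normalize` rule below (collapsing whitespace) is a hypothetical stand-in, and `dedup_and_rank` is a name introduced here.

```python
from collections import Counter
from typing import List, Tuple

def normalize(program: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare
    equal. This rule is an assumption, not the paper's detection method."""
    return " ".join(program.split())

def dedup_and_rank(programs: List[str]) -> List[Tuple[str, int]]:
    """Group programs by normalized form and rank the unique programs by
    how many submissions share that form (most duplicated first)."""
    counts = Counter(normalize(p) for p in programs)
    return counts.most_common()

submissions = [
    "print(a+b)",
    "print(a+b)\n",
    "print( sum([a,b]) )",
    "print(a+b)  ",
]
ranked = dedup_and_rank(submissions)

# Share of programs removed by deduplication, in percent.
reduction = 100.0 * (1 - len(ranked) / len(submissions))
```

    The same arithmetic links the reported figures: a 60.20% reduction leaves 100 - 60.20 = 39.80% of programs to consult.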

Communicative Agents for Software Development

  • paper_url: http://arxiv.org/abs/2307.07924
  • repo_url: None
  • paper_authors: Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, Maosong Sun
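  • for: This paper explores a large language model (LLM) based approach to software development that uses natural language communication across the development stages in place of specialized models at each phase.
  • methods: The paper proposes a virtual chat-powered approach to software development in which each stage engages a team of agents, such as programmers, code reviewers, and test engineers. Within each stage, team members collaborate, share ideas, and solve problems through dialogue.
  • results: The entire software development process completes in under seven minutes at a cost of less than one dollar. The approach also identifies and alleviates potential vulnerabilities and rectifies potential hallucinations while remaining efficient and cost-effective.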
    Abstract Software engineering is a domain characterized by intricate decision-making processes, often relying on nuanced intuition and consultation. Recent advancements in deep learning have started to revolutionize software engineering practices through elaborate designs implemented at various stages of software development. In this paper, we present an innovative paradigm that leverages large language models (LLMs) throughout the entire software development process, streamlining and unifying key processes through natural language communication, thereby eliminating the need for specialized models at each phase. At the core of this paradigm lies ChatDev, a virtual chat-powered software development company that mirrors the established waterfall model, meticulously dividing the development process into four distinct chronological stages: designing, coding, testing, and documenting. Each stage engages a team of agents, such as programmers, code reviewers, and test engineers, fostering collaborative dialogue and facilitating a seamless workflow. The chat chain acts as a facilitator, breaking down each stage into atomic subtasks. This enables dual roles, allowing for proposing and validating solutions through context-aware communication, leading to efficient resolution of specific subtasks. The instrumental analysis of ChatDev highlights its remarkable efficacy in software generation, enabling the completion of the entire software development process in under seven minutes at a cost of less than one dollar. It not only identifies and alleviates potential vulnerabilities but also rectifies potential hallucinations while maintaining commendable efficiency and cost-effectiveness. The potential of ChatDev unveils fresh possibilities for integrating LLMs into the realm of software development.
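    Summary Software engineering is a domain characterized by intricate decision-making processes that often rely on nuanced intuition and consultation. Recent advances in deep learning have begun to change software engineering practice. In this paper, we present a paradigm that uses large language models (LLMs) throughout the software development process, unifying key processes through natural language communication and thereby removing the need for specialized models at each phase. At its core is ChatDev, a virtual chat-powered software development company that mirrors the traditional waterfall model and divides development into four stages: designing, coding, testing, and documenting. Each stage engages a team of agents in different roles, such as programmers, code reviewers, and test engineers, who collaborate through dialogue to form a seamless workflow. The chat chain acts as a facilitator, breaking each stage down into atomic subtasks so that agents can propose and validate solutions through context-aware communication and resolve subtasks efficiently. Our analysis shows that ChatDev is remarkably effective at software generation, completing the entire development process in under seven minutes at a cost of less than one dollar. It not only identifies and alleviates potential vulnerabilities but also rectifies potential hallucinations while maintaining efficiency and cost-effectiveness. ChatDev's potential opens up new possibilities for integrating LLMs into software development.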

Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages

  • paper_url: http://arxiv.org/abs/2307.08714
  • repo_url: None
  • paper_authors: Sunisth Kumar, Davide Liu, Alexandre Boulenger
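  • for: This study proposes an efficient framework for cross-lingual named entity recognition in semi-structured text data.
  • methods: The model design is based on knowledge distillation and consistency training. It leverages knowledge from a large language model (XLMRoBERTa) pre-trained on the source language and transfers it through a student-teacher relationship. The student model additionally undergoes unsupervised consistency training (with a KL divergence loss) on the low-resource target language.
  • results: Two independent datasets of English and Arabic SMS messages, each carrying semi-structured banking transaction information, are used to demonstrate knowledge transfer. With only 30 labeled samples, the model generalizes the recognition of merchants, amounts, and other fields from English to Arabic. It performs best overall when compared to state-of-the-art approaches such as DistilBERT pre-trained on the target language or a supervised model trained directly on labeled target-language data.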
    Abstract We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. Our approach relies on both knowledge distillation and consistency training. The modeling framework leverages knowledge from a large language model (XLMRoBERTa) pre-trained on the source language, with a student-teacher relationship (knowledge distillation). The student model incorporates unsupervised consistency training (with KL divergence loss) on the low-resource target language. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information, and focus on exhibiting the transfer of knowledge from English to Arabic. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic. We show that our modeling approach, while efficient, performs best overall when compared to state-of-the-art approaches like DistilBERT pre-trained on the target language or a supervised model directly trained on labeled data in the target language. Our experiments show that it is enough to learn to recognize entities in English to reach reasonable performance in a low-resource language in the presence of a few labeled samples of semi-structured data. The proposed framework has implications for developing multi-lingual applications, especially in geographies where digital endeavors rely on both English and one or more low-resource language(s), sometimes mixed with English or employed singly.
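    Summary We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. Our approach is based on knowledge distillation and consistency training. The framework leverages a large language model (XLMRoBERTa) pre-trained on the source language through a student-teacher relationship (knowledge distillation). The student model incorporates unsupervised consistency training (with a KL divergence loss) on the low-resource target language. We use two independent datasets of SMS messages in English and Arabic, each carrying semi-structured banking transaction information, and focus on the transfer of knowledge from English to Arabic. With access to only 30 labeled samples, our model generalizes the recognition of merchants, amounts, and other fields from English to Arabic. Our approach, while efficient, performs best when compared to state-of-the-art approaches such as DistilBERT pre-trained on the target language or a supervised model trained directly on the target language. Our experiments show that learning to recognize entities in English is enough to reach reasonable performance in a low-resource language, given only a few labeled samples of semi-structured data. The framework has implications for developing multilingual applications, especially in regions where digital activity relies on both English and one or more low-resource languages, sometimes mixed with English.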
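    A KL divergence consistency term of the kind named in the abstract can be sketched as below. This is a minimal sketch under assumptions: it supposes the loss compares the student's predicted label distribution on an unlabeled input with its prediction on a perturbed copy of that input, which is a common setup for consistency training but is not spelled out in the abstract. The function name and the example distributions are hypothetical.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same labels.
    eps guards against log(0) when q assigns zero mass to a label."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical label distributions for one token (e.g. MERCHANT, AMOUNT, OTHER).
original = [0.7, 0.2, 0.1]   # prediction on the unlabeled input
perturbed = [0.6, 0.3, 0.1]  # prediction on a perturbed copy (assumed)

# The consistency loss is small when both predictions agree.
loss = kl_divergence(original, perturbed)
```

    Because the term needs no gold labels, it can be computed on unlabeled target-language text, which is what makes it suitable for the low-resource side of the transfer.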

LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models

  • paper_url: http://arxiv.org/abs/2307.07889
  • repo_url: None
  • paper_authors: Adian Liusie, Potsawee Manakul, Mark J. F. Gales
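  • for: This study explores how large language models (LLMs) can be used for the automated assessment of natural language generation (NLG), and examines the feasibility and advantages of comparative assessment.
  • methods: The study exploits the emergent abilities of LLMs for zero-shot NLG assessment in two ways: absolute score prediction and comparative assessment. Comparative assessment uses relative comparisons between pairs of candidates instead of scoring each candidate independently.
  • results: For moderate-sized open-source LLMs, comparative assessment is superior to prompt scoring and in many cases competitive with state-of-the-art methods. The study also finds that LLMs show positional bias in pairwise comparisons and proposes debiasing methods that further improve performance.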
    Abstract Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. An interesting application of these systems is in the automated assessment of natural language generation (NLG), a highly challenging area with great practical benefit. In this paper, we explore two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment: absolute score prediction, and comparative assessment which uses relative comparisons between pairs of candidates. Though comparative assessment has not been extensively studied in NLG assessment, we note that humans often find it more intuitive to compare two options rather than scoring each one independently. This work examines comparative assessment from multiple perspectives: performance compared to absolute grading; positional biases in the prompt; and efficient ranking in terms of the number of comparisons. We illustrate that LLM comparative assessment is a simple, general and effective approach for NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring, and in many cases can achieve performance competitive with state-of-the-art methods. Additionally, we demonstrate that LLMs often exhibit strong positional biases when making pairwise comparisons, and we propose debiasing methods that can further improve performance.
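    Summary Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. In this paper, we explore two ways of exploiting the emergent abilities of LLMs for zero-shot natural language generation (NLG) assessment: absolute score prediction and comparative assessment. Although comparative assessment has not been studied extensively in NLG assessment, humans often find it more intuitive to compare two options. We examine comparative assessment from several perspectives: performance compared to absolute grading, positional biases in the prompt, and efficient ranking in terms of the number of comparisons. We find that LLM comparative assessment is a simple, general, and effective approach to NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring and in many cases competitive with state-of-the-art methods. We also find that LLMs often exhibit strong positional biases when making pairwise comparisons, and we propose debiasing methods that further improve performance.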
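    Ranking candidates from pairwise comparisons can be sketched as follows. This is an illustrative sketch, not the paper's method: `toy_judge` is a hypothetical stand-in for an LLM comparison prompt, and averaging over both presentation orders is one simple way to reduce positional bias, assumed here because the abstract does not detail the proposed debiasing methods.

```python
from itertools import permutations
from typing import Callable, Dict, List

def rank_by_wins(
    candidates: List[str],
    prefers_first: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each candidate by its average preference probability,
    visiting every pair in both orders so that a bias toward the
    first-shown candidate is spread evenly across candidates."""
    scores = {c: 0.0 for c in candidates}
    for a, b in permutations(candidates, 2):
        p = prefers_first(a, b)  # probability that the first-shown text wins
        scores[a] += p
        scores[b] += 1.0 - p
    # Each candidate takes part in 2 * (n - 1) ordered comparisons.
    n_comparisons = 2 * (len(candidates) - 1)
    return {c: s / n_comparisons for c, s in scores.items()}

def toy_judge(a: str, b: str) -> float:
    """Hypothetical judge: prefers the longer text, with extra probability
    mass for whichever text is shown first (0.9 and 0.3, not 0.8 and 0.2)."""
    return 0.9 if len(a) > len(b) else 0.3

scores = rank_by_wins(["a", "bb", "ccc"], toy_judge)
ranking = sorted(scores, key=scores.get, reverse=True)
```

    Comparing all ordered pairs costs n(n - 1) judge calls, which is why the abstract also looks at ranking efficiency in terms of the number of comparisons.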

Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding

  • paper_url: http://arxiv.org/abs/2307.07880
  • repo_url: https://github.com/boleima/profit
  • paper_authors: Bolei Ma, Ercong Nie, Helmut Schmid, Hinrich Schütze
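  • for: This study investigates the cross-lingual capabilities of prompt-based finetuning in multilingual pretrained language models (MPLMs).
  • methods: The study applies prompt-based finetuning and evaluates it on a wide range of target languages.
  • results: Prompt-based finetuning proves effective and versatile in cross-lingual language understanding tasks, with different performance patterns across few-shot and full-data settings.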
    Abstract Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the ProFiT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.
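    Summary Multilingual pretrained language models (MPLMs) have shown substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks, by finetuning on task-specific labelled data in a source language (e.g. English) and evaluating on many target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios, yet the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the ProFiT pipeline to investigate the cross-lingual capabilities of prompt-based finetuning. We conduct extensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and analyze in detail how prompt-based finetuning performance varies in cross-lingual transfer. Our results show that prompt-based finetuning is effective and versatile in cross-lingual language understanding. It outperforms vanilla finetuning in full-data scenarios and shows greater advantages in few-shot scenarios. We also analyze how factors such as language similarity and pretraining data size affect its cross-lingual performance. Overall, our work provides valuable insights into the cross-lingual capabilities of prompt-based finetuning.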

CIDER: Context sensitive sentiment analysis for short-form text

  • paper_url: http://arxiv.org/abs/2307.07864
  • repo_url: https://github.com/jcy204/ciderpolarity
  • paper_authors: James C. Young, Rudy Arthur, Hywel T. P. Williams
  • for: Context-sensitive sentiment analysis of large collections of short texts, such as tweets, that focus on a specific topic, theme, or event.
  • methods: The paper presents a new approach called CIDER (Context Informed Dictionary and sEntiment Reasoner), which performs context-sensitive sentiment analysis by inferring the valence of sentiment-laden terms from the whole corpus before scoring individual texts.
  • results: The paper demonstrates that CIDER outperforms state-of-the-art generalist sentiment analysis on a large collection of tweets about the weather.
    Abstract Researchers commonly perform sentiment analysis on large collections of short texts like tweets, Reddit posts or newspaper headlines that are all focused on a specific topic, theme or event. Usually, general purpose sentiment analysis methods are used which perform well on average but miss the variation in meaning that happens across different contexts, for example, the word "active" has a very different intention and valence in the phrase "active lifestyle" versus "active volcano". This work presents a new approach, CIDER (Context Informed Dictionary and sEntiment Reasoner), which performs context sensitive sentiment analysis, where the valence of sentiment laden terms is inferred from the whole corpus before being used to score the individual texts. In this paper we detail the CIDER algorithm and demonstrate that it outperforms state-of-the-art generalist sentiment analysis on a large collection of tweets about the weather. We have made our implementation of CIDER available as a python package: https://pypi.org/project/ciderpolarity/.

Transformers are Universal Predictors

  • paper_url: http://arxiv.org/abs/2307.07843
  • repo_url: https://github.com/danderfer/Comp_Sci_Sem_2
  • paper_authors: Sourya Basu, Moulik Choraria, Lav R. Varshney
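  • for: This paper examines the limits of the Transformer architecture for language modeling and its universal prediction property in an information-theoretic sense.
  • methods: The paper uses information theory to analyze the Transformer architecture in non-asymptotic data regimes and studies its individual components to understand their role in data-efficient training.
  • results: Experiments on both synthetic and real datasets validate the theoretical analysis.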
    Abstract We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
    Summary We find limits to the Transformer architecture for language modeling and show that it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the role of its various components, especially in data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.