cs.CL - 2023-11-07

Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models

  • paper_url: http://arxiv.org/abs/2311.04378
  • repo_url: https://github.com/hlzhang109/impossibility-watermark
  • paper_authors: Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak
  • for: The paper studies the (im)possibility of strong watermarking schemes for generative models.
  • methods: The paper introduces a generic, efficient watermark attack that rests on two assumptions: the attacker has access to a "quality oracle" and a "perturbation oracle".
  • results: The paper proves that, under well-specified and natural assumptions, strong watermarking is impossible to achieve, even in the private detection algorithm setting. The attack successfully removes the watermarks planted by three existing watermarking schemes for large language models, with only minor quality degradation.
    Abstract Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
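The attack implied by the two oracles can be sketched as a quality-constrained random walk: repeatedly perturb the output, keeping only moves the quality oracle accepts. The following is an illustrative sketch, not the authors' implementation; the toy oracles (adjacent-token swaps judged by a trivial quality check) are hypothetical stand-ins.

```python
import random

def remove_watermark(text, quality_oracle, perturbation_oracle, steps=200, seed=0):
    """Quality-constrained random walk: repeatedly perturb the output and keep
    only moves the quality oracle accepts. If the walk mixes over high-quality
    outputs, the watermark signal is eventually destroyed."""
    rng = random.Random(seed)
    current = text
    for _ in range(steps):
        candidate = perturbation_oracle(current, rng)
        if quality_oracle(candidate):  # accept only quality-preserving moves
            current = candidate
    return current

# Hypothetical toy instantiation: the output is a token list, a perturbation
# swaps two adjacent tokens, and "quality" is a trivial non-emptiness check.
def toy_perturb(tokens, rng):
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def toy_quality(tokens):
    return len(tokens) > 0
```

In the paper's setting the quality oracle would compare a candidate against the prompt, and the perturbation oracle would be a weaker language model making local edits; the loop structure is the same.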

Evaluating multiple large language models in pediatric ophthalmology

  • paper_url: http://arxiv.org/abs/2311.04368
  • repo_url: None
  • paper_authors: Jason Holmes, Rui Peng, Yiwei Li, Jinyu Hu, Zhengliang Liu, Zihao Wu, Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu, Yi Shao
  • for: The paper evaluates the performance of large language models (LLMs) in pediatric ophthalmology consultations and compares them with medical students and physicians at different levels.
  • methods: The study uses a 100-question exam based on pediatric ophthalmology to assess three LLMs (ChatGPT (GPT-3.5), GPT-4, and PaLM2) and three human cohorts (medical students, postgraduate students, and attending physicians).
  • results: GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and PaLM2 outperformed medical students but slightly trailed behind postgraduate students. GPT-4 also exhibited greater stability and confidence when responding to inquiries than ChatGPT (GPT-3.5) and PaLM2.
    Abstract IMPORTANCE The response effectiveness of different large language models (LLMs) and various individuals, including medical students, graduate students, and practicing physicians, in pediatric ophthalmology consultations, has not been clearly established yet. OBJECTIVE Design a 100-question exam based on pediatric ophthalmology to evaluate the performance of LLMs in highly specialized scenarios and compare them with the performance of medical students and physicians at different levels. DESIGN, SETTING, AND PARTICIPANTS This survey study assessed three LLMs, namely ChatGPT (GPT-3.5), GPT-4, and PaLM2, alongside three human cohorts: medical students, postgraduate students, and attending physicians, in their ability to answer questions related to pediatric ophthalmology. It was conducted by administering questionnaires in the form of test papers through the LLM network interface, with the valuable participation of volunteers. MAIN OUTCOMES AND MEASURES Mean scores of LLMs and humans on 100 multiple-choice questions, as well as the answer stability, correlation, and response confidence of each LLM. RESULTS GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and PaLM2 outperformed medical students but slightly trailed behind postgraduate students. Furthermore, GPT-4 exhibited greater stability and confidence when responding to inquiries compared to ChatGPT (GPT-3.5) and PaLM2. CONCLUSIONS AND RELEVANCE Our results underscore the potential for LLMs to provide medical assistance in pediatric ophthalmology and suggest significant capacity to guide the education of medical students.

Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments

  • paper_url: http://arxiv.org/abs/2311.04364
  • repo_url: https://github.com/hlr/syntax-guided-transformers
  • paper_authors: Danial Kamali, Parisa Kordjamshidi
  • for: This paper aims to improve the ability of intelligent models to generalize to novel compositions in multimodal environments, by leveraging syntactic structure and attention masking techniques.
  • methods: The paper introduces and evaluates the effectiveness of using syntactic information in the multimodal grounding problem, specifically through dependency parsing and Weight Sharing across the Transformer encoder.
  • results: The results show that incorporating syntactic information into the grounding process leads to improved performance on diverse tasks, pushing the state-of-the-art in multimodal grounding and parameter-efficient modeling.
    Abstract Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language to boost compositional generalization. This paper elevates the importance of syntactic grounding, particularly through attention masking techniques derived from text input parsing. We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization underscore the positive impact of dependency parsing across diverse tasks when utilized with Weight Sharing across the Transformer encoder. The results push the state-of-the-art in multimodal grounding and parameter-efficient modeling and provide insights for future research.

Uncovering Causal Variables in Transformers using Circuit Probing

  • paper_url: http://arxiv.org/abs/2311.04354
  • repo_url: https://github.com/mlepori1/circuit_probing
  • paper_authors: Michael A. Lepori, Thomas Serre, Ellie Pavlick
  • for: This paper aims to interpret the algorithms implemented by neural network models, in order to better understand how these models perform their computations.
  • methods: The paper proposes a new analysis technique, circuit probing, which automatically uncovers low-level circuits that compute hypothesized intermediate variables and enables causal analysis through targeted ablation.
  • results: Applying circuit probing, the authors decipher the algorithms learned by models trained on simple arithmetic tasks, reveal modular structure within models, and track the development of circuits over training, demonstrating the effectiveness of the technique in practice.
    Abstract Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. In order to understand these algorithms, it is often necessary to hypothesize intermediate variables involved in the network's computation. For example, does a language model depend on particular syntactic properties when generating a sentence? However, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique -- circuit probing -- that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. We compare circuit probing to other methods across these three experiments, and find it on par or more effective than existing analysis methods. Finally, we demonstrate circuit probing on a real-world use case, uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and Medium.
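The "causal analysis through targeted ablation at the level of model parameters" can be illustrated with a tiny model: zero out a hypothesized circuit (a subset of weights) and measure how much the output changes. This is a hedged sketch with hypothetical names, not the authors' implementation.

```python
# Toy one-layer linear model: y[j] = sum_i W[j][i] * x[i].
def forward(weights, x):
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]

def ablate(weights, circuit):
    """Zero out the weight coordinates (j, i) belonging to the circuit."""
    return [[0.0 if (j, i) in circuit else w
             for i, w in enumerate(row)] for j, row in enumerate(weights)]

def causal_effect(weights, circuit, x):
    """L1 distance between normal and ablated outputs: a crude measure of
    how much the hypothesized circuit contributes on input x."""
    y = forward(weights, x)
    y_ablated = forward(ablate(weights, circuit), x)
    return sum(abs(a - b) for a, b in zip(y, y_ablated))
```

A circuit whose ablation leaves the output unchanged on the relevant inputs is not causally implicated in computing the hypothesized variable.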

Formal Aspects of Language Modeling

  • paper_url: http://arxiv.org/abs/2311.04329
  • repo_url: https://github.com/Gninos/CIM-With-Transition-Systems
  • paper_authors: Ryan Cotterell, Anej Svete, Clara Meister, Tianyu Liu, Li Du
  • for: These notes explore the mathematical foundations of large language models and how to implement them.
  • methods: The notes take a formal, theoretical approach to defining what constitutes a language model and how one can be implemented.
  • results: The notes provide a theoretical account that helps developers and researchers understand the mathematical foundations of large language models.
    Abstract Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. Consequently, it is important for both developers and researchers alike to understand the mathematical foundations of large language models, as well as how to implement them. These notes are the accompaniment to the theoretical portion of the ETH Z\"urich course on large language models, covering what constitutes a language model from a formal, theoretical perspective.

Aspect-based Meeting Transcript Summarization: A Two-Stage Approach with Weak Supervision on Sentence Classification

  • paper_url: http://arxiv.org/abs/2311.04292
  • repo_url: None
  • paper_authors: Zhongfen Deng, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Quan Hung Tran, Shuaiqi Liu, Wenting Zhao, Tao Zhang, Yibo Wang, Philip S. Yu
  • for: The paper addresses aspect-based meeting transcript summarization, producing multiple summaries that each focus on one aspect of a meeting.
  • methods: A two-stage approach: a sentence classifier, trained with pseudo-labeling on a dataset constructed from the AMI corpus, selects the sentences relevant to a given aspect, and a summarizer then generates the aspect-based summary from the selected sentences.
  • results: Experimental results on the AMI corpus outperform many strong baselines.
    Abstract Aspect-based meeting transcript summarization aims to produce multiple summaries, each focusing on one aspect of content in a meeting transcript. It is challenging as sentences related to different aspects can mingle together, and those relevant to a specific aspect can be scattered throughout the long transcript of a meeting. The traditional summarization methods produce one summary mixing information of all aspects, which cannot deal with the above challenges of aspect-based meeting transcript summarization. In this paper, we propose a two-stage method for aspect-based meeting transcript summarization. To select the input content related to specific aspects, we train a sentence classifier on a dataset constructed from the AMI corpus with pseudo-labeling. Then we merge the sentences selected for a specific aspect as the input for the summarizer to produce the aspect-based summary. Experimental results on the AMI corpus outperform many strong baselines, which verifies the effectiveness of our proposed method.
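The two-stage pipeline can be sketched with toy stand-ins: a keyword matcher plays the role of the trained sentence classifier, and simple truncation plays the role of the summarizer. The real system trains the classifier on pseudo-labels derived from the AMI corpus; everything below is a hypothetical illustration.

```python
def classify_sentences(sentences, aspect_keywords):
    """Stage 1 stand-in: select the sentences relevant to the target aspect."""
    return [s for s in sentences
            if any(kw in s.lower() for kw in aspect_keywords)]

def summarize(sentences, max_sentences=2):
    """Stage 2 stand-in: a real system would run an abstractive summarizer
    over the merged aspect-relevant sentences."""
    return " ".join(sentences[:max_sentences])

def aspect_summary(transcript, aspect_keywords):
    relevant = classify_sentences(transcript, aspect_keywords)
    return summarize(relevant)
```

The key design point is that filtering happens before summarization, so aspect-irrelevant sentences scattered through a long transcript never reach the summarizer.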

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study

  • paper_url: http://arxiv.org/abs/2311.04199
  • repo_url: None
  • paper_authors: Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, Jaeboum Kim
  • for: This study explores the potential of Large Multimodal Models (LMMs) for recommendation tasks and evaluates the recommendation capabilities of GPT-4V across multiple domains.
  • methods: The study applies GPT-4V to recommendation tasks, using qualitative test samples spanning multiple domains to assess the quality of its responses.
  • results: GPT-4V performs well on recommendation tasks across multiple domains and can produce diverse responses; however, the study also finds that GPT-4V tends to give similar responses when given similar inputs.
    Abstract Large Multimodal Models (LMMs) have demonstrated impressive performance across various vision and language tasks, yet their potential applications in recommendation tasks with visual assistance remain unexplored. To bridge this gap, we present a preliminary case study investigating the recommendation capabilities of GPT-4V(ison), a recently released LMM by OpenAI. We construct a series of qualitative test samples spanning multiple domains and employ these samples to assess the quality of GPT-4V's responses within recommendation scenarios. Evaluation results on these test samples prove that GPT-4V has remarkable zero-shot recommendation abilities across diverse domains, thanks to its robust visual-text comprehension capabilities and extensive general knowledge. However, we have also identified some limitations in using GPT-4V for recommendations, including a tendency to provide similar responses when given similar inputs. This report concludes with an in-depth discussion of the challenges and research opportunities associated with utilizing GPT-4V in recommendation scenarios. Our objective is to explore the potential of extending LMMs from vision and language tasks to recommendation tasks. We hope to inspire further research into next-generation multimodal generative recommendation models, which can enhance user experiences by offering greater diversity and interactivity. All images and prompts used in this report will be accessible at https://github.com/PALIN2018/Evaluate_GPT-4V_Rec.

JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models

  • paper_url: http://arxiv.org/abs/2311.04192
  • repo_url: https://github.com/keio-smilab23/JaSPICE
  • paper_authors: Yuiga Wada, Kanta Kaneda, Komei Sugiura
  • for: This study proposes an automatic evaluation metric for Japanese captions, improving the correlation with human evaluation over existing automatic metrics.
  • methods: The method generates a scene graph from dependencies and the predicate-argument structure, and extends the graph using synonyms.
  • results: Experimental results show that the proposed metric achieves a higher correlation coefficient with human evaluation than the baseline metrics.
    Abstract Image captioning studies heavily rely on automatic evaluation metrics such as BLEU and METEOR. However, such n-gram-based metrics have been shown to correlate poorly with human evaluation, leading to the proposal of alternative metrics such as SPICE for English; however, no equivalent metrics have been established for other languages. Therefore, in this study, we propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs. The proposed method generates a scene graph from dependencies and the predicate-argument structure, and extends the graph using synonyms. We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations. The results showed that our metric outperformed the baseline metrics for the correlation coefficient with the human evaluation.
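Scene-graph metrics in the SPICE family ultimately score a candidate caption by matching its scene-graph tuples against reference tuples with an F-score. A minimal sketch of that final matching step (the real metric builds the graphs from dependencies and predicate-argument structures and extends them with synonyms):

```python
def tuple_f_score(candidate_tuples, reference_tuples):
    """F1 over exact scene-graph tuple matches, SPICE-style."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Tuples are typically unary (objects), binary (attributes), or ternary (relations), e.g. `("girl",)`, `("girl", "young")`, `("girl", "run", "park")`.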

SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish

  • paper_url: http://arxiv.org/abs/2311.04189
  • repo_url: None
  • paper_authors: Yevhen Kostiuk, Grigori Sidorov, Olga Kolesnikova
  • for: The paper provides a large annotated dataset of the most frequent Spanish verb-noun collocations, together with the sentences in which they occur, to support hierarchical classification of lexical functions.
  • methods: The dataset was created by dependency tree parsing and matching of phrases in Spanish news.
  • results: The paper provides a tree-based structure of 37 lexical-function classes for hierarchically classifying Spanish verb-noun collocations, along with baselines and data splits for each classification objective.
    Abstract In natural language processing (NLP), lexical function is a concept to unambiguously represent semantic and syntactic features of words and phrases in text first crafted in the Meaning-Text Theory. Hierarchical classification of lexical functions involves organizing these features into a tree-like hierarchy of categories or labels. This is a challenging task as it requires a good understanding of the context and the relationships among words and phrases in text. It also needs large amounts of labeled data to train language models effectively. In this paper, we present a dataset of most frequent Spanish verb-noun collocations and sentences where they occur, each collocation is assigned to one of 37 lexical functions defined as classes for a hierarchical classification task. Each class represents a relation between the noun and the verb in a collocation involving their semantic and syntactic features. We combine the classes in a tree-based structure, and introduce classification objectives for each level of the structure. The dataset was created by dependency tree parsing and matching of the phrases in Spanish news. We provide baselines and data splits for each objective.
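A tree-based label structure induces one classification objective per level: each leaf class maps to its path of ancestors, and a classifier is trained per depth. A minimal sketch of that mapping; the class names below are hypothetical placeholders, not the paper's actual 37 lexical functions.

```python
# Hypothetical fragment of a label tree, child -> parent.
PARENT = {
    "Oper1": "support-verb",
    "Func0": "support-verb",
    "support-verb": "verbal",
    "Magn": "intensifier",
    "intensifier": "adjectival",
    "verbal": "root",
    "adjectival": "root",
}

def label_path(leaf):
    """Return the root-to-leaf label path, one label per tree level;
    level-k classifiers are trained on the k-th entry of these paths."""
    path = [leaf]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return list(reversed(path))
```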

Perturbed examples reveal invariances shared by language models

  • paper_url: http://arxiv.org/abs/2311.04166
  • repo_url: None
  • paper_authors: Ruchit Rawal, Mariya Toneva
  • for: This work compares two natural language processing models by revealing the invariances they share under interpretable input perturbations.
  • methods: The paper proposes a new comparison framework based on perturbations that target specific linguistic capabilities (e.g., Synonym-Invariance, Typo-Invariance). Experiments show the framework can evaluate shared invariances both across models from different architecture families and between models available only as commercial black-box APIs (e.g., the InstructGPT family) and better-understood models.
  • results: Large language models share many of the invariances encoded by models of various sizes, whereas the invariances encoded by large models are only shared by other large models. Possessing a wide variety of invariances may be a key factor in the recent successes of large language models, and the framework can shed light on the invariances retained by or emerging in new models.
    Abstract An explosion of work in language is leading to ever-increasing numbers of available natural language processing models, with little understanding of how new models compare to better-understood models. One major reason for this difficulty is saturating benchmark datasets, which may not reflect well differences in model performance in the wild. In this work, we propose a novel framework for comparing two natural language processing models by revealing their shared invariance to interpretable input perturbations that are designed to target a specific linguistic capability (e.g., Synonym-Invariance, Typo-Invariance). Via experiments on models from within the same and across different architecture families, this framework offers a number of insights about how changes in models (e.g., distillation, increase in size, amount of pre-training) affect multiple well-defined linguistic capabilities. Furthermore, we also demonstrate how our framework can enable evaluation of the invariances shared between models that are available as commercial black-box APIs (e.g., InstructGPT family) and models that are relatively better understood (e.g., GPT-2). Across several experiments, we observe that large language models share many of the invariances encoded by models of various sizes, whereas the invariances encoded by large language models are only shared by other large models. Possessing a wide variety of invariances may be a key reason for the recent successes of large language models, and our framework can shed light on the types of invariances that are retained by or emerge in new models.
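The core measurement can be sketched as follows: apply a capability-targeted perturbation (here, a toy synonym swap) and count how often both models' predictions are simultaneously unchanged. This is an illustrative sketch under assumed interfaces; the models below are stand-in functions, not the paper's setup.

```python
# Hypothetical synonym table for a Synonym-Invariance-style perturbation.
SYNONYMS = {"movie": "film", "great": "excellent"}

def synonym_perturb(text):
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def shared_invariance(model_a, model_b, inputs, perturb):
    """Fraction of inputs on which BOTH models give the same prediction
    before and after the perturbation."""
    both_invariant = 0
    for x in inputs:
        x_p = perturb(x)
        if model_a(x) == model_a(x_p) and model_b(x) == model_b(x_p):
            both_invariant += 1
    return both_invariant / len(inputs)
```

Because the measurement needs only model outputs, it applies equally to black-box APIs and to models with open weights.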

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

  • paper_url: http://arxiv.org/abs/2311.04155
  • repo_url: https://github.com/thu-coai/bpo
  • paper_authors: Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
  • for: To make large language models (LLMs) better follow user instructions without updating the LLMs' parameters.
  • methods: Black-Box Prompt Optimization (BPO), which optimizes user prompts to suit the LLM's input understanding so as to best realize the user's intent.
  • results: BPO increases ChatGPT's win rate against its original version by 22%, and GPT-4's by 10%. BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and combining BPO with PPO or DPO brings additional performance gains.
    Abstract Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them, that is, the alignment problem. To make LLMs better follow user instructions, existing alignment methods mostly focus on further training them. However, the extra training of LLMs are usually expensive in terms of GPU compute; worse still, LLMs of interest are oftentimes not accessible for user-demanded training, such as GPTs. In this work, we take a different perspective -- Black-Box Prompt Optimization (BPO) -- to perform alignments. The idea is to optimize user prompts to suit LLMs' input understanding, so as to best realize users' intents without updating LLMs' parameters. BPO is model-agnostic and the empirical results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the win rate against its original version, and 10% for GPT-4. Importantly, the BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and it also brings additional performance gains when combining BPO with PPO or DPO. Code and datasets are released at https://github.com/thu-coai/BPO.
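The black-box idea can be sketched as a search over prompt rewrites scored by a preference judge, with the LLM's parameters never touched. This is an illustrative loop, not the paper's method (BPO trains a dedicated prompt-rewriting model); the toy LLM, judge, and rewrites below are hypothetical.

```python
def optimize_prompt(prompt, rewrite_candidates, llm, judge):
    """Keep the rewrite whose black-box response the judge scores highest."""
    best_prompt, best_score = prompt, judge(llm(prompt))
    for candidate in rewrite_candidates(prompt):
        score = judge(llm(candidate))
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt

# Toy instantiation: the "LLM" echoes the prompt, the judge prefers longer,
# more specific responses, and rewrites append clarifying instructions.
def toy_rewrites(prompt):
    return [prompt + " Answer step by step.", prompt + " Be concise."]
```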

What is Lost in Knowledge Distillation?

  • paper_url: http://arxiv.org/abs/2311.04142
  • repo_url: None
  • paper_authors: Manas Mohanty, Tanya Roosta, Peyman Passban
  • for: The paper investigates how knowledge distillation (KD) affects compressed models: whether the compression process loses information, and whether any loss follows a specific pattern.
  • methods: The paper compresses models with knowledge distillation and compares the distilled student models with their teachers to examine whether the distillation process causes information losses.
  • results: The compression process can lose information, with the extent of the loss depending on factors such as the number of layers or attention heads; some tasks appear more sensitive to KD than others.
    Abstract Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks could be costly. Model compression techniques, such as, knowledge distillation (KD), have been proposed to address the issue; however, the compression process could be lossy. Motivated by this, our work investigates how a distilled student model differs from its teacher, if the distillation process causes any information losses, and if the loss follows a specific pattern. Our experiments aim to shed light on the type of tasks might be less or more sensitive to KD by reporting data points on the contribution of different factors, such as the number of layers or attention heads. Results such as ours could be utilized when determining effective and efficient configurations to achieve optimal information transfers between larger (teacher) and smaller (student) models.
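For context, the standard distillation objective has the student match the teacher's temperature-softened output distribution via KL divergence (Hinton et al.'s formulation). A minimal pure-Python sketch of that loss, not tied to this paper's specific setup:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients are comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

Because the student only sees the teacher's softened distribution, any information the teacher encodes elsewhere (e.g., in intermediate representations) can be lost, which is the kind of loss the paper sets out to characterize.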

Modelling Sentiment Analysis: LLMs and data augmentation techniques

  • paper_url: http://arxiv.org/abs/2311.04139
  • repo_url: None
  • paper_authors: Guillem Senabre Prades
  • for: This paper addresses binary sentiment classification on a small training dataset, aiming for high accuracy despite the limited training data.
  • methods: The paper uses LLMs that have provided state-of-the-art results in sentiment analysis and similar domains, such as BERT, RoBERTa, and XLNet.
  • results: The results indicate that these LLMs can achieve strong accuracy even with a small training set, improving sentiment analysis in this setting.
    Abstract This paper provides different approaches for a binary sentiment classification on a small training dataset. LLMs that provided state-of-the-art results in sentiment analysis and similar domains are being used, such as BERT, RoBERTa and XLNet.

Personality Style Recognition via Machine Learning: Identifying Anaclitic and Introjective Personality Styles from Patients’ Speech

  • paper_url: http://arxiv.org/abs/2311.04088
  • repo_url: None
  • paper_authors: Semere Kiros Bitew, Vincent Schelstraete, Klim Zaporojets, Kimberly Van Nieuwenhove, Reitske Meganck, Chris Develder
  • for: The paper investigates whether personality types can be automatically inferred from speech utterances, with the goal of improving the accuracy of personality classification in psychopathology.
  • methods: The authors apply natural language processing (NLP) techniques and machine learning algorithms to clinical diagnostic interviews (CDI) recorded from a sample of 79 patients diagnosed with major depressive disorder (MDD). They explore the linguistic features associated with each personality style and develop automatic classifiers based on standardized questionnaire responses, basic text features, advanced text features using LIWC, and audio features.
  • results: Automated classification with language-derived features (based on LIWC) significantly outperforms questionnaire-based classification models. The best performance is achieved by combining LIWC with the questionnaire features, suggesting that more work should be put into developing linguistically based automated techniques for characterizing personality, while questionnaires still have some complementary value.
    Abstract In disentangling the heterogeneity observed in psychopathology, personality of the patients is considered crucial. While it has been demonstrated that personality traits are reflected in the language used by a patient, we hypothesize that this enables automatic inference of the personality type directly from speech utterances, potentially more accurately than through a traditional questionnaire-based approach explicitly designed for personality classification. To validate this hypothesis, we adopt natural language processing (NLP) and standard machine learning tools for classification. We test this on a dataset of recorded clinical diagnostic interviews (CDI) on a sample of 79 patients diagnosed with major depressive disorder (MDD) -- a condition for which differentiated treatment based on personality styles has been advocated -- and classified into anaclitic and introjective personality styles. We start by analyzing the interviews to see which linguistic features are associated with each style, in order to gain a better understanding of the styles. Then, we develop automatic classifiers based on (a) standardized questionnaire responses; (b) basic text features, i.e., TF-IDF scores of words and word sequences; (c) more advanced text features, using LIWC (linguistic inquiry and word count) and context-aware features using BERT (bidirectional encoder representations from transformers); (d) audio features. We find that automated classification with language-derived features (i.e., based on LIWC) significantly outperforms questionnaire-based classification models. Furthermore, the best performance is achieved by combining LIWC with the questionnaire features. This suggests that more work should be put into developing linguistically based automated techniques for characterizing personality, however questionnaires still to some extent complement such methods.
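The basic text features in (b) are TF-IDF scores over words and word sequences. As a rough illustration only (not the authors' implementation, which presumably relies on standard toolkit vectorizers), a minimal smoothed TF-IDF over whitespace tokens can be sketched as:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF with smoothed IDF, as one might compute basic
    text features for one document per speaker."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency counts each doc once
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in df}
    return [{w: count / len(tokens) * idf[w]
             for w, count in Counter(tokens).items()}
            for tokens in tokenized]
```

A downstream classifier (e.g., logistic regression over these sparse vectors) would then separate the anaclitic and introjective styles.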

Do LLMs exhibit human-like response biases? A case study in survey design

  • paper_url: http://arxiv.org/abs/2311.04076
  • repo_url: https://github.com/lindiatjuatja/biasmonkey
  • paper_authors: Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
  • for: This paper investigates whether large language models (LLMs) can approximate human opinions, and whether LLMs are affected by how questions are worded.
  • methods: The authors use survey design as a case study of human response biases and propose a framework to evaluate whether LLMs exhibit human-like response biases.
  • results: The study finds that popular open-source and commercial LLMs generally fail to mirror human-like behavior, and that these inconsistencies are more pronounced in instruction fine-tuned models. Moreover, even when a model shifts in the same direction as humans, perturbations not intended to elicit significant changes in humans can produce similar shifts, suggesting such results may be partly due to spurious correlations. These findings highlight the potential pitfalls of substituting LLMs for humans in parts of the annotation pipeline and underscore the need for finer-grained characterizations of model behavior.
    Abstract As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely-cited barrier to the adoption of LLMs is their sensitivity to prompt wording -- but interestingly, humans also display sensitivities to instruction changes in the form of response biases. As such, we argue that if LLMs are going to be used to approximate human opinions, it is necessary to investigate the extent to which LLMs also reflect human response biases, if at all. In this work, we use survey design as a case study, where human response biases caused by permutations in wordings of ``prompts'' have been extensively studied. Drawing from prior work in social psychology, we design a dataset and propose a framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior. These inconsistencies tend to be more prominent in models that have been instruction fine-tuned. Furthermore, even if a model shows a significant change in the same direction as humans, we find that perturbations that are not meant to elicit significant changes in humans may also result in a similar change, suggesting that such a result could be partially due to other spurious correlations. These results highlight the potential pitfalls of using LLMs to substitute humans in parts of the annotation pipeline, and further underscore the importance of finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/BiasMonkey

Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space

  • paper_url: http://arxiv.org/abs/2311.04260
  • repo_url: None
  • paper_authors: Motonari Kambara, Komei Sugiura
  • for: This work aims to develop a framework that enables a robot to execute Fetch-and-Carry with Object Grounding (FCOG) tasks based on visual information, in response to natural language instructions.
  • methods: The authors propose a framework that fully automates the generation, execution, and evaluation of FCOG tasks, and additionally introduce an approach that decomposes FCOG tasks into four distinct subtasks.
  • results: The framework can automatically generate, execute, and evaluate FCOG tasks across different environments, suggesting that the proposed approach helps robots perform FCOG tasks more effectively.
    Abstract This paper aims to develop a framework that enables a robot to execute tasks based on visual information, in response to natural language instructions for Fetch-and-Carry with Object Grounding (FCOG) tasks. Although there have been many frameworks, they usually rely on manually given instruction sentences. Therefore, evaluations have only been conducted with fixed tasks. Furthermore, many multimodal language understanding models for the benchmarks only consider discrete actions. To address the limitations, we propose a framework for the full automation of the generation, execution, and evaluation of FCOG tasks. In addition, we introduce an approach to solving the FCOG tasks by dividing them into four distinct subtasks.

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

  • paper_url: http://arxiv.org/abs/2311.04072
  • repo_url: None
  • paper_authors: Geyang Guo, Ranchi Zhao, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen
  • for: Improving the alignment of large language models (LLMs) with human preference.
  • methods: An improved alignment approach built on supervised fine-tuning (SFT) that incorporates fine-grained (token- or phrase-level) quality signals.
  • results: Comparisons against a number of competitive baselines show that the proposed approach helps LLMs learn alignment more effectively.
    Abstract Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully understand what are the expected behaviors. To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.
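The abstract does not spell out FIGA's loss function, but one plausible minimal sketch of the underlying idea (a negative log-likelihood where each token's contribution is scaled by a fine-grained quality signal derived from contrasting good and bad responses; all input names here are hypothetical) is:

```python
import math

def token_weighted_nll(logits, targets, weights):
    """NLL where each token's loss is scaled by a per-token quality
    weight. `logits` is a list of per-token logit rows, `targets` the
    gold token ids, `weights` the fine-grained quality signals."""
    total, weight_sum = 0.0, 0.0
    for row, gold, w in zip(logits, targets, weights):
        m = max(row)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += w * (log_z - row[gold])  # w * per-token NLL
        weight_sum += w
    return total / weight_sum
```

With all weights equal this reduces to ordinary SFT imitation; unequal weights let the revised-response tokens dominate the gradient.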

Implementation and Comparison of Methods to Extract Reliability KPIs out of Textual Wind Turbine Maintenance Work Orders

  • paper_url: http://arxiv.org/abs/2311.04064
  • repo_url: None
  • paper_authors: Marc-Alexander Lutz, Bastian Schäfermeier, Rachael Sexton, Michael Sharp, Alden Dima, Stefan Faulstich, Jagan Mohini Aluri
  • for: This paper aims to optimize wind turbine operation and maintenance by structuring and analyzing the information contained in maintenance work orders, so that reliability key performance indicators can be computed.
  • methods: Three approaches to calculating reliability key performance indicators are presented: manual labeling by domain experts, automatic labeling via text classification, and AI-assisted tagging.
  • results: All three methods make extracting maintenance information more efficient and enable the assessment of reliability key performance indicators, thereby supporting the optimization of wind turbine operation and maintenance. The results of the manual labeling approach serve as the benchmark against which the other two approaches are compared, with quality and time spent as evaluation criteria.
    Abstract Maintenance work orders are commonly used to document information about wind turbine operation and maintenance. This includes details about proactive and reactive wind turbine downtimes, such as preventative and corrective maintenance. However, the information contained in maintenance work orders is often unstructured and difficult to analyze, making it challenging for decision-makers to use this information for optimizing operation and maintenance. To address this issue, this work presents three different approaches to calculate reliability key performance indicators from maintenance work orders. The first approach involves manual labeling of the maintenance work orders by domain experts, using the schema defined in an industrial guideline to assign the label accordingly. The second approach involves the development of a model that automatically labels the maintenance work orders using text classification methods. The third technique uses an AI-assisted tagging tool to tag and structure the raw maintenance information contained in the maintenance work orders. The resulting calculated reliability key performance indicator of the first approach are used as a benchmark for comparison with the results of the second and third approaches. The quality and time spent are considered as criteria for evaluation. Overall, these three methods make extracting maintenance information from maintenance work orders more efficient, enable the assessment of reliability key performance indicators and therefore support the optimization of wind turbine operation and maintenance.

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features

  • paper_url: http://arxiv.org/abs/2311.04046
  • repo_url: https://github.com/edoardopona/predicting-inductive-biases-rl
  • paper_authors: Diogo Cruz, Edoardo Pona, Alex Holness-Tofts, Elias Schmied, Víctor Abia Alonso, Charlie Griffin, Bogdan-Ionut Cirstea
  • for: This paper investigates whether the principles governing inductive biases in the supervised fine-tuning of large language models (LLMs) also apply when the fine-tuning process uses reinforcement learning.
  • methods: Controlled experiments on the reinforcement-learning fine-tuning phase test two hypotheses: that features more extractable after pre-training are more likely to be utilized by the final policy, and that the evidence for or against a feature predicts whether it will be utilized.
  • results: The controlled experiments reveal statistically significant correlations, constituting strong evidence for both hypotheses about inductive biases in the reinforcement learning setting.
    Abstract Many capable large language models (LLMs) are developed via self-supervised pre-training followed by a reinforcement-learning fine-tuning phase, often based on human or AI feedback. During this stage, models may be guided by their inductive biases to rely on simpler features which may be easier to extract, at a cost to robustness and generalisation. We investigate whether principles governing inductive biases in the supervised fine-tuning of LLMs also apply when the fine-tuning process uses reinforcement learning. Following Lovering et al (2021), we test two hypotheses: that features more $\textit{extractable}$ after pre-training are more likely to be utilised by the final policy, and that the evidence for/against a feature predicts whether it will be utilised. Through controlled experiments on synthetic and natural language tasks, we find statistically significant correlations which constitute strong evidence for these hypotheses.

P-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models

  • paper_url: http://arxiv.org/abs/2311.04044
  • repo_url: None
  • paper_authors: Haoran Li, Dadi Guo, Donghao Li, Wei Fan, Qi Hu, Xin Liu, Chunkit Chan, Duanyi Yao, Yangqiu Song
  • for: This work presents a multi-level privacy evaluation benchmark that empirically and intuitively quantifies the privacy leakage of language models (LMs).
  • methods: The benchmark defines privacy leakage in terms of multi-faceted privacy objectives during private fine-tuning and constructs a unified pipeline to perform that fine-tuning.
  • results: Through experiments on three GLUE datasets, the benchmark evaluates the privacy leakage of various privacy-preserving language models (PPLMs).
    Abstract The rapid development of language models (LMs) brings unprecedented accessibility and usage for both models and users. On the one hand, powerful LMs, trained with massive textual data, achieve state-of-the-art performance over numerous downstream NLP tasks. On the other hand, more and more attention is paid to unrestricted model accesses that may bring malicious privacy risks of data leakage. To address these issues, many recent works propose privacy-preserving language models (PPLMs) with differential privacy (DP). Unfortunately, different DP implementations make it challenging for a fair comparison among existing PPLMs. In this paper, we present P-Bench, a multi-perspective privacy evaluation benchmark to empirically and intuitively quantify the privacy leakage of LMs. Instead of only protecting and measuring the privacy of protected data with DP parameters, P-Bench sheds light on the neglected inference data privacy during actual usage. P-Bench first clearly defines multi-faceted privacy objectives during private fine-tuning. Then, P-Bench constructs a unified pipeline to perform private fine-tuning. Lastly, P-Bench performs existing privacy attacks on LMs with pre-defined privacy objectives as the empirical evaluation results. The empirical attack results are used to fairly and intuitively evaluate the privacy leakage of various PPLMs. We conduct extensive experiments on three datasets of GLUE for mainstream LMs.
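The DP implementations that P-Bench compares are not detailed in the abstract, but the standard DP-SGD recipe that PPLMs typically build on (per-example gradient clipping plus calibrated Gaussian noise) can be sketched as follows; parameter names and the flat-list layout are illustrative, not the benchmark's API:

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr=0.1, clip=1.0,
                sigma=1.0, rng=None):
    """One DP-SGD update: clip each example's gradient to L2 norm
    `clip`, average, add Gaussian noise scaled by sigma * clip,
    then take a gradient step."""
    rng = rng or random.Random(0)
    n, d = len(per_example_grads), len(params)
    summed = [0.0] * d
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip / max(norm, 1e-12))  # per-example clipping
        for k in range(d):
            summed[k] += grad[k] * scale
    return [p - lr * (summed[k] / n + rng.gauss(0.0, sigma * clip / n))
            for k, p in enumerate(params)]
```

The clipping bounds each example's influence, so the added noise yields a differential privacy guarantee for the update; the privacy attacks in the benchmark then probe how much leakage remains in practice.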

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

  • paper_url: http://arxiv.org/abs/2311.04257
  • repo_url: https://github.com/x-plug/mplug-owl
  • paper_authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
  • for: This paper introduces a versatile multi-modal large language model, mPLUG-Owl2, designed to improve performance on both text and multi-modal tasks.
  • methods: mPLUG-Owl2 uses a modularized network design in which the language decoder acts as a universal interface across modalities, with shared functional modules to facilitate modality collaboration and a modality-adaptive module that preserves modality-specific features.
  • results: Experiments show that mPLUG-Owl2 generalizes across text and multi-modal tasks and achieves state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for future multi-modal foundation models.
    Abstract Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.

Analyzing Film Adaptation through Narrative Alignment

  • paper_url: http://arxiv.org/abs/2311.04020
  • repo_url: https://github.com/tanzir5/alignment_tool2.0
  • paper_authors: Tanzir Pial, Shahreen Salim, Charuta Pethe, Allen Kim, Steven Skiena
  • for: Studying how source texts are cut and reworked when novels are adapted into films, and what these changes reveal about faithfulness of adaptation, the importance of dialog, preservation of narrative order, and gender representation.
  • methods: The Smith-Waterman local alignment algorithm, coupled with SBERT embedding distance to quantify text similarity between scenes and book units, is used to construct narrative alignments.
  • results: An automated analysis of 40 adaptations yields insights into faithfulness of adaptation, the importance of dialog, preservation of narrative order, and gender representation issues reflective of the Bechdel test.
    Abstract Novels are often adapted into feature films, but the differences between the two media usually require dropping sections of the source text from the movie script. Here we study this screen adaptation process by constructing narrative alignments using the Smith-Waterman local alignment algorithm coupled with SBERT embedding distance to quantify text similarity between scenes and book units. We use these alignments to perform an automated analysis of 40 adaptations, revealing insights into the screenwriting process concerning (i) faithfulness of adaptation, (ii) importance of dialog, (iii) preservation of narrative order, and (iv) gender representation issues reflective of the Bechdel test.
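The core alignment step can be sketched as Smith-Waterman local alignment over a precomputed scene-by-book-unit similarity matrix. In this sketch, `sim[i][j]` stands in for an SBERT-derived similarity score (shifted so that dissimilar pairs are negative), and the linear gap penalty is an illustrative choice rather than the paper's exact parameterization:

```python
def smith_waterman(sim, gap=0.5):
    """Local alignment over a scene-by-chapter similarity matrix.
    Returns the (scene, chapter) index pairs on the best local path."""
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0.0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,                       # restart (local alignment)
                          H[i-1][j-1] + sim[i-1][j-1],  # match scene i to unit j
                          H[i-1][j] - gap,              # skip a scene
                          H[i][j-1] - gap)              # skip a book unit
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    pairs, i, j = [], bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:  # trace back from the best cell
        if H[i][j] == H[i-1][j-1] + sim[i-1][j-1]:
            pairs.append((i-1, j-1)); i -= 1; j -= 1
        elif H[i][j] == H[i-1][j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Because the recurrence floors at zero, the algorithm finds the best-scoring contiguous stretch of matched narrative rather than forcing every scene to align, which is what makes it suitable for adaptations that drop material.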

Exploring Jiu-Jitsu Argumentation for Writing Peer Review Rebuttals

  • paper_url: http://arxiv.org/abs/2311.03998
  • repo_url: None
  • paper_authors: Sukannya Purkayastha, Anne Lauscher, Iryna Gurevych
  • for: This paper studies how arguments across many domains are driven by underlying beliefs and world views (attitude roots and their corresponding attitude themes), and how a Jiu-Jitsu-inspired style of argumentation can make rebuttals to an opponent's arguments more effective.
  • methods: Following the Jiu-Jitsu 'soft' combat system, the approach first identifies an arguer's attitude roots and themes and then selects a prototypical rebuttal aligned with those drivers, rather than directly invalidating the surface-level argument.
  • results: The paper proposes the novel task of attitude and theme-guided rebuttal generation for peer review, enriching an existing discourse-structure dataset with attitude roots, attitude themes, and canonical rebuttals, so that models can learn these drivers and apply them to improve rebuttals.
    Abstract In many domains of argumentation, people's arguments are driven by so-called attitude roots, i.e., underlying beliefs and world views, and their corresponding attitude themes. Given the strength of these latent drivers of arguments, recent work in psychology suggests that instead of directly countering surface-level reasoning (e.g., falsifying given premises), one should follow an argumentation style inspired by the Jiu-Jitsu 'soft' combat system (Hornsey and Fielding, 2017): first, identify an arguer's attitude roots and themes, and then choose a prototypical rebuttal that is aligned with those drivers instead of invalidating those. In this work, we are the first to explore Jiu-Jitsu argumentation for peer review by proposing the novel task of attitude and theme-guided rebuttal generation. To this end, we enrich an existing dataset for discourse structure in peer reviews with attitude roots, attitude themes, and canonical rebuttals. To facilitate this process, we recast established annotation concepts from the domain of peer reviews (e.g., aspects a review sentence is relating to) and train domain-specific models. We then propose strong rebuttal generation strategies, which we benchmark on our novel dataset for the task of end-to-end attitude and theme-guided rebuttal generation and two subtasks.

Factoring Hate Speech: A New Annotation Framework to Study Hate Speech in Social Media

  • paper_url: http://arxiv.org/abs/2311.03969
  • repo_url: None
  • paper_authors: Gal Ron, Effi Levi, Odelia Oshri, Shaul R. Shenhav
  • for: This work proposes a novel annotation scheme that factors hate speech into five separate discursive categories.
  • methods: To evaluate the scheme, the authors construct a corpus of over 2.9M Twitter posts containing hateful expressions directed at Jews and annotate a sample dataset of 1,050 tweets.
  • results: The authors present a statistical analysis of the annotated dataset along with annotation examples, and conclude by discussing promising directions for future work.
    Abstract In this work we propose a novel annotation scheme which factors hate speech into five separate discursive categories. To evaluate our scheme, we construct a corpus of over 2.9M Twitter posts containing hateful expressions directed at Jews, and annotate a sample dataset of 1,050 tweets. We present a statistical analysis of the annotated dataset as well as discuss annotation examples, and conclude by discussing promising directions for future work.

An Analysis of Dialogue Repair in Voice Assistants

  • paper_url: http://arxiv.org/abs/2311.03952
  • repo_url: None
  • paper_authors: Matthew Galbraith
  • for: Investigate the significance of interactional language in dialogue repair between virtual assistants and users.
  • methods: Analyze interactions with Google Assistant and Siri, focusing on their utilization of and response to the other-initiated repair strategy "huh?".
  • results: The analysis reveals several assistant-generated strategies but an inability to replicate human-like repair strategies such as "huh?", with differences in users' repair strategy preferences and assistant usage between English and Spanish speakers.
    Abstract Spoken dialogue systems have transformed human-machine interaction by providing real-time responses to queries. However, misunderstandings between the user and system persist. This study explores the significance of interactional language in dialogue repair between virtual assistants and users by analyzing interactions with Google Assistant and Siri, focusing on their utilization and response to the other-initiated repair strategy "huh?" prevalent in human-human interaction. Findings reveal several assistant-generated strategies but an inability to replicate human-like repair strategies such as "huh?". English and Spanish user acceptability surveys show differences in users' repair strategy preferences and assistant usage, with both similarities and disparities among the two surveyed languages. These results shed light on inequalities between interactional language in human-human interaction and human-machine interaction, underscoring the need for further research on the impact of interactional language in human-machine interaction in English and beyond.

Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition

  • paper_url: http://arxiv.org/abs/2311.03928
  • repo_url: https://github.com/taeheejeon22/morphsubdecomp-korean
  • paper_authors: Taehee Jeon, Bongseok Yang, Changhwan Kim, Yoonseob Lim
  • for: Improving the syntactic and semantic performance of pre-trained language models (PLMs) on Korean.
  • methods: Morpheme-aware subword tokenization is implemented via sub-character decomposition, balancing linguistic accuracy with computational efficiency in PLMs.
  • results: The technique achieves good overall performance on NIKL-CoLA, with notable gains on the syntactic task. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, and that performance can be improved beyond standard morphological analysis.
    Abstract We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by its rich morphology and unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, indicating that adopting more linguistic insights can further improve performance beyond standard morphological analysis.
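Sub-character decomposition of Korean follows directly from the Unicode layout of precomposed Hangul syllables, which encode a leading consonant, vowel, and optional trailing consonant arithmetically. A minimal decomposer (illustrative only; the authors' tokenizer additionally incorporates morpheme analysis) is:

```python
# Precomposed Hangul syllables live at U+AC00..U+D7A3 and are laid out
# as lead * 588 + vowel * 28 + tail, so the jamo are recovered arithmetically.
CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # leading consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # trailing consonants

def decompose(text):
    """Split precomposed Hangul syllables into their constituent jamo,
    passing all other characters through unchanged."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 11172:  # inside the precomposed syllable block
            out.append(CHOSEONG[offset // 588])
            out.append(JUNGSEONG[(offset % 588) // 28])
            if offset % 28:      # tail index 0 means no trailing consonant
                out.append(JONGSEONG[offset % 28])
        else:
            out.append(ch)
    return out
```

Decomposing to jamo before applying BPE exposes shared sub-character structure (e.g., inflectional endings) that whole-syllable tokenization hides.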

iACOS: Advancing Implicit Sentiment Extraction with Informative and Adaptive Negative Examples

  • paper_url: http://arxiv.org/abs/2311.03896
  • repo_url: None
  • paper_authors: Xiancai Xu, Jia-Dong Zhang, Lei Xiong, Zhishang Liu
  • for: This work proposes iACOS, a new quadruple extraction method for extracting implicit aspects, categories, opinions, and sentiments.
  • methods: iACOS comprises four main steps: (1) appending two implicit tokens to the end of a text to obtain context-aware representations of all tokens, including implicit aspects and opinions; (2) co-extracting explicit and implicit aspects and opinions with a sequence labeling model over these representations; (3) discovering aspect-opinion pairs and predicting their categories and sentiments simultaneously with a multi-label classifier equipped with a specialized multi-head attention; and (4) jointly training the multi-label classifier and the two other classifiers via multi-task learning with informative and adaptive negative examples.
  • results: Experiments show that iACOS significantly outperforms other quadruple extraction baselines in F1 score on two public benchmark datasets.
    Abstract Aspect-based sentiment analysis (ABSA) have been extensively studied, but little light has been shed on the quadruple extraction consisting of four fundamental elements: aspects, categories, opinions and sentiments, especially with implicit aspects and opinions. In this paper, we propose a new method iACOS for extracting Implicit Aspects with Categories and Opinions with Sentiments. First, iACOS appends two implicit tokens at the end of a text to capture the context-aware representation of all tokens including implicit aspects and opinions. Second, iACOS develops a sequence labeling model over the context-aware token representation to co-extract explicit and implicit aspects and opinions. Third, iACOS devises a multi-label classifier with a specialized multi-head attention for discovering aspect-opinion pairs and predicting their categories and sentiments simultaneously. Fourth, iACOS leverages informative and adaptive negative examples to jointly train the multi-label classifier and the other two classifiers on categories and sentiments by multi-task learning. Finally, the experimental results show that iACOS significantly outperforms other quadruple extraction baselines according to the F1 score on two public benchmark datasets.

Sparse Contrastive Learning of Sentence Embeddings

  • paper_url: http://arxiv.org/abs/2311.03881
  • repo_url: None
  • paper_authors: Ruize An, Chen Zhang, Dawei Song
  • for: This paper explores parameter sparsification of sentence embedding models as a way to improve their performance.
  • methods: Alignment and uniformity scores are used to measure each parameter's contribution to the overall quality of sentence embeddings, and parameters with minimal contributions are sparsified to zero.
  • results: The results show that the sparsified model (SparseCSE) outperforms SimCSE, and in-depth analysis of the embedding space shows that alignment improves while uniformity remains uncompromised.
    Abstract Recently, SimCSE has shown the feasibility of contrastive learning in training sentence embeddings and illustrates its expressiveness in spanning an aligned and uniform embedding space. However, prior studies have shown that dense models could contain harmful parameters that affect the model performance, and it is no wonder that SimCSE can as well be invented with such parameters. Driven by this, parameter sparsification is applied, where alignment and uniformity scores are used to measure the contribution of each parameter to the overall quality of sentence embeddings. Drawing from a preliminary study, we consider parameters with minimal contributions to be detrimental, as their sparsification results in improved model performance. To discuss the ubiquity of detrimental parameters and remove them, more experiments on the standard semantic textual similarity (STS) tasks and transfer learning tasks are conducted, and the results show that the proposed sparsified SimCSE (SparseCSE) has excellent performance in comparison with SimCSE. Furthermore, through in-depth analysis, we establish the validity and stability of our sparsification method, showcasing that the embedding space generated by SparseCSE exhibits improved alignment compared to that produced by SimCSE. Importantly, the uniformity yet remains uncompromised.
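The alignment and uniformity scores used above are the standard metrics from the contrastive representation learning literature (Wang and Isola, 2020), which SimCSE also reports. On L2-normalized embeddings they can be computed as:

```python
import math

def alignment(xs, ys):
    """Mean squared distance between L2-normalized positive-pair
    embeddings; lower means positives sit closer together."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, y))
               for x, y in zip(xs, ys)) / len(xs)

def uniformity(xs, t=2.0):
    """Log of the mean Gaussian potential over all distinct embedding
    pairs; lower means embeddings spread more uniformly on the sphere."""
    dists = [sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
             for i in range(len(xs)) for j in range(i + 1, len(xs))]
    return math.log(sum(math.exp(-t * d) for d in dists) / len(dists))
```

Scoring a parameter's contribution then amounts to measuring how these two quantities change when that parameter is zeroed out.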

OLaLa: Ontology Matching with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03837
  • repo_url: None
  • paper_authors: Sven Hertling, Heiko Paulheim
  • for: This paper explores how large language models can better incorporate natural-language information into ontology (and, more generally, knowledge graph) matching pipelines.
  • methods: Zero-shot and few-shot prompting with multiple open large language models is applied to different tasks of the Ontology Alignment Evaluation Initiative (OAEI), exploring questions such as prompt design and model selection.
  • results: The study finds that with only a handful of examples and a well-designed prompt, results on par with supervised matching systems, which use a much larger portion of the ground truth, can be achieved.
    Abstract Ontology (and more generally: Knowledge Graph) Matching is a challenging task where information in natural language is one of the most important signals to process. With the rise of Large Language Models, it is possible to incorporate this knowledge in a better way into the matching pipeline. A number of decisions still need to be taken, e.g., how to generate a prompt that is useful to the model, how information in the KG can be formulated in prompts, which Large Language Model to choose, how to provide existing correspondences to the model, how to generate candidates, etc. In this paper, we present a prototype that explores these questions by applying zero-shot and few-shot prompting with multiple open Large Language Models to different tasks of the Ontology Alignment Evaluation Initiative (OAEI). We show that with only a handful of examples and a well-designed prompt, it is possible to achieve results that are en par with supervised matching systems which use a much larger portion of the ground truth.
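One way the few-shot prompting could be assembled is as a short instruction followed by labeled demonstrations and the candidate pair to be judged. The wording, field names, and yes/no label format below are hypothetical choices for illustration; the paper evaluates several prompt designs:

```python
def build_matching_prompt(source, target, examples):
    """Assemble a few-shot prompt asking whether two ontology concepts
    match. `examples` is a list of (concept1, concept2, label)
    demonstrations, e.g. labels "yes"/"no"."""
    parts = ["Decide whether the two concepts refer to the "
             "same real-world entity."]
    for s, t, label in examples:
        parts.append(f"Concept 1: {s}\nConcept 2: {t}\nSame entity: {label}")
    # The final block leaves the label blank for the model to complete.
    parts.append(f"Concept 1: {source}\nConcept 2: {target}\nSame entity:")
    return "\n\n".join(parts)
```

Candidate pairs would typically come from a cheap blocking step (e.g., label similarity) so that the model only judges plausible correspondences.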

Conversations in Galician: a Large Language Model for an Underrepresented Language

  • paper_url: http://arxiv.org/abs/2311.03812
  • repo_url: https://gitlab.irlab.org/irlab/cabuxa
  • paper_authors: Eliseo Bao, Anxo Pérez, Javier Parapar
  • for: This work aims to advance natural language processing (NLP) for the Galician language and to better include underrepresented linguistic communities in the development of large language models.
  • methods: Two novel resources are introduced: a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations, and a fine-tuned version of the LLaMA-7B model.
  • results: Fine-tuning on the Galician Alpaca dataset enables LLaMA-7B to comprehend and respond in Galician, a language the model did not originally support, and knowledge of the closely related Portuguese language assists in generating coherent text when training resources are scarce.
    Abstract The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technologies in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model, by following the Alpaca format. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case, Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and Cabuxa-7B are publicly accessible on our Huggingface Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.
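As a concrete illustration, the Alpaca instruction template that the paper says its Galician data follows can be rendered like this (a minimal sketch: the field layout is taken from the original Alpaca release, and the example instruction is hypothetical):

```python
def alpaca_prompt(instruction, inp=None):
    """Render one example in the standard Alpaca template.

    The paper fine-tunes LLaMA-7B on Galician data "following the Alpaca
    format"; this reproduces that template. Field names come from the
    original Alpaca release, not from this paper.
    """
    header = ("Below is an instruction that describes a task"
              + (", paired with an input that provides further context" if inp else "")
              + ". Write a response that appropriately completes the request.\n\n")
    body = f"### Instruction:\n{instruction}\n\n"
    if inp:
        body += f"### Input:\n{inp}\n\n"
    return header + body + "### Response:\n"

# Hypothetical Galician instruction, for illustration only.
prompt = alpaca_prompt("Traduce ao galego: 'Good morning'")
```

During fine-tuning, the model is trained to continue such prompts with the demonstration text after `### Response:`.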

Noisy Pair Corrector for Dense Retrieval

  • paper_url: http://arxiv.org/abs/2311.03798
  • repo_url: None
  • paper_authors: Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo
  • for: This work examines an implicit assumption in dense retrieval: that training query-document pairs are exactly matched. Because manually annotating a corpus is expensive, training pairs are usually collected automatically, which inevitably introduces mismatched-pair noise.
  • methods: The authors propose Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module estimates noisy pairs from the perplexity between annotated positive and easy negative documents; the correction module uses an exponential moving average (EMA) model to provide a soft supervision signal that mitigates the effect of noise.
  • results: Experiments on text-retrieval benchmarks such as Natural Questions and TriviaQA show that NPC effectively handles both synthetic and realistic noise.
    Abstract Most dense retrieval models contain an implicit assumption: the training query-document pairs are exactly matched. Since it is expensive to annotate the corpus manually, training pairs in real-world applications are usually collected automatically, which inevitably introduces mismatched-pair noise. In this paper, we explore an interesting and challenging problem in dense retrieval, how to train an effective model with mismatched-pair noise. To solve this problem, we propose a novel approach called Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module estimates noise pairs by calculating the perplexity between annotated positive and easy negative documents. The correction module utilizes an exponential moving average (EMA) model to provide a soft supervised signal, aiding in mitigating the effects of noise. We conduct experiments on text-retrieval benchmarks Natural Question and TriviaQA, code-search benchmarks StaQC and SO-DS. Experimental results show that NPC achieves excellent performance in handling both synthetic and realistic noise.
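The EMA soft-supervision idea at the core of the correction module can be sketched in a few lines (illustrative only: real NPC operates on model tensors, and the decay value here is an assumption):

```python
def ema_update(teacher, student, decay=0.999):
    """One EMA step: teacher <- decay * teacher + (1 - decay) * student.

    Sketch of the soft-supervision idea in NPC: an exponential moving
    average of the student's parameters yields a slowly evolving teacher
    whose scores are more robust to mismatched-pair noise. Parameters are
    plain lists of floats here; a real implementation would operate on
    model tensors.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, decay=0.9)  # moves slightly toward student
```

The teacher's predictions then serve as soft targets alongside the (possibly noisy) hard labels.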

Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment

  • paper_url: http://arxiv.org/abs/2311.03792
  • repo_url: None
  • paper_authors: Jakir Hasan, Shrestha Datta, Ameya Debnath
  • for: The paper develops an Artificial Intelligence and Machine Learning model to map Bangla words to their International Phonetic Alphabet (IPA) representations.
  • methods: The authors use a transformer-based sequence-to-sequence model at the letter and symbol level to map Bangla words to their IPA representations, and utilize manual mapping to handle punctuation marks and foreign-language text.
  • results: The authors achieve the top position in the public ranking of the DataVerse Challenge - ITVerse 2023 with a word error rate of 0.10582.
    Abstract The International Phonetic Alphabet (IPA) is indispensable in language learning and understanding, aiding users in accurate pronunciation and comprehension. Additionally, it plays a pivotal role in speech therapy, linguistic research, accurate transliteration, and the development of text-to-speech systems, making it an essential tool across diverse fields. Bangla being 7th as one of the widely used languages, gives rise to the need for IPA in its domain. Its IPA mapping is too diverse to be captured manually giving the need for Artificial Intelligence and Machine Learning in this field. In this study, we have utilized a transformer-based sequence-to-sequence model at the letter and symbol level to get the IPA of each Bangla word as the variation of IPA in association of different words is almost null. Our transformer model only consisted of 8.5 million parameters with only a single decoder and encoder layer. Additionally, to handle the punctuation marks and the occurrence of foreign languages in the text, we have utilized manual mapping as the model won't be able to learn to separate them from Bangla words while decreasing our required computational resources. Finally, maintaining the relative position of the sentence component IPAs and generation of the combined IPA has led us to achieve the top position with a word error rate of 0.10582 in the public ranking of DataVerse Challenge - ITVerse 2023 (https://www.kaggle.com/competitions/dataverse_2023/).
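The hybrid routing the paper describes, manual mapping for punctuation and foreign tokens with the learned model reserved for Bangla words, can be sketched as follows (the mapping entries and the stand-in model are hypothetical):

```python
# Sketch of the hybrid pipeline: punctuation and non-Bangla tokens are
# handled by a fixed manual map, and only Bangla words are routed to the
# character-level transformer (stubbed out here). The mapping entries are
# illustrative, not the authors' actual table.
MANUAL_MAP = {",": ",", ".": ".", "?": "?", "ok": "oke"}

def transcribe(tokens, model):
    out = []
    for tok in tokens:
        if tok in MANUAL_MAP:          # punctuation / foreign word
            out.append(MANUAL_MAP[tok])
        else:                          # Bangla word -> learned model
            out.append(model(tok))
    return " ".join(out)

fake_model = lambda w: f"/{w}/"        # stand-in for the transformer
ipa = transcribe(["word", ",", "ok"], fake_model)
```

Keeping the manual cases out of the model both reduces the required compute and preserves the relative positions of sentence components in the combined IPA output.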

Language Representation Projection: Can We Transfer Factual Knowledge across Languages in Multilingual Language Models?

  • paper_url: http://arxiv.org/abs/2311.03788
  • repo_url: None
  • paper_authors: Shaoyang Xu, Junzhuo Li, Deyi Xiong
  • for: This work investigates whether multilingual pretrained language models can explicitly transfer rich factual knowledge from English to non-English languages.
  • methods: The authors propose two parameter-free Language Representation Projection modules (LRP2): the first converts non-English representations into English-like equivalents, and the second reverts those English-like representations back into representations of the corresponding non-English language.
  • results: Experiments show that LRP2 significantly improves factual-knowledge retrieval accuracy and facilitates knowledge transfer across diverse non-English languages. The authors further analyze the working mechanism of LRP2 from the perspectives of representation space and cross-lingual knowledge neurons.
    Abstract Multilingual pretrained language models serve as repositories of multilingual factual knowledge. Nevertheless, a substantial performance gap of factual knowledge probing exists between high-resource languages and low-resource languages, suggesting limited implicit factual knowledge transfer across languages in multilingual pretrained language models. This paper investigates the feasibility of explicitly transferring relatively rich factual knowledge from English to non-English languages. To accomplish this, we propose two parameter-free $\textbf{L}$anguage $\textbf{R}$epresentation $\textbf{P}$rojection modules (LRP2). The first module converts non-English representations into English-like equivalents, while the second module reverts English-like representations back into representations of the corresponding non-English language. Experimental results on the mLAMA dataset demonstrate that LRP2 significantly improves factual knowledge retrieval accuracy and facilitates knowledge transferability across diverse non-English languages. We further investigate the working mechanism of LRP2 from the perspectives of representation space and cross-lingual knowledge neuron.
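One natural parameter-free instantiation of such projection modules is a mean-shift in representation space (a sketch under that assumption; the paper's exact operation may differ):

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def project(h, src_mean, tgt_mean):
    """Shift a hidden state from the source-language region of the
    representation space toward the target-language region.

    A minimal parameter-free sketch of the LRP2 idea: the forward module
    maps non-English states toward English by adding the difference of
    language-mean vectors, and the reverse module undoes the shift.
    """
    return [hi - s + t for hi, s, t in zip(h, src_mean, tgt_mean)]

de_mean = mean([[1.0, 2.0], [3.0, 4.0]])       # toy German-language mean
en_mean = mean([[0.0, 0.0], [2.0, 2.0]])       # toy English-language mean
h = [1.0, 1.0]
h_en_like = project(h, de_mean, en_mean)       # forward module
h_back = project(h_en_like, en_mean, de_mean)  # reverse module restores h
```

Because both modules are simple shifts, no parameters need to be trained, matching the "parameter-free" property claimed in the paper.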

Gender Inflected or Bias Inflicted: On Using Grammatical Gender Cues for Bias Evaluation in Machine Translation

  • paper_url: http://arxiv.org/abs/2311.03767
  • repo_url: https://github.com/iampushpdeep/gender-bias-hi-en-eval
  • paper_authors: Pushpdeep Singh
  • for: This work evaluates social bias, in particular gender bias, in Neural Machine Translation (NMT) models, with a focus on source languages other than English.
  • methods: Using Hindi as the source language, the author constructs two sets of gender-specific sentences, OTSC-Hindi and WinoMT-Hindi, to automatically evaluate gender bias in different Hindi-English (HI-EN) NMT systems.
  • results: The evaluation tests whether NMT models identify the correct gender from grammatical gender cues in the source sentence rather than relying on biased correlations with, for example, occupation terms. The study highlights the importance of considering the nature of the language when designing such extrinsic bias evaluation datasets.
    Abstract Neural Machine Translation (NMT) models are state-of-the-art for machine translation. However, these models are known to have various social biases, especially gender bias. Most of the work on evaluating gender bias in NMT has focused primarily on English as the source language. For source languages different from English, most of the studies use gender-neutral sentences to evaluate gender bias. However, practically, many sentences that we encounter do have gender information. Therefore, it makes more sense to evaluate for bias using such sentences. This allows us to determine if NMT models can identify the correct gender based on the grammatical gender cues in the source sentence rather than relying on biased correlations with, say, occupation terms. To demonstrate our point, in this work, we use Hindi as the source language and construct two sets of gender-specific sentences: OTSC-Hindi and WinoMT-Hindi that we use to evaluate different Hindi-English (HI-EN) NMT systems automatically for gender bias. Our work highlights the importance of considering the nature of language when designing such extrinsic bias evaluation datasets.

Multilingual Mathematical Autoformalization

  • paper_url: http://arxiv.org/abs/2311.03755
  • repo_url: https://github.com/albertqjiang/mma
  • paper_authors: Albert Q. Jiang, Wenda Li, Mateja Jamnik
  • for: This paper introduces MMA, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs for mathematical autoformalization.
  • methods: The dataset is created by using a language model to translate in the reverse direction, from formal mathematical statements into corresponding informal ones.
  • results: Language models fine-tuned on the dataset produce statements on the miniF2F and ProofNet benchmarks that are acceptable with minimal corrections 16-18% of the time, up from 0% with the base model. Fine-tuning on multilingual formal data also yields more capable autoformalization models, even when they are deployed on monolingual tasks.
    Abstract Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset consisting of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models. But these methods suffer from data scarcity and formal language acquisition difficulty. In this work, we create $\texttt{MMA}$, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on $\texttt{MMA}$ produce $16-18\%$ of statements acceptable with minimal corrections on the $\texttt{miniF2F}$ and $\texttt{ProofNet}$ benchmarks, up from $0\%$ with the base model. We demonstrate that fine-tuning on multilingual formal data results in more capable autoformalization models even when deployed on monolingual tasks.

Which is better? Exploring Prompting Strategy For LLM-based Metrics

  • paper_url: http://arxiv.org/abs/2311.03754
  • repo_url: None
  • paper_authors: Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, Seung Hun Han, Jiyoon Lee, Pilsung Kang
  • for: This work explores using Large Language Models (LLMs) to evaluate the quality of Natural Language Generation (NLG), aiming at more reliable system evaluation for NLG tasks.
  • methods: The study systematically analyzes a wide range of prompts and prompting techniques, compares three score-aggregation strategies, and devises a strategy for generating rationales that explain LLM-based evaluation scores.
  • results: Effective prompting and aggregation strategies improve the accuracy of NLG quality evaluation, and the experiments provide insights into the evaluation capabilities of open-source LLMs.
    Abstract This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task, where systems were submitted to two tracks: small and large summarization tracks. With advanced Large Language Models (LLMs) such as GPT-4, evaluating the quality of Natural Language Generation (NLG) has become increasingly paramount. Traditional similarity-based metrics such as BLEU and ROUGE have shown to misalign with human evaluation and are ill-suited for open-ended generation tasks. To address this issue, we explore the potential capability of LLM-based metrics, especially leveraging open-source LLMs. In this study, wide range of prompts and prompting techniques are systematically analyzed with three approaches: prompting strategy, score aggregation, and explainability. Our research focuses on formulating effective prompt templates, determining the granularity of NLG quality scores and assessing the impact of in-context examples on LLM-based evaluation. Furthermore, three aggregation strategies are compared to identify the most reliable method for aggregating NLG quality scores. To examine explainability, we devise a strategy that generates rationales for the scores and analyzes the characteristics of the explanation produced by the open-source LLMs. Extensive experiments provide insights regarding evaluation capabilities of open-source LLMs and suggest effective prompting strategies.
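Aggregating several sampled LLM ratings into one score can be sketched as follows (mean/median/mode is an illustrative trio of strategies; the three strategies compared in the paper may differ):

```python
from statistics import mean, median, mode

def aggregate(scores, strategy="mean"):
    """Aggregate several LLM-sampled quality scores into one.

    The paper compares three aggregation strategies for LLM-based NLG
    evaluation; the specific choices here are common defaults, not
    necessarily the authors'.
    """
    if strategy == "mean":
        return mean(scores)
    if strategy == "median":
        return median(scores)
    if strategy == "mode":
        return mode(scores)
    raise ValueError(f"unknown strategy: {strategy}")

samples = [4, 5, 4, 3, 4]  # e.g. five sampled 1-5 ratings for one summary
```

Sampling the judge several times and aggregating reduces the variance of any single LLM rating.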

Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning

  • paper_url: http://arxiv.org/abs/2311.03748
  • repo_url: https://github.com/psunlpgroup/fish-dip
  • paper_authors: Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Peng Shi, Wenpeng Yin, Rui Zhang
  • for: This paper aims to improve fine-tuning for unified sequence labeling so that large language model knowledge can be better leveraged for structured prediction.
  • methods: The authors propose FISH-DIP, a sample-aware dynamic sparse fine-tuning strategy that selectively updates only a fraction of the parameters during fine-tuning, guided by feedback from highly regressing examples.
  • results: Across five sequence labeling tasks, FISH-DIP smoothly optimizes the model in low-resource settings, yielding up to 40% performance improvement over full fine-tuning depending on the target evaluation setting. Compared to in-context learning and other parameter-efficient fine-tuning approaches, FISH-DIP performs comparably or better, notably in extreme low-resource settings.
    Abstract Unified Sequence Labeling that articulates different sequence labeling problems such as Named Entity Recognition, Relation Extraction, Semantic Role Labeling, etc. in a generalized sequence-to-sequence format opens up the opportunity to make the maximum utilization of large language model knowledge toward structured prediction. Unfortunately, this requires formatting them into specialized augmented format unknown to the base pretrained language model (PLMs) necessitating finetuning to the target format. This significantly bounds its usefulness in data-limited settings where finetuning large models cannot properly generalize to the target format. To address this challenge and leverage PLM knowledge effectively, we propose FISH-DIP, a sample-aware dynamic sparse finetuning strategy that selectively focuses on a fraction of parameters, informed by feedback from highly regressing examples, during the fine-tuning process. By leveraging the dynamism of sparsity, our approach mitigates the impact of well-learned samples and prioritizes underperforming instances for improvement in generalization. Across five tasks of sequence labeling, we demonstrate that FISH-DIP can smoothly optimize the model in low resource settings offering upto 40% performance improvements over full fine-tuning depending on target evaluation settings. Also, compared to in-context learning and other parameter-efficient fine-tuning approaches, FISH-DIP performs comparably or better, notably in extreme low-resource settings.
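The dynamic sparse selection can be sketched as picking the top-scoring parameters by importance (a toy version: FISH-DIP derives its scores from feedback on highly regressing examples and operates on model tensors, not flat lists):

```python
def sparse_mask(importance_scores, keep_ratio=0.1):
    """Select the indices of the highest-importance parameters.

    Sketch of the dynamic sparse selection in FISH-DIP: per-parameter
    importance scores (e.g. gradient magnitudes accumulated from the
    worst-performing samples) determine the small fraction of parameters
    that will be updated this period; the rest stay frozen. The mask is
    recomputed periodically, which makes the sparsity dynamic.
    """
    k = max(1, int(len(importance_scores) * keep_ratio))
    ranked = sorted(range(len(importance_scores)),
                    key=lambda i: importance_scores[i], reverse=True)
    return set(ranked[:k])

mask = sparse_mask([0.1, 0.9, 0.05, 0.7, 0.2, 0.01, 0.3, 0.6, 0.02, 0.4],
                   keep_ratio=0.2)
```

At each optimizer step, only parameters whose indices are in the mask receive gradient updates.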

Leveraging Structured Information for Explainable Multi-hop Question Answering and Reasoning

  • paper_url: http://arxiv.org/abs/2311.03734
  • repo_url: https://github.com/bcdnlp/structure-qa
  • paper_authors: Ruosen Li, Xinya Du
  • for: To improve the reasoning capability and interpretability of multi-hop question-answering models.
  • methods: A chain-of-thought mechanism generates both the reasoning chain and the answer, and extracted semantic structures (graphs) are leveraged for multi-hop question answering.
  • results: The framework substantially improves performance on two benchmark datasets, and the extracted structures themselves provide grounded explanations that humans prefer over generated reasoning chains and saliency-based explanations.
    Abstract Neural models, including large language models (LLMs), achieve superior performance on multi-hop question-answering. To elicit reasoning capabilities from LLMs, recent works propose using the chain-of-thought (CoT) mechanism to generate both the reasoning chain and the answer, which enhances the model's capabilities in conducting multi-hop reasoning. However, several challenges still remain: such as struggling with inaccurate reasoning, hallucinations, and lack of interpretability. On the other hand, information extraction (IE) identifies entities, relations, and events grounded to the text. The extracted structured information can be easily interpreted by humans and machines (Grishman, 2019). In this work, we investigate constructing and leveraging extracted semantic structures (graphs) for multi-hop question answering, especially the reasoning process. Empirical results and human evaluations show that our framework: generates more faithful reasoning chains and substantially improves the QA performance on two benchmark datasets. Moreover, the extracted structures themselves naturally provide grounded explanations that are preferred by humans, as compared to the generated reasoning chains and saliency-based explanations.

Learning to Learn for Few-shot Continual Active Learning

  • paper_url: http://arxiv.org/abs/2311.03732
  • repo_url: None
  • paper_authors: Stella Ho, Ming Liu, Shang Gao, Longxiang Gao
  • for: This paper addresses the continual active learning (CAL) setting, which must balance stability and plasticity when labeled data is scarce and unlabeled data is abundant but the annotation budget is limited.
  • methods: The authors propose a simple yet efficient method, Meta-Continual Active Learning, which uses meta-learning and experience replay to address the trade-off between stability and plasticity.
  • results: Experimental results show that random sampling is the best default strategy for both active learning and memory sample selection in few-shot CAL problems.
    Abstract Continual learning strives to ensure stability in solving previously seen tasks while demonstrating plasticity in a novel domain. Recent advances in CL are mostly confined to a supervised learning setting, especially in NLP domain. In this work, we consider a few-shot continual active learning (CAL) setting where labeled data is inadequate, and unlabeled data is abundant but with a limited annotation budget. We propose a simple but efficient method, called Meta-Continual Active Learning. Specifically, we employ meta-learning and experience replay to address the trade-off between stability and plasticity. As a result, it finds an optimal initialization that efficiently utilizes annotated information for fast adaptation while preventing catastrophic forgetting of past tasks. We conduct extensive experiments to validate the effectiveness of the proposed method and analyze the effect of various active learning strategies and memory sample selection methods in a few-shot CAL setup. Our experiment results demonstrate that random sampling is the best default strategy for both active learning and memory sample selection to solve few-shot CAL problems.
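The winning baseline, plain random sampling for both query acquisition and memory selection, is trivially small to write down (a sketch):

```python
import random

def random_acquire(unlabeled_ids, budget, seed=0):
    """Random-sampling acquisition for active learning.

    The paper finds plain random sampling to be the best default for both
    query selection and memory-sample selection in few-shot continual
    active learning; this is that baseline, nothing more. A fixed seed
    makes the selection reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(unlabeled_ids, min(budget, len(unlabeled_ids)))

picked = random_acquire(list(range(100)), budget=5)
```

The same function can be reused to pick which labeled examples go into the replay memory.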

A Survey of Large Language Models Attribution

  • paper_url: http://arxiv.org/abs/2311.03731
  • repo_url: https://github.com/HITsz-TMG/awesome-llm-attributions
  • paper_authors: Dongfang Li, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Ziyang Chen, Baotian Hu, Aiguo Wu, Min Zhang
  • for: This paper surveys the attribution mechanisms used by open-domain generative systems, especially large language models.
  • methods: It summarizes a variety of attribution approaches, including weight-based attribution, word-embedding attribution, and recommendation-based attribution.
  • results: Attribution mechanisms can improve the reliability and factuality of open-domain generative systems, but problems remain, such as ambiguous knowledge reservoirs, inherent biases, and the drawbacks of excessive attribution.
    Abstract Open-domain generative systems have gained significant attention in the field of conversational AI (e.g., generative search engines). This paper presents a comprehensive review of the attribution mechanisms employed by these systems, particularly large language models. Though attribution or citation improve the factuality and verifiability, issues like ambiguous knowledge reservoirs, inherent biases, and the drawbacks of excessive attribution can hinder the effectiveness of these systems. The aim of this survey is to provide valuable insights for researchers, aiding in the refinement of attribution methodologies to enhance the reliability and veracity of responses generated by open-domain generative systems. We believe that this field is still in its early stages; hence, we maintain a repository to keep track of ongoing studies at https://github.com/HITsz-TMG/awesome-llm-attributions.

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

  • paper_url: http://arxiv.org/abs/2311.03696
  • repo_url: https://github.com/shyyhs/CourseraParallelCorpusMining
  • paper_authors: Haiyue Song, Raj Dabre, Chenhui Chu, Atsushi Fujita, Sadao Kurohashi
  • for: This work aims to improve translation quality for online course transcripts, but building a high-quality lecture machine translation system lacks publicly available parallel corpora. The authors therefore propose a parallel corpus mining framework that quickly and effectively mines parallel corpora from publicly available lectures.
  • methods: A dynamic programming based sentence alignment algorithm that leverages the cosine similarity of machine-translated sentences, evaluated against alignment methods based on BERTScore, LASER, and sentBERT.
  • results: Machine translation experiments show that the mined corpora yield high-quality lecture translations, and that with appropriate evaluation methods and data preprocessing, strong translation quality is reached across the studied translation tasks.
    Abstract Lecture transcript translation helps learners understand online courses, however, building a high-quality lecture machine translation system lacks publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than using the BERTScore, LASER, or sentBERT methods. For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. Furthermore, this study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.
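The dynamic-programming sentence alignment over cosine similarities of (machine-translated) sentence vectors can be sketched as follows (a toy monotonic 1-1 alignment with an assumed gap penalty; the paper's scoring and alignment moves may differ):

```python
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def align(src_vecs, tgt_vecs, gap=-0.2):
    """Monotonic sentence alignment by dynamic programming.

    Cell (i, j) holds the best score for aligning the first i source and
    first j target sentences; a match is rewarded by the cosine
    similarity of the two sentence vectors, and skipping a sentence on
    either side pays an (assumed) gap penalty.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i and j:
                cands.append((best[i - 1][j - 1]
                              + cosine(src_vecs[i - 1], tgt_vecs[j - 1]), "match"))
            if i:
                cands.append((best[i - 1][j] + gap, "skip_src"))
            if j:
                cands.append((best[i][j - 1] + gap, "skip_tgt"))
            best[i][j], back[i][j] = max(cands)
    pairs, i, j = [], n, m
    while i or j:                      # backtrace the best path
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

src = [[1.0, 0.0], [0.0, 1.0]]  # toy source-sentence embeddings
tgt = [[1.0, 0.1], [0.1, 1.0]]  # toy target-sentence embeddings
pairs = align(src, tgt)
```

In the paper's pipeline, the source sentences are first machine-translated into the target language so that both sides can be embedded in the same space before scoring.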

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03687
  • repo_url: None
  • paper_authors: Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu
  • for: This work benchmarks the pre-training, fine-tuning, and serving performance of large language models (LLMs) to help users choose hardware and software configurations suited to their needs.
  • methods: The study evaluates several optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention.
  • results: Runtime performance varies significantly across hardware and software stacks, and a module-wise analysis of LLM sub-modules reveals potential opportunities for future work to further optimize runtime performance.
    Abstract Large Language Models (LLMs) have seen great advance in both academia and industry, and their popularity results in numerous open-source frameworks and techniques in accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs are expensive as it requires considerable computing resources and memory, hence many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes , i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B) on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help better understand different optimization techniques, training and inference frameworks, together with hardware platforms in choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses discover potential opportunities for future work to further optimize the runtime performance of LLMs.

CBSiMT: Mitigating Hallucination in Simultaneous Machine Translation with Weighted Prefix-to-Prefix Training

  • paper_url: http://arxiv.org/abs/2311.03672
  • repo_url: None
  • paper_authors: Mengge Liu, Wen Zhang, Xiang Li, Yanzhi Tian, Yuhang Guo, Jian Luan, Bin Wang, Shuoying Chen
  • for: To improve the quality and stability of simultaneous machine translation (SiMT), which must begin translating before the full source sentence is available.
  • methods: Prefix-to-prefix training predicts target tokens from partial source prefixes, but word-order differences between languages can misalign prefix pairs and cause hallucinations, i.e., target outputs that are unfaithful to the source input.
  • results: The proposed Confidence-Based Simultaneous Machine Translation (CBSiMT) framework uses model confidence to detect hallucination tokens and mitigates their impact with weighted prefix-to-prefix training. Experiments show consistent quality improvements across latency regimes, with gains of up to 2 BLEU at low latency.
    Abstract Simultaneous machine translation (SiMT) is a challenging task that requires starting translation before the full source sentence is available. Prefix-to-prefix framework is often applied to SiMT, which learns to predict target tokens using only a partial source prefix. However, due to the word order difference between languages, misaligned prefix pairs would make SiMT models suffer from serious hallucination problems, i.e. target outputs that are unfaithful to source inputs. Such problems can not only produce target tokens that are not supported by the source prefix, but also hinder generating the correct translation by receiving more source words. In this work, we propose a Confidence-Based Simultaneous Machine Translation (CBSiMT) framework, which uses model confidence to perceive hallucination tokens and mitigates their negative impact with weighted prefix-to-prefix training. Specifically, token-level and sentence-level weights are calculated based on model confidence and acted on the loss function. We explicitly quantify the faithfulness of the generated target tokens using the token-level weight, and employ the sentence-level weight to alleviate the disturbance of sentence pairs with serious word order differences on the model. Experimental results on MuST-C English-to-Chinese and WMT15 German-to-English SiMT tasks demonstrate that our method can consistently improve translation quality at most latency regimes, with up to 2 BLEU scores improvement at low latency.
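The weighted prefix-to-prefix loss can be sketched for a single sentence pair (using model confidence directly as the token-level weight is an illustrative choice, not the paper's exact formula):

```python
import math

def weighted_nll(gold_probs, confidences, sent_weight):
    """Confidence-weighted negative log-likelihood for one sentence pair.

    Sketch of the CBSiMT idea: each target token's loss is scaled by a
    token-level weight derived from model confidence (low-confidence,
    likely-hallucinated tokens count less), and the whole sentence is
    scaled by a sentence-level weight that down-weights pairs with severe
    word-order differences.
    """
    token_losses = [-c * math.log(p) for p, c in zip(gold_probs, confidences)]
    return sent_weight * sum(token_losses)

# Two gold-token probabilities; the second token is weighted down by 0.5.
loss = weighted_nll(gold_probs=[0.5, 0.25], confidences=[1.0, 0.5],
                    sent_weight=1.0)
```

With all weights set to 1, this reduces to ordinary prefix-to-prefix cross-entropy.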

Principles from Clinical Research for NLP Model Generalization

  • paper_url: http://arxiv.org/abs/2311.03663
  • repo_url: None
  • paper_authors: Aparna Elangovan, Jiayuan He, Yuan Li, Karin Verspoor
  • for: This work examines the generalizability of NLP models and the various factors that affect it.
  • methods: It draws on clinical research practice, where rigorous experimental design ensures internal validity, and analyzes how models generalize across different datasets.
  • results: Model performance across datasets is affected by many factors, including spurious correlations in the data; the paper also offers guidance on how to analyze generalization failures.
    Abstract The NLP community typically relies on performance of a model on a held-out test set to assess generalization. Performance drops observed in datasets outside of official test sets are generally attributed to "out-of-distribution'' effects. Here, we explore the foundations of generalizability and study the various factors that affect it, articulating generalizability lessons from clinical studies. In clinical research generalizability depends on (a) internal validity of experiments to ensure controlled measurement of cause and effect, and (b) external validity or transportability of the results to the wider population. We present the need to ensure internal validity when building machine learning models in natural language processing, especially where results may be impacted by spurious correlations in the data. We demonstrate how spurious factors, such as the distance between entities in relation extraction tasks, can affect model internal validity and in turn adversely impact generalization. We also offer guidance on how to analyze generalization failures.

Innovation and Word Usage Patterns in Machine Learning

  • paper_url: http://arxiv.org/abs/2311.03633
  • repo_url: https://github.com/vitorbborges/monografia-PET22
  • paper_authors: Vítor Bandeira Borges, Daniel Oliveira Cajueiro
  • for: This study examines the evolving landscape of machine learning research.
  • methods: Latent Dirichlet Allocation is used to identify pivotal themes and fundamental concepts in machine learning, followed by a comprehensive analysis tracking the evolutionary trajectories of those themes.
  • results: Using the Kullback-Leibler divergence as a proxy for "surprise," the study quantifies the novelty and divergence of research contributions and identifies the pivotal roles played by prominent researchers and academic venues in the machine learning domain.
    Abstract In this study, we delve into the dynamic landscape of machine learning research evolution. Initially, through the utilization of Latent Dirichlet Allocation, we discern pivotal themes and fundamental concepts that have emerged within the realm of machine learning. Subsequently, we undertake a comprehensive analysis to track the evolutionary trajectories of these identified themes. To quantify the novelty and divergence of research contributions, we employ the Kullback-Leibler Divergence metric. This statistical measure serves as a proxy for ``surprise'', indicating the extent of differentiation between the content of academic papers and the subsequent developments in research. By amalgamating these insights, we gain the ability to ascertain the pivotal roles played by prominent researchers and the significance of specific academic venues (periodicals and conferences) within the machine learning domain.
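The "surprise" measure described above can be sketched as the KL divergence between a paper's LDA topic distribution and the field's aggregate topic mixture. The four-topic distributions below are hypothetical, purely to show the direction of the comparison:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over discrete topic distributions, in nats.

    Larger values mean p diverges more from the reference mixture q,
    serving as a proxy for the novelty ("surprise") of a contribution.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 4-topic LDA mixtures: the field's aggregate distribution
# versus two new papers -- one conventional, one divergent.
field = [0.4, 0.3, 0.2, 0.1]
conventional = [0.38, 0.32, 0.2, 0.1]
divergent = [0.05, 0.05, 0.1, 0.8]
assert kl_divergence(conventional, field) < kl_divergence(divergent, field)
```

A paper whose topic mixture matches the field scores near zero; a paper concentrated on an underrepresented topic scores high, flagging it as a candidate innovation.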

GNAT: A General Narrative Alignment Tool

  • paper_url: http://arxiv.org/abs/2311.03627
  • repo_url: None
  • paper_authors: Tanzir Pial, Steven Skiena
  • for: Comparing and aligning narrative documents, such as translations, retellings, summaries, and abridgements.
  • methods: Couples the Smith-Waterman algorithm from bioinformatics with modern text similarity metrics.
  • results: Applies to document pairs of very different relative and absolute lengths, and fits the background of alignment scores to a Gumbel distribution, enabling rigorous p-values on the significance of any alignment.
    Abstract Algorithmic sequence alignment identifies similar segments shared between pairs of documents, and is fundamental to many NLP tasks. But it is difficult to recognize similarities between distant versions of narratives such as translations and retellings, particularly for summaries and abridgements which are much shorter than the original novels. We develop a general approach to narrative alignment coupling the Smith-Waterman algorithm from bioinformatics with modern text similarity metrics. We show that the background of alignment scores fits a Gumbel distribution, enabling us to define rigorous p-values on the significance of any alignment. We apply and evaluate our general narrative alignment tool (GNAT) on four distinct problem domains differing greatly in both the relative and absolute length of documents, namely summary-to-book alignment, translated book alignment, short story alignment, and plagiarism detection -- demonstrating the power and performance of our methods.
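The core idea of coupling Smith-Waterman with a text similarity metric can be sketched as local alignment over sentence sequences, where the substitution score comes from a similarity function rather than a biological scoring matrix. A minimal sketch; the word-overlap `sim` function, the gap penalty, and the toy book/summary data are illustrative assumptions, not GNAT's actual components:

```python
def smith_waterman(a, b, sim, gap=-0.5):
    """Smith-Waterman local alignment of two sentence sequences, scoring
    cell (i, j) with a text-similarity function instead of a substitution
    matrix. Returns the best local alignment score."""
    H = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = max(
                0.0,                                        # restart (local)
                H[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # match/mismatch
                H[i - 1][j] + gap,                          # gap in b
                H[i][j - 1] + gap,                          # gap in a
            )
            best = max(best, H[i][j])
    return best

# Hypothetical similarity: +1 for any shared-word overlap, -1 otherwise.
# A modern embedding similarity would slot in here unchanged.
def sim(x, y):
    return 1.0 if set(x.split()) & set(y.split()) else -1.0

book = ["the ship sailed", "a storm rose", "they reached port"]
summary = ["ship sailed", "reached port"]
print(smith_waterman(book, summary, sim))  # → 1.5
```

In the paper's setting, scores like this computed over unrelated document pairs form the background distribution, which is fit to a Gumbel distribution to assign a p-value to any observed alignment.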