cs.CL - 2023-11-20

Unifying Corroborative and Contributive Attributions in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.12233
  • repo_url: None
  • paper_authors: Theodora Worledge, Judy Hanwen Shen, Nicole Meister, Caleb Winston, Carlos Guestrin
  • for: To provide a unified attribution framework for large language models that covers existing attribution methods of different types, including citation generation and training data attribution.
  • methods: The study surveys existing attribution methods and integrates them into a single unified framework.
  • results: The resulting unified framework covers existing attribution methods of different types and can be used to analyze real-world use cases that require one or both kinds of attribution.
    Abstract As businesses, products, and services spring up around large language models, the trustworthiness of these models hinges on the verifiability of their outputs. However, methods for explaining language model outputs largely fall across two distinct fields of study which both use the term "attribution" to refer to entirely separate techniques: citation generation and training data attribution. In many modern applications, such as legal document generation and medical question answering, both types of attributions are important. In this work, we argue for and present a unified framework of large language model attributions. We show how existing methods of different types of attribution fall under the unified framework. We also use the framework to discuss real-world use cases where one or both types of attributions are required. We believe that this unified framework will guide the use case driven development of systems that leverage both types of attribution, as well as the standardization of their evaluation.

Leveraging Closed-Access Multilingual Embedding for Automatic Sentence Alignment in Low Resource Languages

  • paper_url: http://arxiv.org/abs/2311.12179
  • repo_url: https://github.com/abumafrim/cohere-align
  • paper_authors: Idris Abdulmumin, Auwal Abubakar Khalid, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Lukman Jibril Aliyu, Babangida Sani, Bala Mairiga Abduljalil, Sani Ahmad Hassan
  • for: To improve machine translation quality by building high-quality parallel datasets for training translation models.
  • methods: The study leverages the closed-access Cohere multilingual embedding and develops a simple yet effective parallel sentence aligner (a sketch of the idea follows the abstract below).
  • results: The method achieves F1 scores of $94.96$ and $54.83$ on the FLORES and MAFAND-MT datasets, compared to $3.64$ and $0.64$ for LASER, and yields an improvement of more than 5 BLEU points over LASER when the resulting data is used to train translation models.
    Abstract The importance of qualitative parallel data in machine translation has long been determined but it has always been very difficult to obtain such in sufficient quantity for the majority of world languages, mainly because of the associated cost and also the lack of accessibility to these languages. Despite the potential for obtaining parallel datasets from online articles using automatic approaches, forensic investigations have found a lot of quality-related issues such as misalignment, and wrong language codes. In this work, we present a simple but qualitative parallel sentence aligner that carefully leveraged the closed-access Cohere multilingual embedding, a solution that ranked second in the just concluded #CoHereAIHack 2023 Challenge (see https://ai6lagos.devpost.com). The proposed approach achieved $94.96$ and $54.83$ f1 scores on FLORES and MAFAND-MT, compared to $3.64$ and $0.64$ of LASER respectively. Our method also achieved an improvement of more than 5 BLEU scores over LASER, when the resulting datasets were used with MAFAND-MT dataset to train translation models. Our code and data are available for research purposes here (https://github.com/abumafrim/Cohere-Align).
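
A minimal sketch of the embedding-plus-similarity alignment idea, assuming a generic `embed` callable stands in for the closed-access Cohere multilingual embedding; the greedy cosine-similarity matching and the threshold are illustrative choices, not necessarily the authors' exact algorithm.

```python
import numpy as np

def align_sentences(src_sents, tgt_sents, embed, threshold=0.7):
    """Greedily pair source/target sentences by cosine similarity of
    their multilingual embeddings. `embed` maps a list of strings to an
    (n, d) array; here it stands in for a multilingual embedding API."""
    src = embed(src_sents)  # (n_src, d)
    tgt = embed(tgt_sents)  # (n_tgt, d)
    # Normalize rows so the dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T
    pairs, used = [], set()
    for i in np.argsort(-sims.max(axis=1)):  # most confident sources first
        j = int(sims[i].argmax())
        if sims[i, j] >= threshold and j not in used:
            used.add(j)
            pairs.append((src_sents[i], tgt_sents[j], float(sims[i, j])))
    return pairs
```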

Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney

  • paper_url: http://arxiv.org/abs/2311.12131
  • repo_url: https://github.com/shachardon/mid-journey-to-alignment
  • paper_authors: Shachar Don-Yehiya, Leshem Choshen, Omri Abend
  • for: To study how users iteratively update their prompts, based on the output images, over multiple attempts at generating an image with a text-to-image model.
  • methods: The study compiles a dataset of iterative interactions between human users and Midjourney and analyzes the dynamics of the user prompts across these iterations.
  • results: Prompts predictably converge toward specific traits across iterations; this convergence stems both from users realizing they missed important details and from adaptation to the model's "preferences" for a specific language style, which raises concerns about reusing user data for further training.
    Abstract Generating images with a Text-to-Image model often requires multiple trials, where human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of the user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midjourney. Our analysis then reveals that prompts predictably converge toward specific traits along these iterations. We further study whether this convergence is due to human users, realizing they missed important details, or due to adaptation to the model's ``preferences'', producing better images for a specific language style. We show initial evidence that both possibilities are at play. The possibility that users adapt to the model's preference raises concerns about reusing user data for further training. The prompts may be biased towards the preferences of a specific model, rather than align with human intentions and natural manner of expression.

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

  • paper_url: http://arxiv.org/abs/2311.12023
  • repo_url: https://github.com/hanguo97/lq-lora
  • paper_authors: Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim
  • for: To propose a simple approach for memory-efficient finetuning of pretrained language models.
  • methods: An iterative algorithm decomposes each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component; during finetuning, the quantized component remains fixed and only the low-rank component is updated. An integer linear programming formulation of the quantization step dynamically configures quantization parameters (e.g., bit-width, block size) for each matrix under an overall memory budget (a sketch of the alternating decomposition follows the abstract below).
  • results: The low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables more aggressive quantization. For example, on the OpenAssistant benchmark, LQ-LoRA learns a 2.5-bit LLaMA-2 model that is competitive with a 4-bit QLoRA baseline. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression: a 2.75-bit LLaMA-2-70B model (2.85 bits on average including the low-rank components, requiring 27GB of GPU memory) is competitive with the original full-precision model.
    Abstract We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on adapting RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and moreover enables more aggressive quantization. For example, on the OpenAssistant benchmark LQ-LoRA is able to learn a 2.5-bit LLaMA-2 model that is competitive with a model finetuned with 4-bit QLoRA. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) is competitive with the original model in full precision.
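
The decomposition can be sketched as alternating between quantizing the residual and refitting the low-rank component by truncated SVD. The uniform per-block quantizer below is a simplified stand-in; the paper's actual quantization scheme, ILP-based configuration, and Fisher-weighted variant are not reproduced here.

```python
import numpy as np

def quantize_dequantize(W, bits=2, block=64):
    """Simulated round-trip through a uniform per-block quantizer
    (a simplified stand-in; W.size must be divisible by `block`)."""
    flat = W.reshape(-1, block)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((flat - lo) / scale)
    return (q * scale + lo).reshape(W.shape)

def lq_decompose(W, rank=8, bits=2, iters=10):
    """Alternate: quantize the residual W - L, then refit L as the
    best rank-r approximation of W - Q (truncated SVD)."""
    L = np.zeros_like(W)
    for _ in range(iters):
        Q = quantize_dequantize(W - L, bits=bits)
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return L, Q  # finetune L (low-rank, high precision); keep Q frozen

W = np.random.randn(256, 256).astype(np.float32)
L, Q = lq_decompose(W)
print("relative reconstruction error:",
      np.linalg.norm(W - (L + Q)) / np.linalg.norm(W))
```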

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

  • paper_url: http://arxiv.org/abs/2311.12015
  • repo_url: None
  • paper_authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
  • for: To enhance a general-purpose vision-language model so that it can better drive robotic manipulation.
  • methods: The system analyzes videos of humans performing tasks and creates executable robot programs that incorporate affordance insights: GPT-4V converts environmental and action details into text, a GPT-4-empowered task planner produces the plan, and vision systems then reanalyze the video to ground object names and detect when and how objects are grasped and released.
  • results: Experiments across various scenarios show that the method can turn human demonstrations into real robot operations in a zero-shot manner, without additional training.
    Abstract We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), by integrating observations of human actions to facilitate robotic manipulation. This system analyzes videos of humans performing tasks and creates executable robot programs that incorporate affordance insights. The computation starts by analyzing the videos with GPT-4V to convert environmental and action details into text, followed by a GPT-4-empowered task planner. In the following analyses, vision systems reanalyze the video with the task plan. Object names are grounded using an open-vocabulary object detector, while focus on the hand-object relation helps to detect the moment of grasping and releasing. This spatiotemporal grounding allows the vision systems to further gather affordance data (e.g., grasp type, way points, and body postures). Experiments across various scenarios demonstrate this method's efficacy in achieving real robots' operations from human demonstrations in a zero-shot manner. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/

H-COAL: Human Correction of AI-Generated Labels for Biomedical Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2311.11981
  • repo_url: None
  • paper_authors: Xiaojing Duan, John P. Lalor
  • for: To address the accuracy of AI-generated labels by proposing a new framework for Human Correction of AI-generated Labels (H-COAL).
  • methods: The framework ranks AI-generated outputs so that labels can be corrected selectively, approaching gold-standard performance (100% human labeling) with far less human effort (a sketch of the selection step follows the abstract below).
  • results: Correcting 5% of labels closes the AI-human performance gap by up to 64% relative improvement, and correcting 20% of labels closes the gap by up to 86% relative improvement.
    Abstract With the rapid advancement of machine learning models for NLP tasks, collecting high-fidelity labels from AI models is a realistic possibility. Firms now make AI available to customers via predictions as a service (PaaS). This includes PaaS products for healthcare. It is unclear whether these labels can be used for training a local model without expensive annotation checking by in-house experts. In this work, we propose a new framework for Human Correction of AI-Generated Labels (H-COAL). By ranking AI-generated outputs, one can selectively correct labels and approach gold standard performance (100% human labeling) with significantly less human effort. We show that correcting 5% of labels can close the AI-human performance gap by up to 64% relative improvement, and correcting 20% of labels can close the performance gap by up to 86% relative improvement.
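
The selection step can be sketched as ranking AI-generated labels and routing only a fixed budget of them to human annotators. Using the model's own confidence as the ranking score is an assumption for illustration; the paper's exact ranking criterion may differ.

```python
import numpy as np

def select_for_human_correction(confidences, budget=0.05):
    """Return indices of the examples whose AI labels should be reviewed
    by a human, given a correction budget (e.g., 5% or 20%). Here the
    ranking score is the model's confidence in its own label, so the
    least confident predictions are sent for correction."""
    n = len(confidences)
    k = max(1, int(round(n * budget)))
    return np.argsort(confidences)[:k]

conf = np.array([0.99, 0.42, 0.87, 0.55, 0.93, 0.61, 0.98, 0.34, 0.76, 0.88])
print(select_for_human_correction(conf, budget=0.2))  # -> [7 1]
```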

On the Potential and Limitations of Few-Shot In-Context Learning to Generate Metamorphic Specifications for Tax Preparation Software

  • paper_url: http://arxiv.org/abs/2311.11979
  • repo_url: None
  • paper_authors: Dananjay Srinivas, Rohan Das, Saeid Tizpaz-Niari, Ashutosh Trivedi, Maria Leonor Pacheco
  • for: To improve the correctness of tax preparation software, where filing errors can have legal consequences for taxpayers.
  • methods: The study uses metamorphic testing to test and debug legal-critical tax software, and formulates the generation of metamorphic specifications as a translation task from natural-language properties extracted from tax documents to a contrastive first-order logic form (a worked example follows the abstract below).
  • results: The paper systematically analyzes the potential and limitations of few-shot in-context learning with LLMs for this task, which can reduce manual effort, and outlines a research agenda toward automating the generation of metamorphic specifications.
    Abstract Due to the ever-increasing complexity of income tax laws in the United States, the number of US taxpayers filing their taxes using tax preparation software (henceforth, tax software) continues to increase. According to the U.S. Internal Revenue Service (IRS), in FY22, nearly 50% of taxpayers filed their individual income taxes using tax software. Given the legal consequences of incorrectly filing taxes for the taxpayer, ensuring the correctness of tax software is of paramount importance. Metamorphic testing has emerged as a leading solution to test and debug legal-critical tax software due to the absence of correctness requirements and trustworthy datasets. The key idea behind metamorphic testing is to express the properties of a system in terms of the relationship between one input and its slightly metamorphosed twinned input. Extracting metamorphic properties from IRS tax publications is a tedious and time-consuming process. As a response, this paper formulates the task of generating metamorphic specifications as a translation task between properties extracted from tax documents - expressed in natural language - to a contrastive first-order logic form. We perform a systematic analysis on the potential and limitations of in-context learning with Large Language Models(LLMs) for this task, and outline a research agenda towards automating the generation of metamorphic specifications for tax preparation software.
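
As a concrete illustration of the target representation, a metamorphic property such as "raising a filer's income, all else held equal, must not lower the tax owed" could be written in first-order logic roughly as follows (a hypothetical example, not one taken from the paper):

```latex
% Hypothetical metamorphic relation for tax software:
% increasing income, with every other field unchanged,
% must not decrease the computed tax.
\forall x \, \forall x' \;
  \Big( \mathrm{income}(x') \geq \mathrm{income}(x)
        \;\wedge\; \bigwedge_{f \neq \mathrm{income}} f(x') = f(x) \Big)
  \;\rightarrow\; \mathrm{tax}(x') \geq \mathrm{tax}(x)
```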

Context-aware Neural Machine Translation for English-Japanese Business Scene Dialogues

  • paper_url: http://arxiv.org/abs/2311.11976
  • repo_url: https://github.com/su0315/discourse_context_mt
  • paper_authors: Sumire Honda, Patrick Fernandes, Chrysoula Zerva
  • for: To improve the performance of current neural machine translation (NMT) models on English-Japanese business dialogue translation.
  • methods: The authors finetune a pretrained mBART model on multi-sentence dialogue data, which allows them to experiment with different context sizes and with novel context tokens encoding extra-sentential information such as speaker turn and scene type.
  • results: The models leverage both preceding sentences and extra-sentential context (CXMI increases with context size; a sketch of the CXMI estimate follows the abstract below), and adding source-side context together with scene and speaker information improves translation quality as measured by BLEU and COMET.
    Abstract Despite the remarkable advancements in machine translation, the current sentence-level paradigm faces challenges when dealing with highly-contextual languages like Japanese. In this paper, we explore how context-awareness can improve the performance of the current Neural Machine Translation (NMT) models for English-Japanese business dialogues translation, and what kind of context provides meaningful information to improve translation. As business dialogue involves complex discourse phenomena but offers scarce training resources, we adapted a pretrained mBART model, finetuning on multi-sentence dialogue data, which allows us to experiment with different contexts. We investigate the impact of larger context sizes and propose novel context tokens encoding extra-sentential information, such as speaker turn and scene type. We make use of Conditional Cross-Mutual Information (CXMI) to explore how much of the context the model uses and generalise CXMI to study the impact of the extra-sentential context. Overall, we find that models leverage both preceding sentences and extra-sentential context (with CXMI increasing with context size) and we provide a more focused analysis on honorifics translation. Regarding translation quality, increased source-side context paired with scene and speaker information improves the model performance compared to previous work and our context-agnostic baselines, measured in BLEU and COMET metrics.
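
Conditional cross-mutual information compares a context-aware model against a context-agnostic one on the same references. A sketch of the estimate, assuming each model exposes a hypothetical `logprob(reference, input)` scoring method (not a real library call):

```python
def cxmi(examples, model_ctx, model_noctx):
    """Estimate CXMI = H(Y|X) - H(Y|X, C) as the average gain in target
    log-probability when the context C is provided. Each example is a
    (source, context, reference) triple; `logprob` is a hypothetical
    per-example scorer returning log p(reference | input)."""
    gains = []
    for src, ctx, ref in examples:
        lp_with = model_ctx.logprob(ref, ctx + " " + src)
        lp_without = model_noctx.logprob(ref, src)
        gains.append(lp_with - lp_without)
    return sum(gains) / len(gains)  # > 0 means the model uses the context
```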

Adaptive Training Distributions with Scalable Online Bilevel Optimization

  • paper_url: http://arxiv.org/abs/2311.11973
  • repo_url: None
  • paper_authors: David Grangier, Pierre Ablin, Awni Hannun
  • for: To address the mismatch between the pretraining distribution of large neural networks and the distribution of the application domain.
  • methods: The authors propose an algorithm, motivated by a formulation of the setting as an online bilevel optimization problem, for the case where a small sample of data reflects the targeted test conditions. For scalability, the algorithm prioritizes computing gradients at training points that are likely to most improve the loss on the targeted distribution (a sketch of this scoring follows the abstract below).
  • results: Empirically, the approach outperforms existing domain adaptation techniques in some cases but not in others; the paper proposes a simple test to evaluate when the approach can be expected to work well and points toward further research to address current limitations.
    Abstract Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well and point towards further research to address current limitations.
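
One way to instantiate the prioritization idea is to score each training point by how well its gradient aligns with the gradient of the loss on the small target sample. A naive per-example sketch in PyTorch; the paper's actual online bilevel update is not reproduced here.

```python
import torch

def gradient_alignment_scores(model, loss_fn, train_batch, target_batch):
    """Score each training example by the dot product between its own
    gradient and the gradient of the loss on the small target sample.
    High scores mark points likely to most improve the target loss,
    which can then be sampled more often."""
    # Gradient of the target-sample loss (the "outer" objective).
    x_t, y_t = target_batch
    model.zero_grad()
    loss_fn(model(x_t), y_t).backward()
    g_target = torch.cat([p.grad.flatten() for p in model.parameters()])

    x, y = train_batch
    scores = []
    for i in range(len(x)):  # naive per-example loop, for clarity
        model.zero_grad()
        loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
        g_i = torch.cat([p.grad.flatten() for p in model.parameters()])
        scores.append(torch.dot(g_i, g_target).item())
    return scores
```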

Automatic Analysis of Substantiation in Scientific Peer Reviews

  • paper_url: http://arxiv.org/abs/2311.11967
  • repo_url: None
  • paper_authors: Yanzhu Guo, Guokan Shang, Virgile Rennard, Michalis Vazirgiannis, Chloé Clavel
  • for: Automatic quality control of problematic peer reviews at top AI conferences.
  • methods: The problem is formulated as claim-evidence pair extraction in scientific peer reviews; the authors collect SubstanReview, the first annotated dataset for this task, and train an argument mining system to automatically analyze the level of substantiation in reviews.
  • results: Data analysis on the SubstanReview dataset yields meaningful insights into peer reviewing quality at NLP conferences in recent years.
    Abstract With the increasing amount of problematic peer reviews in top AI conferences, the community is urgently in need of automatic quality control measures. In this paper, we restrict our attention to substantiation -- one popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence -- and provide a solution automatizing this evaluation process. To achieve this goal, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews, and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. On the basis of this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also perform data analysis on the SubstanReview dataset to obtain meaningful insights on peer reviewing quality in NLP conferences over recent years.

LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

  • paper_url: http://arxiv.org/abs/2311.11904
  • repo_url: None
  • paper_authors: Songhao Han, Le Zhuo, Yue Liao, Si Liu
  • for: To improve the accuracy and interpretability of image classification, the paper proposes a new framework, Iterative Optimization with Visual Feedback.
  • methods: The method uses a large language model (LLM) as an agent to generate class descriptors and refines them with an evolutionary optimization strategy; visual feedback from VLM classification metrics grounds the optimization in concrete visual data (a sketch of the loop follows the abstract below).
  • results: Experiments on a wide range of image classification benchmarks show an average gain of 3.47% over state-of-the-art methods; the resulting descriptors also serve as explainable and robust features that consistently improve performance across different backbone models.
    Abstract Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advancements in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and inaccuracy. We identify two primary causes: 1) The prevalent reliance on textual interactions with LLMs, leading to a mismatch between the generated text and the visual content in VLMs' latent space - a phenomenon we term the "explain without seeing" dilemma. 2) The oversight of the inter-class relationships, resulting in descriptors that fail to differentiate similar classes effectively. To address these issues, we propose a novel image classification framework combining VLMs with LLMs, named Iterative Optimization with Visual Feedback. In particular, our method develops an LLM-based agent, employing an evolutionary optimization strategy to refine class descriptors. Crucially, we incorporate visual feedback from VLM classification metrics, thereby guiding the optimization process with concrete visual data. Our method leads to improving accuracy on a wide range of image classification benchmarks, with 3.47\% average gains over state-of-the-art methods. We also highlight the resulting descriptions serve as explainable and robust features that can consistently improve the performance across various backbone models.
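
The optimization loop can be sketched as an evolutionary search over class descriptors with VLM accuracy as the fitness signal; `llm_mutate` and `vlm_accuracy` below are hypothetical callables standing in for the LLM agent and the VLM-based visual feedback:

```python
import random

def optimize_descriptors(init_descriptors, llm_mutate, vlm_accuracy,
                         generations=10, population=8, keep=2):
    """Evolve class descriptors: grow a population of candidate descriptor
    sets via LLM-proposed mutations, score each set by VLM classification
    accuracy on held-out images (the visual feedback), and keep the best."""
    pool = [init_descriptors]
    for _ in range(generations):
        while len(pool) < population:
            pool.append(llm_mutate(random.choice(pool[:keep])))
        pool.sort(key=vlm_accuracy, reverse=True)  # fitness = visual feedback
        pool = pool[:keep]  # survivors seed the next generation
    return pool[0]
```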

Evil Geniuses: Delving into the Safety of LLM-based Agents

  • paper_url: http://arxiv.org/abs/2311.11855
  • repo_url: https://github.com/t1ans1r/evil-geniuses
  • paper_authors: Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, Hang Su
  • for: To probe the safety of LLM-based agents.
  • methods: The authors conduct a series of manual jailbreak prompts together with a virtual chat-powered evil plan development team, dubbed Evil Geniuses, to systematically probe and evaluate the safety of these agents.
  • results: The investigation reveals three notable phenomena: 1) LLM-based agents exhibit reduced robustness against malicious attacks; 2) attacked agents provide more nuanced responses; 3) the improper responses they produce are harder to detect. These findings highlight vulnerabilities at various levels and within different role specializations of LLM-based agent systems.
    Abstract The rapid advancements in large language models (LLMs) have led to a resurgence in LLM-based agents, which demonstrate impressive human-like behaviors and cooperative capabilities in various interactions and strategy formulations. However, evaluating the safety of LLM-based agents remains a complex challenge. This paper elaborately conducts a series of manual jailbreak prompts along with a virtual chat-powered evil plan development team, dubbed Evil Geniuses, to thoroughly probe the safety aspects of these agents. Our investigation reveals three notable phenomena: 1) LLM-based agents exhibit reduced robustness against malicious attacks. 2) the attacked agents could provide more nuanced responses. 3) the detection of the produced improper responses is more challenging. These insights prompt us to question the effectiveness of LLM-based attacks on agents, highlighting vulnerabilities at various levels and within different role specializations within the system/agent of LLM-based agents. Extensive evaluation and discussion reveal that LLM-based agents face significant challenges in safety and yield insights for future research. Our code is available at https://github.com/T1aNS1R/Evil-Geniuses.

Deepparse : An Extendable, and Fine-Tunable State-Of-The-Art Library for Parsing Multinational Street Addresses

  • paper_url: http://arxiv.org/abs/2311.11846
  • repo_url: None
  • paper_authors: David Beauchemin, Marouane Yassine
  • for: To provide an open-source, extendable, fine-tunable address parsing solution for multinational addresses.
  • methods: Deepparse uses state-of-the-art deep learning algorithms for address parsing and is evaluated on data from over 60 countries; it can parse addresses written in any language and following any address standard (a usage example follows the abstract below).
  • results: The pre-trained model achieves an average parsing accuracy of 99% on the countries used for training, with no pre-processing or post-processing required; the library also supports fine-tuning with new data to build custom address parsers.
    Abstract Segmenting an address into meaningful components, also known as address parsing, is an essential step in many applications from record linkage to geocoding and package delivery. Consequently, a lot of work has been dedicated to develop accurate address parsing techniques, with machine learning and neural network methods leading the state-of-the-art scoreboard. However, most of the work on address parsing has been confined to academic endeavours with little availability of free and easy-to-use open-source solutions. This paper presents Deepparse, a Python open-source, extendable, fine-tunable address parsing solution under LGPL-3.0 licence to parse multinational addresses using state-of-the-art deep learning algorithms and evaluated on over 60 countries. It can parse addresses written in any language and use any address standard. The pre-trained model achieves average $99~\%$ parsing accuracies on the countries used for training with no pre-processing nor post-processing needed. Moreover, the library supports fine-tuning with new data to generate a custom address parser.
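
Deepparse is pip-installable; a minimal usage example following the library's documented interface (the API may have evolved, so check the current documentation):

```python
# pip install deepparse
from deepparse.parser import AddressParser

# Load a pre-trained multinational parser; weights download on first use.
address_parser = AddressParser(model_type="bpemb")

# Parse one address (a list of addresses works the same way).
parsed = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")
print(parsed)  # components tagged as StreetNumber, StreetName, Municipality, ...
```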

How to Use Large Language Models for Text Coding: The Case of Fatherhood Roles in Public Policy Documents

  • paper_url: http://arxiv.org/abs/2311.11844
  • repo_url: None
  • paper_authors: Lorenzo Lupo, Oscar Magnusson, Dirk Hovy, Elin Naurin, Lena Wängnerud
  • for: To evaluate large language models (LLMs) for text analysis in political science and show how to use LLMs for text coding.
  • methods: The study evaluates LLMs on three original coding tasks over non-English political science texts and provides a detailed description of a general workflow for using LLMs for text coding in political science research (a minimal prompt-construction sketch follows the abstract below).
  • results: When provided with detailed label definitions and coding examples, an LLM can match or even exceed a human annotator while being much faster (up to hundreds of times), considerably cheaper (up to 60% less than human coding), and much easier to scale to large amounts of text. Overall, LLMs are a viable option for most text coding projects.
    Abstract Recent advances in large language models (LLMs) like GPT-3 and GPT-4 have opened up new opportunities for text analysis in political science. They promise automation with better results and less programming. In this study, we evaluate LLMs on three original coding tasks of non-English political science texts, and we provide a detailed description of a general workflow for using LLMs for text coding in political science research. Our use case offers a practical guide for researchers looking to incorporate LLMs into their research on text analysis. We find that, when provided with detailed label definitions and coding examples, an LLM can be as good as or even better than a human annotator while being much faster (up to hundreds of times), considerably cheaper (costing up to 60% less than human coding), and much easier to scale to large amounts of text. Overall, LLMs present a viable option for most text coding projects.
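
The workflow boils down to assembling a prompt from detailed label definitions plus coded examples and asking the model for a label. A minimal, model-agnostic sketch; the label scheme and wording below are illustrative, not the authors':

```python
def build_coding_prompt(label_definitions, examples, text):
    """Assemble a text-coding prompt: detailed label definitions,
    a few coded examples, then the text to be coded."""
    lines = ["You are coding political science texts.", "", "Labels:"]
    for label, definition in label_definitions.items():
        lines.append(f"- {label}: {definition}")
    lines += ["", "Examples:"]
    for ex_text, ex_label in examples:
        lines.append(f'Text: "{ex_text}"\nLabel: {ex_label}')
    lines += ["", f'Text: "{text}"\nLabel:']
    return "\n".join(lines)

# Hypothetical labels in the spirit of the paper's fatherhood-role use case.
prompt = build_coding_prompt(
    {"active": "the father is described as a caregiver",
     "passive": "the father is mentioned without a caregiving role"},
    [("The father takes parental leave to care for the child.", "active")],
    "The policy mentions fathers only as income providers.",
)
print(prompt)  # send to any chat/completions API and parse the returned label
```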

Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule

  • paper_url: http://arxiv.org/abs/2311.11813
  • repo_url: None
  • paper_authors: Andrey Bout, Alexander Podolskiy, Sergey Nikolenko, Irina Piontkovskaya
  • for: Progress in neural grammatical error correction (GEC) is hindered by the lack of high-quality manually annotated training data; this work aims to use the available data more efficiently.
  • methods: The authors propose two ideas: auxiliary multi-task training objectives that exploit the alignment between original and corrected sentences (e.g., predicting a sequence of corrections), and an optimized training schedule over datasets and even individual instances within a dataset.
  • results: Together, these ideas yield significant improvements: a BART-based model (400M parameters) outperforms the best models based on T5-XXL (11B parameters).
    Abstract Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic data, pretraining on it, and then fine-tuning on real datasets; performance gains have been achieved either by ensembling or by using huge pretrained models such as XXL-T5 as the backbone. In this work, we explore an orthogonal direction: how to use available data more efficiently. First, we propose auxiliary tasks that exploit the alignment between the original and corrected sentences, such as predicting a sequence of corrections. We formulate each task as a sequence-to-sequence problem and perform multi-task training. Second, we discover that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance, so we set out to find the best training schedule. Together, these two ideas lead to significant improvements, producing results that improve state of the art with much smaller models; in particular, we outperform the best models based on T5-XXL (11B parameters) with a BART-based model (400M parameters).

Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis

  • paper_url: http://arxiv.org/abs/2311.11745
  • repo_url: None
  • paper_authors: Jungil Kong, Junmo Lee, Jeongmin Kim, Beomjeong Kim, Jihoon Park, Dohee Kong, Changheon Lee, Sangjin Kim
  • for: To model numerous speakers so that a speech synthesis model can express each speaker's overall characteristics in detail (e.g., timbre, speaking rate, intonation), like a trained multi-speaker model, without additional training on the target speaker's dataset.
  • methods: The proposed method learns effective features, discretizes them, and conditions them to a speech synthesis model to represent the target speaker's speech characteristics.
  • results: The method achieves a significantly higher similarity mean opinion score (SMOS) than seen speakers of a best-performing multi-speaker model, even for unseen speakers, and outperforms a zero-shot method by significant margins. It also generates new artificial speakers well, and the encoded latent features are informative enough to fully reconstruct an original speaker's speech.
    Abstract In this work, we propose a novel method for modeling numerous speakers, which enables expressing the overall characteristics of speakers in detail like a trained multi-speaker model without additional training on the target speaker's dataset. Although various works with similar purposes have been actively studied, their performance has not yet reached that of trained multi-speaker models due to their fundamental limitations. To overcome previous limitations, we propose effective methods for feature learning and representing target speakers' speech characteristics by discretizing the features and conditioning them to a speech synthesis model. Our method obtained a significantly higher similarity mean opinion score (SMOS) in subjective similarity evaluation than seen speakers of a best-performing multi-speaker model, even with unseen speakers. The proposed method also outperforms a zero-shot method by significant margins. Furthermore, our method shows remarkable performance in generating new artificial speakers. In addition, we demonstrate that the encoded latent features are sufficiently informative to reconstruct an original speaker's speech completely. It implies that our method can be used as a general methodology to encode and reconstruct speakers' characteristics in various tasks.

Addressing the Length Bias Problem in Document-Level Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2311.11601
  • repo_url: https://github.com/salvation-z/D2DToolkits
  • paper_authors: Zhuocheng Zhang, Shuhao Gu, Min Zhang, Yang Feng
  • for: To address the length bias problem in document-level neural machine translation (DNMT) and improve translation quality.
  • methods: The authors improve the DNMT model's training method (dynamic sampling across sequence lengths), attention mechanism (length-normalized attention), and decoding strategy (a sliding window over context; a sketch follows the abstract below).
  • results: Experiments show significant improvements on several open datasets, and further analysis confirms that the method effectively alleviates the length bias problem.
    Abstract Document-level neural machine translation (DNMT) has shown promising results by incorporating more context information. However, this approach also introduces a length bias problem, whereby DNMT suffers from significant translation quality degradation when decoding documents that are much shorter or longer than the maximum sequence length during training. %i.e., the length bias problem. To solve the length bias problem, we propose to improve the DNMT model in training method, attention mechanism, and decoding strategy. Firstly, we propose to sample the training data dynamically to ensure a more uniform distribution across different sequence lengths. Then, we introduce a length-normalized attention mechanism to aid the model in focusing on target information, mitigating the issue of attention divergence when processing longer sequences. Lastly, we propose a sliding window strategy during decoding that integrates as much context information as possible without exceeding the maximum sequence length. The experimental results indicate that our method can bring significant improvements on several open datasets, and further analysis shows that our method can significantly alleviate the length bias problem.
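
The decoding strategy can be sketched as a sliding window that prefixes each source sentence with as many preceding sentences as fit under the maximum source length; `translate` and `length` are hypothetical hooks into the trained model and its tokenizer:

```python
def sliding_window_translate(sentences, translate, length, max_len):
    """Translate a document sentence by sentence, prefixing each source
    sentence with as much preceding context as fits in `max_len` tokens.
    `translate(context, sentence)` and `length(text)` are hypothetical
    hooks into the DNMT model and its tokenizer."""
    outputs = []
    for i, sent in enumerate(sentences):
        context = []
        budget = max_len - length(sent)
        for prev in reversed(sentences[:i]):  # most recent context first
            if length(prev) > budget:
                break
            context.insert(0, prev)
            budget -= length(prev)
        outputs.append(translate(" ".join(context), sent))
    return outputs
```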

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

  • paper_url: http://arxiv.org/abs/2311.11598
  • repo_url: https://github.com/thunlp-mt/fiig
  • paper_authors: Ziyue Wang, Chi Chen, Peng Li, Yang Liu
  • for: To improve the performance of large language models (LLMs) on visual question answering tasks by letting them better exploit image information.
  • methods: A framework that enables the LLM to proactively ask relevant questions to unveil more details in the image, together with filters for refining the generated information.
  • results: The method consistently improves baseline methods on OK-VQA and A-OKVQA, with an average gain of 2.15% on OK-VQA, and achieves consistent improvements across different LLMs.
    Abstract Large Language Models (LLMs) demonstrate impressive reasoning ability and the maintenance of world knowledge not only in natural language tasks, but also in some vision-language tasks such as open-domain knowledge-based visual question answering (OK-VQA). As images are invisible to LLMs, researchers convert images to text to engage LLMs into the visual question reasoning procedure. This leads to discrepancies between images and their textual representations presented to LLMs, which consequently impedes final reasoning performance. To fill the information gap and better leverage the reasoning capability, we design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image, along with filters for refining the generated information. We validate our idea on OK-VQA and A-OKVQA. Our method continuously boosts the performance of baselines methods by an average gain of 2.15% on OK-VQA, and achieves consistent improvements across different LLMs.

How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction

  • paper_url: http://arxiv.org/abs/2311.11583
  • repo_url: https://github.com/mohanraj-nlp/chatgpt-malaysian-english
  • paper_authors: Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam
  • for: To evaluate ChatGPT's entity extraction and relation extraction capabilities on Malaysian English news text (the MEN dataset).
  • methods: A three-step methodology, educate-predict-evaluate, with performance assessed via F1-score across 18 carefully engineered prompt settings.
  • results: ChatGPT does not perform well at extracting entities from Malaysian English news articles, with a highest F1-score of 0.497. Further analysis shows that the morphosyntactic adaptation in Malaysian English causes this limitation; interestingly, it does not affect ChatGPT's relation extraction performance.
    Abstract Recently, ChatGPT has attracted a lot of interest from both researchers and the general public. While the performance of ChatGPT in named entity recognition and relation extraction from Standard English texts is satisfactory, it remains to be seen if it can perform similarly for Malaysian English. Malaysian English is unique as it exhibits morphosyntactic and semantical adaptation from local contexts. In this study, we assess ChatGPT's capability in extracting entities and relations from the Malaysian English News (MEN) dataset. We propose a three-step methodology referred to as \textbf{\textit{educate-predict-evaluate}. The performance of ChatGPT is assessed using F1-Score across 18 unique prompt settings, which were carefully engineered for a comprehensive review. From our evaluation, we found that ChatGPT does not perform well in extracting entities from Malaysian English news articles, with the highest F1-Score of 0.497. Further analysis shows that the morphosyntactic adaptation in Malaysian English caused the limitation. However, interestingly, this morphosyntactic adaptation does not impact the performance of ChatGPT for relation extraction.

KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model

  • paper_url: http://arxiv.org/abs/2311.11564
  • repo_url: https://github.com/ngwlh-gl/kbioxlm
  • paper_authors: Lei Geng, Xu Yan, Ziqiang Cao, Juntao Li, Wenjie Li, Sujian Li, Xinjie Zhou, Yang Yang, Jun Zhang
  • for: To give biomedical pretrained language models multilingual capability.
  • methods: The proposed model, KBioXLM, transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach: three granularities of knowledge alignment (entity, fact, and passage levels) are incorporated into monolingual corpora, and three corresponding training tasks (entity masking, relation masking, and passage relation prediction) continue training on top of XLM-R to enhance its cross-lingual domain ability.
  • results: Translating the English benchmarks of multiple tasks into Chinese for evaluation, the model significantly outperforms monolingual and multilingual pretrained models in cross-lingual zero-shot and few-shot scenarios, with improvements of up to 10+ points.
    Abstract Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements. The scarcity of non-English domain corpora, not to mention parallel data, poses a significant hurdle in training multilingual biomedical models. Since knowledge forms the core of domain-specific corpora and can be translated into various languages accurately, we propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach. We achieve a biomedical multilingual corpus by incorporating three granularity knowledge alignments (entity, fact, and passage levels) into monolingual corpora. Then we design three corresponding training tasks (entity masking, relation masking, and passage relation prediction) and continue training on top of the XLM-R model to enhance its domain cross-lingual ability. To validate the effectiveness of our model, we translate the English benchmarks of multiple tasks into Chinese. Experimental results demonstrate that our model significantly outperforms monolingual and multilingual pretrained models in cross-lingual zero-shot and few-shot scenarios, achieving improvements of up to 10+ points. Our code is publicly available at https://github.com/ngwlh-gl/KBioXLM.

Adapt in Contexts: Retrieval-Augmented Domain Adaptation via In-Context Learning

  • paper_url: http://arxiv.org/abs/2311.11551
  • repo_url: None
  • paper_authors: Quanyu Long, Wenya Wang, Sinno Jialin Pan
  • for: Unsupervised Domain Adaptation (UDA) of language models in an in-context learning setting: adapting LLMs from a source domain to a target domain without any target labels.
  • methods: The method retrieves a subset of cross-domain elements most similar to the query and elicits the language model to adapt in-context, learning the target domain distribution and the discriminative task signal simultaneously from the augmented cross-domain in-context examples; different prompting and training strategies account for different LM architectures (a sketch of the retrieval step follows the abstract below).
  • results: Extensive experiments on Sentiment Analysis (SA) and Named Entity Recognition (NER) tasks thoroughly study the effectiveness of in-context learning for domain transfer and demonstrate significant improvements over baseline models.
    Abstract Large language models (LLMs) have showcased their capability with few-shot inference known as in-context learning. However, in-domain demonstrations are not always readily available in real scenarios, leading to cross-domain in-context learning. Besides, LLMs are still facing challenges in long-tail knowledge in unseen and unfamiliar domains. The above limitations demonstrate the necessity of Unsupervised Domain Adaptation (UDA). In this paper, we study the UDA problem under an in-context learning setting to adapt language models from the source domain to the target domain without any target labels. The core idea is to retrieve a subset of cross-domain elements that are the most similar to the query, and elicit language model to adapt in an in-context manner by learning both target domain distribution and the discriminative task signal simultaneously with the augmented cross-domain in-context examples. We devise different prompting and training strategies, accounting for different LM architectures to learn the target distribution via language modeling. With extensive experiments on Sentiment Analysis (SA) and Named Entity Recognition (NER) tasks, we thoroughly study the effectiveness of ICL for domain transfer and demonstrate significant improvements over baseline models.
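
The retrieval step can be sketched as nearest-neighbor search over labeled source-domain examples in an embedding space, followed by prompt construction; `embed` stands in for any sentence encoder, and the prompt format is illustrative:

```python
import numpy as np

def retrieve_in_context_examples(query, source_pool, embed, k=4):
    """Pick the k labeled source-domain examples most similar to the
    unlabeled target-domain query, to serve as in-context demonstrations.
    `source_pool` is a list of (text, label) pairs; `embed` maps a list
    of strings to an (n, d) array."""
    texts = [t for t, _ in source_pool]
    pool_emb = embed(texts)
    q = embed([query])[0]
    sims = pool_emb @ q / (np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [source_pool[i] for i in top]

def build_prompt(demos, query):
    """Format retrieved demonstrations plus the query for a SA task."""
    parts = [f"Review: {t}\nSentiment: {y}" for t, y in demos]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)
```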

Multi-teacher Distillation for Multilingual Spelling Correction

  • paper_url: http://arxiv.org/abs/2311.11518
  • repo_url: None
  • paper_authors: Jingfen Zhang, Xuan Guo, Sravan Bodapati, Christopher Potts
  • for: Accurate spelling correction in modern search interfaces, especially for mobile devices and speech-to-text input, across all languages and locales.
  • methods: Multi-teacher distillation: a monolingual teacher model is trained for each language/locale, and these individual models are distilled into a single multilingual student model intended to serve all languages/locales (a sketch of the distillation loss follows the abstract below).
  • results: Experiments on open-source data and user data from a worldwide search service show that this yields highly effective spelling correction models that can meet the tight latency requirements of deployed services.
    Abstract Accurate spelling correction is a critical step in modern search interfaces, especially in an era of mobile devices and speech-to-text interfaces. For services that are deployed around the world, this poses a significant challenge for multilingual NLP: spelling errors need to be caught and corrected in all languages, and even in queries that use multiple languages. In this paper, we tackle this challenge using multi-teacher distillation. On our approach, a monolingual teacher model is trained for each language/locale, and these individual models are distilled into a single multilingual student model intended to serve all languages/locales. In experiments using open-source data as well as user data from a worldwide search service, we show that this leads to highly effective spelling correction models that can meet the tight latency requirements of deployed services.
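
The distillation objective can be sketched as routing each example to the teacher for its locale and mixing a soft-label KL term with the usual cross-entropy; the loss weighting and temperature below are conventional choices, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_logits, teacher_logits_by_locale,
                               locales, gold, alpha=0.5, T=2.0):
    """Distill each example from the monolingual teacher matching its
    locale: temperature-softened KL between teacher and student outputs,
    mixed with standard cross-entropy on the gold labels."""
    kl_terms = []
    for i, loc in enumerate(locales):
        teacher = teacher_logits_by_locale[loc][i]
        kl = F.kl_div(
            F.log_softmax(student_logits[i] / T, dim=-1),
            F.softmax(teacher / T, dim=-1),
            reduction="sum",
        ) * (T * T)
        kl_terms.append(kl)
    distill = torch.stack(kl_terms).mean()
    ce = F.cross_entropy(student_logits, gold)
    return alpha * distill + (1 - alpha) * ce
```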

Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

  • paper_url: http://arxiv.org/abs/2311.11509
  • repo_url: None
  • paper_authors: Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Vishy Swaminathan
  • for: To improve large language models' (LLMs') ability to recognize adversarial prompts, reducing their susceptibility to carefully curated malicious inputs.
  • methods: A token-level detection method that leverages the LLM's next-token probability predictions: the model's perplexity is measured and neighboring token information is incorporated to encourage detection of contiguous adversarial prompt sequences (a sketch of the token-level signal follows the abstract below).
  • results: Two methods are proposed: one classifies each token as part of an adversarial prompt or not, and the other estimates the probability that each token is part of an adversarial prompt.
    Abstract In recent years, Large Language Models (LLM) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that lead to undesirable outputs. The inherent vulnerability of LLMs stems from their input-output mechanisms, especially when presented with intensely out-of-distribution (OOD) inputs. This paper proposes a token-level detection method to identify adversarial prompts, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity and incorporate neighboring token information to encourage the detection of contiguous adversarial prompt sequences. As a result, we propose two methods: one that identifies each token as either being part of an adversarial prompt or not, and another that estimates the probability of each token being part of an adversarial prompt.
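
The token-level signal can be sketched as per-token negative log-likelihood under a causal LM, smoothed over a window of neighboring tokens so that contiguous high-surprisal spans, rather than isolated rare tokens, get flagged; GPT-2 via Hugging Face is used here purely for illustration:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def smoothed_token_surprisal(text, window=5):
    """Per-token negative log-likelihood under the LM, averaged over a
    sliding window of neighbors; contiguous spans of high smoothed
    surprisal are candidate adversarial tokens."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # NLL of each token under the distribution predicted from its prefix.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    # Incorporate neighboring-token information via a moving average.
    pad = window // 2
    x = F.pad(nll[None, None], (pad, pad), mode="replicate")
    smoothed = F.avg_pool1d(x, kernel_size=window, stride=1)[0, 0]
    return list(zip(tok.convert_ids_to_tokens(ids[0, 1:]), smoothed.tolist()))
```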

What’s left can’t be right – The remaining positional incompetence of contrastive vision-language models

  • paper_url: http://arxiv.org/abs/2311.11477
  • repo_url: None
  • paper_authors: Nils Hoehing, Ellen Rushe, Anthony Ventresque
  • for: To investigate possible causes of contrastive vision-language models' lack of spatial understanding.
  • methods: The authors analyze both datasets and the embedding space, focusing on simple left-right positional relations; they show the behavior is entirely predictable even with large-scale datasets and demonstrate that these relations can be taught using synthetic data.
  • results: Teaching left-right positional relations with synthetic data generalizes well to natural images, improving performance on left-right relations in Visual Genome Relations.
    Abstract Contrastive vision-language models like CLIP have been found to lack spatial understanding capabilities. In this paper we discuss the possible causes of this phenomenon by analysing both datasets and embedding space. By focusing on simple left-right positional relations, we show that this behaviour is entirely predictable, even with large-scale datasets, demonstrate that these relations can be taught using synthetic data and show that this approach can generalise well to natural images - improving the performance on left-right relations on Visual Genome Relations.