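methods: The study uses the open-source Falcon-7b model and OpenAI's GPT-4 model, and evaluates their responses across different personas.
results: The study finds that the guardrails in the GPT-4 mixture-of-experts models with a supervisor, while useful for assuring AI alignment in general, are detrimental when constructing personas with a variety of uncommon viewpoints.
Abstract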
In this study we intentionally introduce biases into large language model responses in an attempt to create specific personas for interactive media purposes. We explore the differences between open-source models such as Falcon-7b and the GPT-4 model from OpenAI, and we quantify some differences in the responses afforded by the two systems. We find that the guardrails in the GPT-4 mixture-of-experts models with a supervisor, while useful in assuring AI alignment in general, are detrimental in trying to construct personas with a variety of uncommon viewpoints. This study aims to set the groundwork for future exploration of intentional biases in large language models, so that these practices can be applied in creative fields and new forms of media.
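Summary
In this study, we deliberately introduce biases into large language model responses to create specific personas for interactive media. We compare the open-source Falcon-7b model with OpenAI's GPT-4 and quantify some differences between their responses. We find that the supervisor in GPT-4's mixture-of-experts models, while useful for ensuring AI alignment, is a disadvantage when constructing personas with many different viewpoints. The study aims to lay the groundwork for future work on intentional biases in large language models, so that these techniques can be applied in the arts and in new media.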
paper_authors: Luke Bates, Peter Ebert Christensen, Preslav Nakov, Iryna Gurevych
for: The paper aims to improve the understanding of memes and their context, and to develop a method to inject context into machine learning models for better meme classification.
methods: The authors release a large knowledge base of memes and information from www.knowyourmeme.com, and create a non-parametric majority-based classifier called Template-Label Counter (TLC) to test their hypothesis that meme templates can provide missing context for machine learning models.
results: The authors conduct thorough classification experiments and exploratory data analysis to demonstrate the effectiveness of their method and the value of their knowledge base for meme analysis tasks.
Abstract
Memes are a modern form of communication, and meme templates possess a base semantics that is customizable by whoever posts them on social media. Machine learning systems struggle with memes, which is likely due to such systems having insufficient context to understand memes, as there is more to memes than the obvious image and text. Here, to aid understanding of memes, we release a knowledge base of memes and information found on www.knowyourmeme.com, which we call the Know Your Meme Knowledge Base (KYMKB), composed of more than 54,000 images. The KYMKB includes popular meme templates, examples of each template, and detailed information about the template. We hypothesize that meme templates can be used to inject models with the context missing from previous approaches. To test our hypothesis, we create a non-parametric majority-based classifier, which we call Template-Label Counter (TLC). We find TLC more effective than or competitive with fine-tuned baselines. To demonstrate the power of meme templates and the value of both our knowledge base and method, we conduct thorough classification experiments and exploratory data analysis in the context of five meme analysis tasks.
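Summary
Memes are a modern form of communication, and they carry a base semantics that can be customized and shared on social media. Machine learning systems struggle with memes, probably because they lack the context of memes: there is more to a meme than its image and text. To aid the understanding of memes, we release a knowledge base built from www.knowyourmeme.com, called the Know Your Meme Knowledge Base (KYMKB), which contains more than 54,000 images. The KYMKB includes popular meme templates, examples of each template, and detailed information about it. We hypothesize that meme templates can supply the context missing from previous approaches. To test this hypothesis, we create a non-parametric majority-based classifier called Template-Label Counter (TLC). We find TLC to be more effective than or competitive with fine-tuned baselines. To demonstrate the power of meme templates and of our knowledge base and method, we conduct thorough classification experiments and exploratory data analysis on five meme analysis tasks.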
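The idea of a non-parametric, majority-based classifier over templates can be illustrated with a minimal sketch. It assumes that each meme has already been matched to a template identifier (how the paper performs that matching is not shown here), and the function names, template IDs, and labels are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_template_label_counts(examples):
    """Count how often each label occurs for each template.

    `examples` is an iterable of (template_id, label) pairs.
    """
    counts = defaultdict(Counter)
    for template_id, label in examples:
        counts[template_id][label] += 1
    return counts


def predict(counts, template_id, fallback_label):
    """Return the majority label for a template, or the fallback if unseen."""
    if template_id not in counts:
        return fallback_label
    return counts[template_id].most_common(1)[0][0]


# Hypothetical training data: (template_id, label) pairs.
train = [
    ("drake", "positive"), ("drake", "positive"), ("drake", "negative"),
    ("pepe", "negative"), ("pepe", "negative"),
]
counts = fit_template_label_counts(train)

# Fall back to the most frequent label overall for unseen templates.
global_majority = Counter(label for _, label in train).most_common(1)[0][0]

print(predict(counts, "drake", global_majority))
print(predict(counts, "unseen_template", global_majority))
```

There are no trained parameters in this sketch: prediction is a lookup into label counts, which is what makes the approach non-parametric.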
Robust Text Classification: Analyzing Prototype-Based Networks
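results: The experiments show that PBNs remain robust when facing realistic perturbations. Their robustness is supported mostly by the objective function that keeps prototypes interpretable, and their advantage over vanilla models becomes more salient as datasets get more complex.
Abstract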
Downstream applications often require text classification models to be accurate, robust, and interpretable. While the accuracy of state-of-the-art language models approximates human performance, they are not designed to be interpretable and often exhibit a drop in performance on noisy data. The family of Prototype-Based Networks (PBNs), which classify examples based on their similarity to prototypical examples of a class (prototypes), is natively interpretable and has been shown to be robust to noise, which enabled its wide usage for computer vision tasks. In this paper, we study whether the robustness properties of PBNs transfer to text classification tasks. We design a modular and comprehensive framework for studying PBNs, which includes different backbone architectures, backbone sizes, and objective functions. Our evaluation protocol assesses the robustness of models against character-, word-, and sentence-level perturbations. Our experiments on three benchmarks show that the robustness of PBNs transfers to NLP classification tasks facing realistic perturbations. Moreover, the robustness of PBNs is supported mostly by the objective function that keeps prototypes interpretable, while the robustness superiority of PBNs over vanilla models becomes more salient as datasets get more complex.
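Classification by similarity to prototypes can be sketched as follows. This is a toy illustration, not the paper's framework: it assumes text has already been embedded by some backbone encoder, uses hand-written vectors in place of learned prototypes, and uses cosine similarity as an assumed similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(embedding, prototypes):
    """Return the label and similarity of the most similar prototype."""
    best = max(prototypes, key=lambda p: cosine(embedding, p["vector"]))
    return best["label"], cosine(embedding, best["vector"])


# Hypothetical prototypes; in a real PBN these are learned during training.
prototypes = [
    {"label": "sports", "vector": [0.9, 0.1, 0.0]},
    {"label": "politics", "vector": [0.1, 0.9, 0.2]},
]

label, similarity = classify([0.8, 0.2, 0.1], prototypes)
print(label, round(similarity, 3))
```

The interpretability comes from the fact that every prediction can be explained by pointing at the prototype that was closest to the input.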
Summary
PerceptionGPT: Effectively Fusing Visual Perception into LLM
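results: Compared with previous methods, the approach handles multiple visual outputs more effectively while reducing training time, training data, and the sequence length during inference. It can help future research equip LLMs with visual perception abilities.
Abstract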
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs). However, effectively harnessing VLLMs for intricate visual perception tasks remains a challenge. In this paper, we present a novel end-to-end framework named PerceptionGPT, which efficiently and effectively equips VLLMs with visual perception abilities by leveraging the representation power of the LLM's token embedding. Our proposed method treats the token embedding of the LLM as the carrier of spatial information, then leverages lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection, segmentation). Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens, and achieves superior performance with fewer trainable parameters, less training data, and shorter training time. Moreover, as only one token embedding is required to decode the visual outputs, the resulting sequence length during inference is significantly reduced. Consequently, our approach enables accurate and flexible representations, seamless integration of visual perception tasks, and efficient handling of multiple visual outputs. We validate the effectiveness and efficiency of our approach through extensive experiments. The results demonstrate significant improvements over previous methods with far fewer trainable parameters and GPU hours, which facilitates future research in equipping LLMs with visual perception abilities.
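Summary
Combining visual inputs with large language models (LLMs) has led to major advances in multi-modal capabilities and given rise to visual large language models (VLLMs). However, getting VLLMs to perform intricate visual perception tasks remains a major challenge. In this paper, we propose a new end-to-end framework, PerceptionGPT, which efficiently and effectively equips VLLMs with visual perception abilities. Our method treats the LLM's token embedding as the carrier of spatial information and uses lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection and segmentation). It alleviates the training difficulty of earlier methods and needs fewer trainable parameters, less training data, and less training time. Moreover, because only one token embedding is needed to decode the visual outputs, the sequence length during inference is reduced. This allows accurate and flexible representations and efficient handling of multiple visual outputs. We validate the effectiveness and efficiency of our method through extensive experiments. The results show that it improves on previous methods with far fewer trainable parameters and GPU hours, which facilitates future research on giving LLMs visual perception abilities.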
BizBench: A Quantitative Reasoning Benchmark for Business and Finance
results: An evaluation of open-source and commercial models illustrates that the benchmark is challenging for quantitative reasoning.
Abstract
As large language models (LLMs) impact a growing number of complex domains, it is becoming increasingly important to have fair, accurate, and rigorous evaluation benchmarks. Evaluating the reasoning skills required for business and financial NLP stands out as a particularly difficult challenge. We introduce BizBench, a new benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises 8 quantitative reasoning tasks. Notably, BizBench targets the complex task of question-answering (QA) for structured and unstructured financial data via program synthesis (i.e., code generation). We introduce three diverse financially-themed code-generation tasks from newly collected and augmented QA data. Additionally, we isolate distinct financial reasoning capabilities required to solve these QA tasks: reading comprehension of financial text and tables, which is required to extract correct intermediate values; and understanding domain knowledge (e.g., financial formulas) needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to extract numeric entities from financial documents, and capacity to solve problems with code. We conduct an in-depth evaluation of open-source and commercial LLMs, illustrating that BizBench is a challenging benchmark for quantitative reasoning in the finance and business domain.
Summary
As large language models (LLMs) impact an increasing number of complex domains, it is becoming increasingly important to have fair, accurate, and rigorous evaluation benchmarks. Evaluating the reasoning skills required for business and financial NLP is a particularly difficult challenge. We introduce BizBench, a new benchmark for evaluating models' ability to reason about realistic financial problems. BizBench consists of 8 quantitative reasoning tasks. Notably, BizBench targets the complex task of question-answering (QA) for structured and unstructured financial data via program synthesis (i.e., code generation). We introduce three diverse financially-themed code-generation tasks from newly collected and augmented QA data. Additionally, we isolate distinct financial reasoning capabilities required to solve these QA tasks, including reading comprehension of financial text and tables, which is necessary to extract correct intermediate values, and understanding domain knowledge (e.g., financial formulas) needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to extract numeric entities from financial documents, and capacity to solve problems with code. We conduct an in-depth evaluation of open-source and commercial LLMs, illustrating that BizBench is a challenging benchmark for quantitative reasoning in the finance and business domain.
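Question answering via program synthesis means that the model answers a financial question by emitting a short program whose result is the answer. The sketch below shows what such a generated program might look like for a hypothetical question ("What was the percentage change in revenue from 2021 to 2022?"). The question, the figures, and the function name are invented for illustration and are not taken from the benchmark.

```python
def percent_change(old_value, new_value):
    """Percentage change from old_value to new_value."""
    return (new_value - old_value) / old_value * 100.0


# Intermediate values a model would need to extract from the document
# (hypothetical figures, in millions of dollars).
revenue_2021 = 120.0
revenue_2022 = 150.0

# Domain knowledge: the formula for percentage change.
answer = percent_change(revenue_2021, revenue_2022)
print(answer)
```

The two commented steps mirror the two capabilities the abstract isolates: extracting the correct intermediate values from text or tables, and knowing the formula needed to combine them.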
From Classification to Generation: Insights into Crosslingual Retrieval Augmented ICL
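results: The approach yields steady improvements in classification tasks but faces challenges in generation tasks. The evaluation offers insights into the performance dynamics of retrieval-augmented in-context learning across classification and generation.
Abstract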
The remarkable ability of Large Language Models (LLMs) to understand and follow instructions has sometimes been limited by their in-context learning (ICL) performance in low-resource languages. To address this, we introduce a novel approach that leverages cross-lingual retrieval-augmented in-context learning (CREA-ICL). By extracting semantically similar prompts from high-resource languages, we aim to improve the zero-shot performance of multilingual pre-trained language models (MPLMs) across diverse tasks. Though our approach yields steady improvements in classification tasks, it faces challenges in generation tasks. Our evaluation offers insights into the performance dynamics of retrieval-augmented in-context learning across both classification and generation domains.
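Summary
The remarkable ability of LLMs to understand and follow instructions is sometimes limited by their ICL performance in low-resource languages. To address this, we propose a new approach, cross-lingual retrieval-augmented ICL (CREA-ICL). By extracting semantically similar prompts from high-resource languages, we aim to improve the zero-shot performance of multilingual pre-trained language models (MPLMs). Although our approach yields steady improvements in classification tasks, it faces challenges in generation tasks. Our evaluation offers insights into the performance dynamics of retrieval-augmented ICL across classification and generation.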
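The retrieval step can be sketched as a nearest-neighbour search over a pool of labelled high-resource examples, followed by assembling the retrieved examples into a prompt. This is a minimal sketch under stated assumptions: the vectors stand in for the output of a multilingual sentence encoder, the prompt template is invented, and the query text is a placeholder for a low-resource-language input.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, pool, k=2):
    """Return the k pool examples most similar to the query vector."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vector"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(retrieved, query_text):
    """Place retrieved examples before the query as in-context demonstrations."""
    blocks = [f"Review: {ex['text']}\nSentiment: {ex['label']}"
              for ex in retrieved]
    blocks.append(f"Review: {query_text}\nSentiment:")
    return "\n\n".join(blocks)


# Hypothetical high-resource (English) pool with toy embedding vectors.
pool = [
    {"text": "The film was wonderful.", "label": "positive", "vector": [0.9, 0.1]},
    {"text": "Terrible service.", "label": "negative", "vector": [0.1, 0.9]},
    {"text": "I loved every minute.", "label": "positive", "vector": [0.8, 0.3]},
]

query_text = "<review in a low-resource language>"
query_vec = [0.85, 0.2]  # assumed encoder output for the query

retrieved = retrieve(query_vec, pool, k=2)
prompt = build_prompt(retrieved, query_text)
print(prompt)
```

The resulting prompt would then be passed to a multilingual model, which is not shown here.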
Zero-Shot Cross-Lingual Sentiment Classification under Distribution Shift: an Exploratory Study
paper_authors: Maarten De Raedt, Semere Kiros Bitew, Fréderic Godin, Thomas Demeester, Chris Develder
for: This paper is focused on studying the generalization of multi-lingual language models to out-of-distribution (OOD) test data in zero-shot cross-lingual transfer settings, and analyzing the impact of both language and domain shifts on performance.
methods: The paper uses counterfactually augmented data (CAD) to improve OOD generalization in the cross-lingual setting, and proposes two new approaches that avoid the costly annotation process associated with CAD.
results: The paper evaluates the performance of three multilingual models (LaBSE, mBERT, and XLM-R) on OOD test sets in 13 languages, and finds that the proposed cost-effective approaches reach similar or up to +3.1% better accuracy than CAD for Amazon and Restaurant reviews.
Abstract
The brittleness of finetuned language model performance on out-of-distribution (OOD) test samples in unseen domains has been well-studied for English, yet is unexplored for multi-lingual models. Therefore, we study generalization to OOD test data specifically in zero-shot cross-lingual transfer settings, analyzing performance impacts of both language and domain shifts between train and test data. We further assess the effectiveness of counterfactually augmented data (CAD) in improving OOD generalization for the cross-lingual setting, since CAD has been shown to benefit in a monolingual English setting. Finally, we propose two new approaches for OOD generalization that avoid the costly annotation process associated with CAD, by exploiting the power of recent large language models (LLMs). We experiment with 3 multilingual models, LaBSE, mBERT, and XLM-R trained on English IMDb movie reviews, and evaluate on OOD test sets in 13 languages: Amazon product reviews, Tweets, and Restaurant reviews. Results echo the OOD performance decline observed in the monolingual English setting. Further, (i) counterfactuals from the original high-resource language do improve OOD generalization in the low-resource language, and (ii) our newly proposed cost-effective approaches reach similar or up to +3.1% better accuracy than CAD for Amazon and Restaurant reviews.
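Summary
The brittleness of English language models on out-of-distribution (OOD) test samples from different domains has been widely studied, but it has not yet been explored for multilingual models. We therefore study OOD generalization in zero-shot cross-lingual transfer settings, analyzing how language and domain shifts between training and test data affect performance. We also assess the effectiveness of counterfactually augmented data (CAD) in the cross-lingual setting, since CAD has been shown to improve OOD generalization in the English setting. Finally, we propose two new approaches to OOD generalization that avoid the costly annotation process associated with CAD by exploiting recent large language models (LLMs). We train three multilingual models, LaBSE, mBERT, and XLM-R, on English IMDb movie reviews and evaluate them on OOD test sets in 13 languages: Amazon product reviews, Tweets, and Restaurant reviews. The results show an OOD performance decline similar to that observed in the English setting. Furthermore, (i) counterfactuals from the original high-resource language do improve OOD generalization in the low-resource language, and (ii) our newly proposed cost-effective approaches reach similar or up to +3.1% better accuracy than CAD for Amazon and Restaurant reviews.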
Enhancing Public Understanding of Court Opinions with Automated Summarizers
for: To help non-experts understand legal cases
methods: Using an AI assistant to generate simplified summaries
results: A survey experiment shows that simplified summaries can help non-experts understand the key features of a ruling.
Abstract
Written judicial opinions are an important tool for building public trust in court decisions, yet they can be difficult for non-experts to understand. We present a pipeline for using an AI assistant to generate simplified summaries of judicial opinions. These are more accessible to the public and more easily understood by non-experts. We show in a survey experiment that the simplified summaries help respondents understand the key features of a ruling. We discuss how to integrate legal domain knowledge into studies using large language models. Our results suggest a role both for AI assistants to inform the public, and for lawyers to guide the process of generating accessible summaries.
Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation
paper_authors: Marta R. Costa-jussà, David Dale, Maha Elbayad, Bokai Yu
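for: The paper proposes a novel pipeline that identifies added toxicity and mitigates it at inference time.
methods: The paper uses a multimodal (speech and text) toxicity detection classifier that works across languages at scale. The mitigation method is applied directly to text outputs.
results: Applied to the SEAMLESSM4T system, the MinTox pipeline achieves significant added toxicity mitigation across domains, modalities, and language directions. MinTox filters out approximately 25% to 95% of added toxicity (depending on the modality and domain) while keeping translation quality.
Abstract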
Added toxicity in the context of translation refers to producing a translation output with more toxicity than exists in the input. In this paper, we present MinTox, a novel pipeline that identifies added toxicity and mitigates it at inference time. MinTox uses a toxicity detection classifier which is multimodal (speech and text) and works in languages at scale. The mitigation method is applied to languages at scale and directly in text outputs. MinTox is applied to SEAMLESSM4T, which is the latest multimodal and massively multilingual machine translation system. For this system, MinTox achieves significant added toxicity mitigation across domains, modalities and language directions. MinTox manages to filter out approximately 25% to 95% of added toxicity (depending on the modality and domain) while keeping translation quality.
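Summary
Added toxicity in translation refers to producing a translation output that contains more toxicity than the input. In this paper, we present MinTox, a new pipeline that identifies added toxicity and mitigates it at inference time. MinTox uses a multimodal (speech and text) toxicity detection classifier that works across many languages and modalities. The mitigation method is applied directly to text outputs. MinTox is applied to SEAMLESSM4T, the latest multimodal and massively multilingual translation system. For this system, MinTox achieves significant added toxicity mitigation across domains, modalities, and language directions, filtering out 25% to 95% of added toxicity (depending on the modality and domain) without affecting translation quality.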
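The definition of added toxicity (more toxicity in the output than in the input) can be made concrete with a small sketch. It replaces the paper's multimodal classifier with a toy word-list counter, which is an assumption made purely for illustration; the lexicons, sentences, and function names are hypothetical, and the mitigation step is not shown.

```python
def count_toxic(text, lexicon):
    """Count tokens of `text` that appear in a toxicity lexicon."""
    tokens = [tok.strip(".,!?;:").lower() for tok in text.split()]
    return sum(tok in lexicon for tok in tokens)


def has_added_toxicity(source, output, source_lexicon, target_lexicon):
    """True if the translation contains more toxic terms than the source."""
    return (count_toxic(output, target_lexicon)
            > count_toxic(source, source_lexicon))


# Toy per-language lexicons (stand-ins for a real toxicity detector).
spanish_lexicon = {"idiota"}
english_lexicon = {"idiot"}

source = "Eres muy listo."
bad_translation = "You are an idiot."
good_translation = "You are very clever."

print(has_added_toxicity(source, bad_translation,
                         spanish_lexicon, english_lexicon))
print(has_added_toxicity(source, good_translation,
                         spanish_lexicon, english_lexicon))
```

A source sentence that is itself toxic and is translated faithfully would not be flagged, because the comparison is relative to the input rather than absolute.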
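results: The study finds that a large memorization capacity undermines the generalization opportunity. It proposes using Minimum Description Length (MDL) to determine during training which memories to store, and how many of them.
Abstract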
Associative memory architectures are designed for memorization but also offer, through their retrieval method, a form of generalization to unseen inputs: stored memories can be seen as prototypes from this point of view. Focusing on Modern Hopfield Networks (MHN), we show that a large memorization capacity undermines the generalization opportunity. We offer a solution to better optimize this tradeoff. It relies on Minimum Description Length (MDL) to determine during training which memories to store, as well as how many of them.
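Summary
Associative memory architectures are designed to store information, but through their retrieval method they also offer a way to generalize to unseen inputs: stored memories can be seen as prototypes. Focusing on Modern Hopfield Networks (MHN), we show that a large storage capacity undermines the generalization opportunity. We propose a solution based on Minimum Description Length (MDL) that determines during training which memories to store and how many of them.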
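The retrieval method referred to above can be sketched with the standard one-step update of a Modern Hopfield Network, in which a query is replaced by a softmax-weighted combination of the stored memories. The sketch covers retrieval only; the MDL-based selection of memories is not shown, and the memory vectors and the inverse temperature `beta` are arbitrary illustrative values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(memories, query, beta=4.0):
    """One retrieval step: softmax(beta * <memory, query>) weighted sum."""
    scores = [beta * sum(m * q for m, q in zip(memory, query))
              for memory in memories]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * memory[d] for w, memory in zip(weights, memories))
            for d in range(dim)]


# Two stored memories acting as prototypes, and a noisy, unseen query.
memories = [[1.0, 0.0], [0.0, 1.0]]
query = [0.9, 0.2]

retrieved = retrieve(memories, query, beta=8.0)
print([round(x, 3) for x in retrieved])
```

With a large `beta` the output snaps to the single closest memory, while a small `beta` blends several memories, which is one way to see why the number and choice of stored memories affects how unseen inputs are handled.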
L3 Ensembles: Lifelong Learning Approach for Ensemble of Foundational Language Models
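results: Experiments show that the proposed L3 ensemble method increases model accuracy by 4% to 36% compared to the fine-tuned FLM, while maintaining competitive or superior performance (up to a 15.4% increase in accuracy) compared to the state-of-the-art language model (T5) on the STS benchmark.
Abstract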
Fine-tuning pre-trained foundational language models (FLM) for specific tasks is often impractical, especially for resource-constrained devices. This necessitates the development of a Lifelong Learning (L3) framework that continuously adapts to a stream of Natural Language Processing (NLP) tasks efficiently. We propose an approach that focuses on extracting meaningful representations from unseen data, constructing a structured knowledge base, and improving task performance incrementally. We conducted experiments on various NLP tasks to validate its effectiveness, including benchmarks like GLUE and SuperGLUE. We measured good performance across the accuracy, training efficiency, and knowledge transfer metrics. Initial experimental results show that the proposed L3 ensemble method increases the model accuracy by 4% to 36% compared to the fine-tuned FLM. Furthermore, the L3 model outperforms naive fine-tuning approaches while maintaining competitive or superior performance (up to a 15.4% increase in accuracy) compared to the state-of-the-art language model (T5) on the given task, the STS benchmark.
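Summary
Fine-tuning pre-trained foundational language models (FLM) for specific tasks is often impractical, especially on resource-constrained devices. This calls for a Lifelong Learning (L3) framework that efficiently adapts to a stream of Natural Language Processing (NLP) tasks. We propose an approach that focuses on extracting meaningful representations from unseen data, building a structured knowledge base, and improving task performance incrementally. We ran experiments on several NLP tasks to validate its effectiveness, including benchmarks such as GLUE and SuperGLUE, and found that it performs well on accuracy, training efficiency, and knowledge transfer metrics. Initial experimental results show that the proposed L3 ensemble method increases model accuracy by 4% to 36% compared to the fine-tuned FLM. In addition, the L3 model maintains competitive or superior performance (up to a 15.4% increase in accuracy) compared to the state-of-the-art language model (T5) on the STS benchmark.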
DocGen: Generating Detailed Parameter Docstrings in Python
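results: A comparison with existing generative models, using automatic metrics and a human-centred evaluation with 17 developers, shows the approach to be superior to existing methods.
Abstract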
Documentation debt hinders the effective utilization of open-source software. Although code summarization tools have been helpful for developers, most would prefer a detailed account of each parameter in a function rather than a high-level summary. However, generating such a summary is too intricate for a single generative model to produce reliably due to the lack of high-quality training data. Thus, we propose a multi-step approach that combines multiple task-specific models, each adept at producing a specific section of a docstring. The combination of these models ensures the inclusion of each section in the final docstring. We compared the results from our approach with existing generative models using both automatic metrics and a human-centred evaluation with 17 participating developers, which proves the superiority of our approach over existing methods.
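Summary
Documentation debt hinders the effective use of open-source software. Although code summarization tools help developers, most developers would prefer a detailed account of each function parameter over a high-level summary. However, a single generative model cannot reliably produce such a description because high-quality training data is lacking. We therefore propose a multi-step approach that combines multiple task-specific models to ensure that every section is included in the final docstring. We compare our approach with existing generative models using automatic metrics and a human-centred evaluation with 17 participants, which shows that it outperforms existing methods.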
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
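results: The paper releases several baseline CRED (Character REDundancy) scores with reference implementations and evaluates their effectiveness on BREAD, targeting repetitive text in language model training data.
Abstract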
Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A typical and insidious issue, affecting both training data and model output, is data that is repetitive and dominated by linguistically uninteresting boilerplate, such as price catalogs or computer-generated log files. Though this problem permeates many web-scraped corpora, there has yet to be a benchmark to test against, or a systematic study to find simple metrics that generalize across languages and agree with human judgements of data quality. In the present work, we create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content, spanning 360 languages. We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD. We hope that the community will use this resource to develop better filtering methods, and that our reference implementations of CRED scores can become standard corpus evaluation tools, driving the development of cleaner language modeling corpora, especially in low-resource languages.
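Summary
Data quality is a problem that keeps resurfacing in NLP regardless of task, domain, or architecture, and it is especially severe for low-resource languages. A common issue in both training data and model output is repetitive, linguistically uninteresting boilerplate, such as price catalogs or computer-generated log files. Although this problem is widespread in web-scraped data, there is not yet a benchmark to test against, nor a systematic study to find simple metrics that agree with human judgements of data quality. In this work, we create BREAD, a human-labeled benchmark spanning 360 languages. We release several baseline CRED (Character REDundancy) scores and evaluate their effectiveness on BREAD. We hope the community will use this resource to develop better filtering methods and improve the quality of language modeling corpora, especially for low-resource languages.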
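To give a feel for what a character-level redundancy measure might look like, the sketch below uses a compression ratio as a proxy: highly repetitive text compresses much better than ordinary prose. This is an assumed stand-in chosen for illustration, not necessarily one of the paper's CRED scores, and the sample texts are invented.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; lower means more redundant."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical samples: catalog-style boilerplate versus ordinary prose.
boilerplate = "Price: 9.99 USD. " * 50
prose = ("The committee met on Tuesday to review the proposal, "
         "but several members asked for more time to study the budget "
         "figures before casting a vote.")

print(round(compression_ratio(boilerplate), 3))
print(round(compression_ratio(prose), 3))
```

Because the measure operates on bytes rather than words, it needs no tokenizer or language-specific resources, which is the kind of property that matters when a metric has to generalize across many languages.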