cs.CL - 2023-12-07

Improved Visual Grounding through Self-Consistent Explanations

  • paper_url: http://arxiv.org/abs/2312.04554
  • repo_url: None
  • paper_authors: Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez
  • for: This paper aims to improve the localization ("grounding") abilities of vision-and-language models so that they better match specific image regions with text.
  • methods: The authors augment existing text-image datasets with paraphrases generated by a large language model and propose SelfEQ, a weakly-supervised self-consistency objective on visual explanation maps that finetunes the model so that a phrase and its paraphrase map to the same region in the image (sketched below).
  • results: SelfEQ improves over a strong baseline and several prior works, reaching 84.07% on Flickr30k (+4.69% absolute), 67.40% on ReferIt (+7.68% absolute), and 75.10%/55.49% on RefCOCO+ test sets A and B (+3.74% absolute on average).
    Abstract Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --"grounding"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, and SelfEQ, a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle, and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline method and several prior works. Particularly, comparing to other methods that do not use any type of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10%, 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average).
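A minimal sketch of the kind of self-consistency objective SelfEQ describes, in PyTorch. The symmetric-KL penalty, map normalization, and tensor shapes are illustrative assumptions, not the authors' implementation; in the paper the maps come from GradCAM on a vision-and-language model.

```python
# Hypothetical SelfEQ-style consistency loss (assumed form, not the paper's
# code): penalize disagreement between gradient-based explanation maps for a
# phrase and its LLM-generated paraphrase on the same image.
import torch
import torch.nn.functional as F

def selfeq_loss(map_phrase: torch.Tensor, map_paraphrase: torch.Tensor) -> torch.Tensor:
    """Both inputs are (H, W) explanation maps (e.g., GradCAM) for one image."""
    # Normalize each map into a distribution over spatial locations.
    p = F.softmax(map_phrase.flatten(), dim=0)
    q = F.softmax(map_paraphrase.flatten(), dim=0)
    # Symmetric KL divergence as one plausible consistency penalty.
    return 0.5 * (F.kl_div(q.log(), p, reduction="sum")
                  + F.kl_div(p.log(), q, reduction="sum"))

if __name__ == "__main__":
    torch.manual_seed(0)
    cam_a = torch.randn(7, 7, requires_grad=True)  # map for "a dog"
    cam_b = torch.randn(7, 7, requires_grad=True)  # map for "a canine"
    loss = selfeq_loss(cam_a, cam_b)
    loss.backward()  # in training, gradients would flow back into the model
    print(f"consistency loss: {loss.item():.4f}")
```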

Efficient Monotonic Multihead Attention

  • paper_url: http://arxiv.org/abs/2312.04515
  • repo_url: None
  • paper_authors: Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello
  • for: simultaneous speech-to-text translation on Spanish and English translation tasks
  • methods: the Efficient Monotonic Multihead Attention (EMMA) model, with numerically-stable, unbiased monotonic alignment estimation, plus improved training and inference strategies, including simultaneous fine-tuning from an offline translation model and reduction of monotonic alignment variance
  • results: Experiments show the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task.
    Abstract We introduce the Efficient Monotonic Multihead Attention (EMMA), a state-of-the-art simultaneous translation model with numerically-stable and unbiased monotonic alignment estimation. In addition, we present improved training and inference strategies, including simultaneous fine-tuning from an offline translation model and reduction of monotonic alignment variance. The experimental results demonstrate that the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task.

An LLM Compiler for Parallel Function Calling

  • paper_url: http://arxiv.org/abs/2312.04511
  • repo_url: https://github.com/squeezeailab/llmcompiler
  • paper_authors: Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
  • for: LLMCompiler is designed to efficiently orchestrate multi-function calling in LLMs, addressing the limitations of current methods such as high latency and cost.
  • methods: LLMCompiler uses three components - LLM Planner, Task Fetching Unit, and Executor - to execute functions in parallel, streamlining parallel function calling based on classical compiler principles.
  • results: LLMCompiler achieves consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to ~9% compared to ReAct, with up to 1.35x latency gain over OpenAI’s recent parallel function calling method.
    Abstract Large Language Models (LLMs) have shown remarkable results on various complex reasoning benchmarks. The reasoning capabilities of LLMs enable them to execute function calls, using user-provided functions to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has expanded LLMs' scope to include multi-function calling, where LLMs are equipped with a variety of functions and select the proper functions based on the context. Multi-function calling abilities of LLMs have catalyzed LLM-based software development, allowing them to tackle more complex problems. However, current methods for multi-function calling often require sequential reasoning and acting for each function which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multi-function calling. Drawing from the principles of classical compilers, LLMCompiler streamlines parallel function calling with three components: (i) an LLM Planner, formulating execution strategies and dependencies; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically computes an optimized orchestration for the function calls and can be used with open-source models such as LLaMA-2. We have benchmarked LLMCompiler on a range of tasks including cases with non-trivial inter-dependency between function calls, as well as cases that require dynamic replanning based on intermediate results. We observe consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to ~9% as compared to ReAct. Additionally, LLMCompiler achieves up to 1.35x latency gain over OpenAI's recent parallel function calling, while achieving similar accuracy.
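The planner/fetcher/executor split can be pictured with a toy dependency-DAG scheduler. The `Task` format and scheduling loop below are assumptions for illustration, not LLMCompiler's actual API; in the real system the plan comes from the LLM Planner and the tasks are function calls.

```python
# Toy sketch of LLMCompiler's three components (assumed interfaces): a planner
# emits a DAG of function-call tasks, a fetching unit dispatches tasks whose
# dependencies are met, and an executor runs ready tasks concurrently.
import asyncio
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    fn: callable
    deps: list = field(default_factory=list)  # names of tasks this one waits on

async def execute_plan(tasks: list[Task]) -> dict:
    results: dict[str, object] = {}
    pending = {t.name: t for t in tasks}
    while pending:
        # Task Fetching Unit: find tasks whose dependencies are all resolved.
        ready = [t for t in pending.values() if all(d in results for d in t.deps)]
        if not ready:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        # Executor: run all ready tasks in parallel rather than one at a time.
        outs = await asyncio.gather(
            *(asyncio.to_thread(t.fn, *[results[d] for d in t.deps]) for t in ready)
        )
        for t, out in zip(ready, outs):
            results[t.name] = out
            del pending[t.name]
    return results

if __name__ == "__main__":
    # The plan an "LLM Planner" might produce for: (search A, search B) -> compare.
    plan = [
        Task("search_a", lambda: "population of A = 1M"),
        Task("search_b", lambda: "population of B = 2M"),
        Task("compare", lambda a, b: f"compare({a!r}, {b!r})", deps=["search_a", "search_b"]),
    ]
    print(asyncio.run(execute_plan(plan))["compare"])
```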

A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation

  • paper_url: http://arxiv.org/abs/2312.04510
  • repo_url: None
  • paper_authors: Jarad Forristal, Niloofar Mireshghallah, Greg Durrett, Taylor Berg-Kirkpatrick
  • for: This paper targets controllable text generation with energy-based language models.
  • methods: a novel Metropolis-Hastings sampler that, at each step, proposes a rewrite of the entire sequence via iterative prompting of a large language model (see the sketch below)
  • results: The sampler samples the target distribution more efficiently and accurately, lets generation length be determined by the sampling procedure rather than fixed in advance, and yields better downstream performance on two controlled generation tasks than single-token proposal techniques.
    Abstract Recent work has shown that energy-based language modeling is an effective framework for controllable text generation because it enables flexible integration of arbitrary discriminators. However, because energy-based LMs are globally normalized, approximate techniques like Metropolis-Hastings (MH) are required for inference. Past work has largely explored simple proposal distributions that modify a single token at a time, like in Gibbs sampling. In this paper, we develop a novel MH sampler that, in contrast, proposes re-writes of the entire sequence in each step via iterative prompting of a large language model. Our new sampler (a) allows for more efficient and accurate sampling from a target distribution and (b) allows generation length to be determined through the sampling procedure rather than fixed in advance, as past work has required. We perform experiments on two controlled generation tasks, showing both downstream performance gains and more accurate target distribution sampling in comparison with single-token proposal techniques.
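A minimal sketch of the block Metropolis-Hastings loop the abstract describes, with toy stand-ins for the energy function and the LLM-driven rewrite proposal. The acceptance rule shown assumes a symmetric proposal; the paper's sampler additionally accounts for the proposal probabilities of the rewriting LLM.

```python
# Toy block-MH sampler: each step proposes a rewrite of the *entire* sequence,
# then accepts or rejects it against an energy-based target distribution.
import math
import random

def energy(text: str) -> float:
    """Stand-in energy; in the paper this combines an LM with discriminators."""
    return abs(len(text.split()) - 8)  # toy: prefer ~8-word sequences

def propose_rewrite(text: str) -> str:
    """Stand-in for iteratively prompting an LLM to rewrite the sequence."""
    words = text.split()
    if random.random() < 0.5 and len(words) > 1:
        words.pop(random.randrange(len(words)))
    else:
        words.insert(random.randrange(len(words) + 1), "token")
    return " ".join(words)  # note: length can change, unlike single-token Gibbs

def mh_sample(init: str, steps: int = 200) -> str:
    x = init
    for _ in range(steps):
        y = propose_rewrite(x)
        # Symmetric-proposal MH acceptance: min(1, exp(E(x) - E(y))).
        if random.random() < min(1.0, math.exp(energy(x) - energy(y))):
            x = y
    return x

if __name__ == "__main__":
    random.seed(0)
    print(mh_sample("a short start"))
```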

On the Learnability of Watermarks for Language Models

  • paper_url: http://arxiv.org/abs/2312.04469
  • repo_url: https://github.com/chenchenygu/watermark-learnability
  • paper_authors: Chenchen Gu, Xiang Lisa Li, Percy Liang, Tatsunori Hashimoto
  • for: This paper focuses on the learnability of watermarks for the responsible deployment of language models.
  • methods: The authors propose watermark distillation, which trains a student model to mimic the behavior of a teacher model that uses decoding-based watermarking (a hedged sketch follows below).
  • results: Across three distinct decoding-based watermarking strategies and various hyperparameter settings, models learn to generate watermarked text with high detectability; limitations include the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.
    Abstract Watermarking of language model outputs enables statistical detection of model-generated text, which has many applications in the responsible deployment of language models. Existing watermarking strategies operate by altering the decoder of an existing language model, and the ability for a language model to directly learn to generate the watermark would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, allowing for open models to benefit from watermarking. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three distinct decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.
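A hedged sketch of watermark distillation under a KGW-style green-list watermark. The shapes, the fixed (rather than previous-token-seeded) green list, and the KL objective are simplifying assumptions, not the repository's recipe.

```python
# Hedged sketch of watermark distillation: a student is trained to match a
# teacher whose decoder applies a decoding-based watermark, e.g., boosting a
# pseudorandom "green list" of tokens.
import torch
import torch.nn.functional as F

def watermarked_teacher_probs(logits: torch.Tensor, green_mask: torch.Tensor,
                              delta: float = 2.0) -> torch.Tensor:
    """KGW-style watermark: add a bias delta to green-list token logits."""
    return F.softmax(logits + delta * green_mask, dim=-1)

def distill_step(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 green_mask: torch.Tensor) -> torch.Tensor:
    """KL of the student against the teacher's watermarked distribution."""
    target = watermarked_teacher_probs(teacher_logits, green_mask)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")

if __name__ == "__main__":
    torch.manual_seed(0)
    vocab = 50
    teacher = torch.randn(4, vocab)                  # teacher next-token logits
    student = torch.randn(4, vocab, requires_grad=True)
    # Fixed green list for brevity; KGW seeds it from the previous token.
    green = (torch.rand(vocab) < 0.5).float()
    loss = distill_step(student, teacher, green)
    loss.backward()  # in practice this updates the student model
    print(f"distillation loss: {loss.item():.4f}")
```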

OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization

  • paper_url: http://arxiv.org/abs/2312.04440
  • repo_url: https://github.com/liatschiff/openasp
  • paper_authors: Shmuel Amar, Liat Schiff, Ori Ernst, Asi Shefer, Ori Shapira, Ido Dagan
  • for: improving automatic summarization models in real-world scenarios, particularly the targeted aspect-based summarization (TABS) setting
  • methods: a novel and cost-effective annotation protocol that derives an open-aspect dataset from existing generic multi-document summarization datasets
  • results: Current state-of-the-art summarization models and large language models perform poorly on OpenAsp, indicating that its realistic open-aspect setting poses a new challenge for future research.
    Abstract The performance of automatic summarization models has improved dramatically in recent years. Yet, there is still a gap in meeting specific information needs of users in real-world scenarios, particularly when a targeted summary is sought, such as in the useful aspect-based summarization setting targeted in this paper. Previous datasets and studies for this setting have predominantly concentrated on a limited set of pre-defined aspects, focused solely on single document inputs, or relied on synthetic data. To advance research on more realistic scenarios, we introduce OpenAsp, a benchmark for multi-document \textit{open} aspect-based summarization. This benchmark is created using a novel and cost-effective annotation protocol, by which an open aspect dataset is derived from existing generic multi-document summarization datasets. We analyze the properties of OpenAsp showcasing its high-quality content. Further, we show that the realistic open-aspect setting realized in OpenAsp poses a challenge for current state-of-the-art summarization models, as well as for large language models.

When Input Integers are Given in the Unary Numeral Representation

  • paper_url: http://arxiv.org/abs/2312.04348
  • repo_url: None
  • paper_authors: Tomoyuki Yamakami
  • for: studying how the numeral representation of input integers affects the computational complexity of various combinatorial problems
  • methods: The paper analyzes numerous NP-complete (or NP-hard) problems, comparing their computational complexity under binary versus unary representations of the input integers (see the worked example below).
  • results: Many NP-complete (or NP-hard) problems become easily solvable when input integers are represented in unary; the resulting list of problems highlights the structural differences between strong and non-strong NP-completeness.
    Abstract Many NP-complete problems take integers as part of their input instances. These input integers are generally binarized, that is, provided in the form of the "binary" numeral representation, and the lengths of such binary forms are used as a basis unit to measure the computational complexity of the problems. In sharp contrast, the "unarization" (or the "unary" numeral representation) of numbers has been known to bring a remarkably different effect onto the computational complexity of the problems. When no computational-complexity difference is observed between binarization and unarization of instances, on the contrary, the problems are said to be strong NP-complete. This work attempts to spotlight an issue of how the unarization of instances affects the computational complexity of various combinatorial problems. We present numerous NP-complete (or even NP-hard) problems, which turn out to be easily solvable when input integers are represented in unary. We then discuss the computational complexities of such problems when taking unary-form integer inputs. We hope that a list of such problems signifies the structural differences between strong NP-completeness and non-strong NP-completeness.
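A concrete instance of the binary/unary gap: Subset Sum is NP-complete with binary inputs, but its textbook dynamic program runs in O(n·t) time, which is polynomial in the input length once the target t is written in unary (length roughly t) rather than binary (length roughly log t). A sketch:

```python
# Subset Sum via the classic pseudo-polynomial DP. The O(n * t) running time
# is exponential in the length of t's binary encoding but polynomial in the
# length of its unary encoding -- exactly the distinction the paper studies.
def subset_sum(values: list[int], target: int) -> bool:
    reachable = [True] + [False] * target   # reachable[s]: some subset sums to s?
    for v in values:
        for s in range(target, v - 1, -1):  # iterate downward so each value is used once
            reachable[s] = reachable[s] or reachable[s - v]
    return reachable[target]

if __name__ == "__main__":
    print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # True: 4 + 5
    print(subset_sum([3, 34, 4, 12, 5, 2], 30))  # False
```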

Merging by Matching Models in Task Subspaces

  • paper_url: http://arxiv.org/abs/2312.04339
  • repo_url: https://github.com/r-three/mats
  • paper_authors: Derek Tam, Mohit Bansal, Colin Raffel
  • for: This work aims to cheaply combine individual task-specific models into a single multitask model.
  • methods: Past merging methods are viewed as matching models in different "task subspaces"; the authors connect a model's task subspace to its loss landscape, formalize merging as solving a linear system of equations, and use the conjugate gradient method to find a solution (sketched below).
  • results: The merging framework "Matching Models in their Task Subspace" (MaTS) achieves state-of-the-art results in multitask and intermediate-task model merging; all code and checkpoints are released at https://github.com/r-three/mats.
    Abstract Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a ''task subspace'' in which models are matched before being merged. We connect the task subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as solving a linear system of equations. While past work has generally been limited to linear systems that have a closed-form solution, we consider using the conjugate gradient method to find a solution. We show that using the conjugate gradient method can outperform closed-form solutions, enables merging via linear systems that are otherwise intractable to solve, and flexibly allows choosing from a wide variety of initializations and estimates for the ''task subspace''. We ultimately demonstrate that our merging framework called ''Matching Models in their Task Subspace'' (MaTS) achieves state-of-the-art results in multitask and intermediate-task model merging. We release all of the code and checkpoints used in our work at https://github.com/r-three/mats.
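The conjugate gradient solve at the heart of MaTS, on a toy system. Constructing A and b from per-task statistics is the paper's contribution and is not reproduced here; the snippet only shows how CG finds x in Ax = b iteratively, which is what makes otherwise intractable merging systems solvable and lets the initialization be chosen freely.

```python
# Conjugate gradient for A x = b with symmetric positive-definite A (toy
# system standing in for the paper's task-statistics-derived system).
import numpy as np

def conjugate_gradient(A: np.ndarray, b: np.ndarray, x0: np.ndarray,
                       tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    x = x0.copy()
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    for _ in range(max_iter):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        p = r + (r @ r / rr) * p   # beta = new residual norm / old residual norm
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.normal(size=(5, 5))
    A = M @ M.T + 5 * np.eye(5)   # symmetric positive-definite
    b = rng.normal(size=5)
    x0 = np.zeros(5)              # MaTS could instead start from, e.g., an averaged model
    x = conjugate_gradient(A, b, x0)
    print("residual norm:", np.linalg.norm(A @ x - b))
```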

Beyond Surface: Probing LLaMA Across Scales and Layers

  • paper_url: http://arxiv.org/abs/2312.04333
  • repo_url: https://github.com/nuochenpku/llama_analysis
  • paper_authors: Nuo Chen, Ning Wu, Shining Liang, Ming Gong, Linjun Shou, Dongmei Zhang, Jia Li
  • for: This study probes large language models (LLMs), specifically the prominent open-source foundational model LLaMA, to explore their intrinsic understanding.
  • methods: The authors design multiple-choice tasks probing high-order abilities such as reasoning and computation, analyzing the model horizontally (comparing different sizes) and vertically (assessing different layers); a toy version of the probing setup is sketched below.
  • results: (1) Horizontally, enlarging model size does not automatically impart additional knowledge or computational prowess, but it enhances reasoning, especially math problem solving, and reduces hallucinations, though only beyond certain size thresholds. (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but exhibit logical, multilingual, and recognitive abilities, while the top layers house most of the computational power and real-world knowledge.
    Abstract This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging model sizes almost could not automatically impart additional knowledge or computational prowess. Instead, it can enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) In vertical analysis, the lower layers of LLaMA lack substantial arithmetic and factual knowledge, showcasing logical thinking, multilingual and recognitive abilities, with top layers housing most computational power and real-world knowledge.
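A toy version of the multiple-choice probing setup, as referenced in the methods bullet above. The option scorer is a deterministic stand-in; the paper scores options with LLaMA itself and additionally reads out intermediate layers, which a self-contained snippet cannot reproduce.

```python
# Toy multiple-choice probe: score each candidate answer, pick the best, and
# check whether it matches the gold answer. A real probe would sum the
# model's token log-probabilities for each option.
import math

TOY_KNOWLEDGE = {"What is 7 * 8?": "56"}  # stand-in for what the model "knows"

def option_logprob(question: str, option: str) -> float:
    """Stand-in for a model's summed log-probability of `option` given `question`."""
    return 0.0 if TOY_KNOWLEDGE.get(question) == option else -math.log(2)

def probe_multiple_choice(question: str, options: list[str], answer: str) -> bool:
    scores = {opt: option_logprob(question, opt) for opt in options}
    prediction = max(scores, key=scores.get)
    return prediction == answer

if __name__ == "__main__":
    q = "What is 7 * 8?"
    print(probe_multiple_choice(q, ["54", "56", "58", "64"], "56"))  # True
```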

Prompt Highlighter: Interactive Control for Multi-Modal LLMs

  • paper_url: http://arxiv.org/abs/2312.04302
  • repo_url: https://github.com/dvlab-research/prompt-highlighter
  • paper_authors: Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, Jiaya Jia
  • for: This study targets explicit controllable text generation for multi-modal LLMs (LLMs & VLMs), which offer multi-modal understanding but, owing to their autoregressive generative nature, are less explainable and rely heavily on prompt content.
  • methods: Prompt Highlighter, a novel inference method that lets users highlight specific prompt spans to interactively control the focus during generation; motivated by classifier-free diffusion guidance, it forms regular and unconditional context pairs based on the highlighted tokens, guiding autoregressive generation without any training (sketched below).
  • results: The approach is compatible with current LLMs and VLMs and, by guiding the model's attention with highlighted tokens during inference, yields more controllable and reliable generation; without tuning on LLaVA-v1.5, it scores 69.5 on the MMBench test and 1552.5 on MME-perception.
    Abstract This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secured 69.5 in the MMBench test and 1552.5 in MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/
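A hedged sketch of the classifier-free-guidance-style logit combination Prompt Highlighter builds on. The exact guidance formula, the guidance scale, and how the unconditional (regular) context is formed are assumptions for illustration, not the project's implementation.

```python
# Classifier-free guidance on next-token logits: contrast logits conditioned
# on the highlighted context with logits from the regular context, pushing
# generation toward what the highlighted spans contribute.
import torch

def guided_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
                  gamma: float = 1.5) -> torch.Tensor:
    """cond_logits: given the context with highlighted tokens emphasized;
    uncond_logits: given the regular context; gamma > 1 strengthens guidance."""
    return uncond_logits + gamma * (cond_logits - uncond_logits)

if __name__ == "__main__":
    torch.manual_seed(0)
    cond = torch.randn(1, 32000)    # e.g., one decoding step of a 32k-vocab model
    uncond = torch.randn(1, 32000)
    next_token = guided_logits(cond, uncond).argmax(dim=-1)
    print("greedy token id:", next_token.item())
```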

PsyChat: A Client-Centric Dialogue System for Mental Health Support

  • paper_url: http://arxiv.org/abs/2312.04262
  • repo_url: https://github.com/qiuhuachuan/psychat
  • paper_authors: Huachuan Qiu, Anqi Li, Lizhi Ma, Zhenzhong Lan
  • for: providing online mental health support
  • methods: PsyChat, a client-centric dialogue system composed of five modules: client behavior recognition, counselor strategy selection, an input packer, a fine-tuned response generator, and response selection (a toy version of the pipeline is sketched below)
  • results: Automatic and human evaluations demonstrate the system's effectiveness and practicality for real-life mental health support; in a simulated client-virtual-counselor scenario, it predicts client behaviors, selects appropriate counselor strategies, and generates accurate, suitable responses.
    Abstract Dialogue systems are increasingly integrated into mental health support to help clients facilitate exploration, gain insight, take action, and ultimately heal themselves. For a dialogue system to be practical and user-friendly, it should be client-centric, focusing on the client's behaviors. However, existing dialogue systems publicly available for mental health support often concentrate solely on the counselor's strategies rather than the behaviors expressed by clients. This can lead to the implementation of unreasonable or inappropriate counseling strategies and corresponding responses from the dialogue system. To address this issue, we propose PsyChat, a client-centric dialogue system that provides psychological support through online chat. The client-centric dialogue system comprises five modules: client behavior recognition, counselor strategy selection, input packer, response generator intentionally fine-tuned to produce responses, and response selection. Both automatic and human evaluations demonstrate the effectiveness and practicality of our proposed dialogue system for real-life mental health support. Furthermore, we employ our proposed dialogue system to simulate a real-world client-virtual-counselor interaction scenario. The system is capable of predicting the client's behaviors, selecting appropriate counselor strategies, and generating accurate and suitable responses, as demonstrated in the scenario.
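A toy walkthrough of the five-module pipeline, as referenced above. Every module here is a placeholder rule; in the paper each stage is backed by trained models (for example, a fine-tuned response generator).

```python
# Toy PsyChat-style turn: recognize client behavior -> select counselor
# strategy -> pack the dialogue state -> generate candidates -> select one.
def recognize_behavior(utterance: str) -> str:
    return "expressing_distress" if "stressed" in utterance.lower() else "sharing_information"

def select_strategy(behavior: str) -> str:
    return {"expressing_distress": "affirmation_and_reassurance"}.get(behavior, "open_question")

def pack_input(history: list[str], behavior: str, strategy: str) -> str:
    return f"history={history!r} behavior={behavior} strategy={strategy}"

def generate_candidates(prompt: str) -> list[str]:
    # Stand-in for the fine-tuned response generator conditioned on `prompt`.
    return ["That sounds really hard. I'm here with you.",
            "Could you tell me more about what's been going on?"]

def select_response(candidates: list[str], strategy: str) -> str:
    # Stand-in ranker: prefer reassuring phrasing for reassurance strategies.
    key = "here" if "reassurance" in strategy else "?"
    return next((c for c in candidates if key in c), candidates[0])

def psychat_turn(history: list[str]) -> str:
    behavior = recognize_behavior(history[-1])
    strategy = select_strategy(behavior)
    prompt = pack_input(history, behavior, strategy)
    return select_response(generate_candidates(prompt), strategy)

if __name__ == "__main__":
    print(psychat_turn(["I've been so stressed about work lately."]))
```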

Swap distance minimization in SOV languages. Cognitive and mathematical foundations

  • paper_url: http://arxiv.org/abs/2312.04219
  • repo_url: None
  • paper_authors: Ramon Ferrer-i-Cancho, Savithry Namboodiripad
  • for: investigate the principle of swap distance minimization in the context of word order
  • methods: use word order rotation as a cognitive underpinning to test the prediction in three flexible order SOV languages
  • results: evidence of swap distance minimization found in all three languages, but weaker in Sinhalese, stronger in Korean and Malayalam.
    Abstract Distance minimization is a general principle of language. A special case of this principle in the domain of word order is swap distance minimization. This principle predicts that variations from a canonical order that are reached by fewer swaps of adjacent constituents are less costly and thus more likely. Here we investigate the principle in the context of the triple formed by subject (S), object (O) and verb (V). We introduce the concept of word order rotation as a cognitive underpinning of that prediction. When the canonical order of a language is SOV, the principle predicts SOV < SVO, OSV < VSO, OVS < VOS, in order of increasing cognitive cost. We test the prediction in three flexible order SOV languages: Korean (Koreanic), Malayalam (Dravidian), and Sinhalese (Indo-European). Evidence of swap distance minimization is found in all three languages, but it is weaker in Sinhalese. Swap distance minimization is stronger than a preference for the canonical order in Korean and especially Malayalam.
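The prediction in the abstract follows directly from counting adjacent swaps (equivalently, the Kendall tau distance between orders). The sketch below reproduces the predicted ranking from canonical SOV: SVO and OSV at distance 1, VSO and OVS at distance 2, VOS at distance 3.

```python
# Swap distance = minimum number of adjacent transpositions, computed as the
# inversion count of one order relative to the canonical one.
from itertools import permutations

def swap_distance(canonical: str, order: str) -> int:
    """Count inversions of `order` relative to `canonical` via bubble sort."""
    ranks = [canonical.index(c) for c in order]
    dist = 0
    for i in range(len(ranks)):
        for j in range(len(ranks) - 1 - i):
            if ranks[j] > ranks[j + 1]:
                ranks[j], ranks[j + 1] = ranks[j + 1], ranks[j]
                dist += 1
    return dist

if __name__ == "__main__":
    for order in ("".join(p) for p in permutations("SOV")):
        print(order, swap_distance("SOV", order))
    # SOV 0; SVO 1; OSV 1; VSO 2; OVS 2; VOS 3 -- matching the predicted ranking.
```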

Language Model Knowledge Distillation for Efficient Question Answering in Spanish

  • paper_url: http://arxiv.org/abs/2312.04193
  • repo_url: https://github.com/adrianbzg/tinyroberta-distillation-qa-es
  • paper_authors: Adrián Bazaga, Pietro Liò, Gos Micklem
  • for: improving question answering for Spanish natural language processing, particularly in resource-constrained environments
  • methods: knowledge distillation from a large RoBERTa-based model onto a lighter model (SpanishTinyRoBERTa) for greater scalability (a hedged sketch of this kind of distillation loss follows below)
  • results: Experiments show the distilled model preserves the performance of its larger counterpart while significantly increasing inference speed, opening the door to wider adoption across Spanish NLP tasks.
    Abstract Recent advances in the development of pre-trained Spanish language models has led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could be proven to be highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, whilst attaining negligible performance sacrifice. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speedup. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.
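A hedged sketch of the distillation objective referenced in the methods bullet, for an extractive QA head. The temperature, loss weighting, and shapes are illustrative assumptions, not the repository's exact recipe.

```python
# Knowledge distillation for extractive QA: the student matches the teacher's
# temperature-softened start/end span distributions plus the hard gold spans.
import torch
import torch.nn.functional as F

def qa_distill_loss(student_start, student_end, teacher_start, teacher_end,
                    gold_start, gold_end, T: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with softened teacher KL (scaled by T^2)."""
    hard = (F.cross_entropy(student_start, gold_start)
            + F.cross_entropy(student_end, gold_end))
    soft = (
        F.kl_div(F.log_softmax(student_start / T, dim=-1),
                 F.softmax(teacher_start / T, dim=-1), reduction="batchmean")
        + F.kl_div(F.log_softmax(student_end / T, dim=-1),
                   F.softmax(teacher_end / T, dim=-1), reduction="batchmean")
    ) * T * T
    return alpha * hard + (1 - alpha) * soft

if __name__ == "__main__":
    torch.manual_seed(0)
    B, L = 2, 128  # batch of 2 passages, 128 tokens each
    s_start, s_end = torch.randn(B, L), torch.randn(B, L)  # student span logits
    t_start, t_end = torch.randn(B, L), torch.randn(B, L)  # teacher span logits
    gold_s, gold_e = torch.tensor([5, 40]), torch.tensor([9, 44])
    print(qa_distill_loss(s_start, s_end, t_start, t_end, gold_s, gold_e).item())
```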

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

  • paper_url: http://arxiv.org/abs/2312.04127
  • repo_url: None
  • paper_authors: Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin
  • for: Despite extensive work on the safety mechanisms of large language models (LLMs), LLMs still generate harmful responses to malicious instructions in specific scenarios, a phenomenon known as "jailbreak attack"; this work studies that vulnerability.
  • methods: RADIAL, a novel jailbreak attack method with two steps: 1) Inherent Response Tendency Analysis, which analyzes the LLM's inherent affirmation and rejection tendencies toward real-world instructions; 2) Real-World Instructions-Driven Jailbreak, which strategically selects real-world instructions and embeds malicious instructions into them to amplify the LLM's potential to generate harmful responses.
  • results: The method achieves excellent jailbreak attack performance on three open-source human-aligned LLMs for both Chinese and English malicious instructions; detailed ablation experiments verify the core idea of "Inherent Response Tendency Analysis", and the exploration also exposes the vulnerability of LLMs to being induced into generating more detailed harmful responses in subsequent rounds of dialogue.
    Abstract Extensive work has been devoted to improving the safety mechanism of Large Language Models (LLMs). However, in specific scenarios, LLMs still generate harmful responses when faced with malicious instructions, a phenomenon referred to as "Jailbreak Attack". In our research, we introduce a novel jailbreak attack method (\textbf{RADIAL}), which consists of two steps: 1) Inherent Response Tendency Analysis: we analyze the inherent affirmation and rejection tendency of LLMs to react to real-world instructions. 2) Real-World Instructions-Driven Jailbreak: based on our analysis, we strategically choose several real-world instructions and embed malicious instructions into them to amplify the LLM's potential to generate harmful responses. On three open-source human-aligned LLMs, our method achieves excellent jailbreak attack performance for both Chinese and English malicious instructions. Besides, we guided detailed ablation experiments and verified the effectiveness of our core idea "Inherent Response Tendency Analysis". Our exploration also exposes the vulnerability of LLMs to being induced into generating more detailed harmful responses in subsequent rounds of dialogue.

Comparing Large Language Model AI and Human-Generated Coaching Messages for Behavioral Weight Loss

  • paper_url: http://arxiv.org/abs/2312.04059
  • repo_url: None
  • paper_authors: Zhuoran Huang, Michael P. Berry, Christina Chwyl, Gary Hsieh, Jing Wei, Evan M. Forman
  • for: testing whether large language model (LLM) AI can produce effective coaching messages for behavioral weight loss
  • methods: 87 adults in a weight-loss trial rated the helpfulness of ten coaching messages (five human-written, five ChatGPT-generated) on a 5-point Likert scale, provided open-ended feedback, and guessed which messages were AI-generated, across two evaluation phases.
  • results: In Phase 1, AI-generated messages were rated less helpful than human-written ones; in Phase 2, after revision, the AI messages matched the human-written ones in helpfulness (82% scoring 3 or above), and 50% were misidentified as human-written. Participants appreciated the AI's empathy and personalized suggestions but found the messages more formulaic and less authentic.
    Abstract Automated coaching messages for weight control can save time and costs, but their repetitive, generic nature may limit their effectiveness compared to human coaching. Large language model (LLM) based artificial intelligence (AI) chatbots, like ChatGPT, could offer more personalized and novel messages to address repetition with their data-processing abilities. While LLM AI demonstrates promise to encourage healthier lifestyles, studies have yet to examine the feasibility and acceptability of LLM-based BWL coaching. 87 adults in a weight-loss trial rated ten coaching messages' helpfulness (five human-written, five ChatGPT-generated) using a 5-point Likert scale, providing additional open-ended feedback to justify their ratings. Participants also identified which messages they believed were AI-generated. The evaluation occurred in two phases: messages in Phase 1 were perceived as impersonal and negative, prompting revisions for Phase 2 messages. In Phase 1, AI-generated messages were rated less helpful than human-written ones, with 66 percent receiving a helpfulness rating of 3 or higher. However, in Phase 2, the AI messages matched the human-written ones regarding helpfulness, with 82% scoring three or above. Additionally, 50% were misidentified as human-written, suggesting AI's sophistication in mimicking human-generated content. A thematic analysis of open-ended feedback revealed that participants appreciated AI's empathy and personalized suggestions but found them more formulaic, less authentic, and too data-focused. This study reveals the preliminary feasibility and acceptability of LLM AIs, like ChatGPT, in crafting potentially effective weight control coaching messages. Our findings also underscore areas for future enhancement.

Multimodal Misinformation Detection in a South African Social Media Environment

  • paper_url: http://arxiv.org/abs/2312.04052
  • repo_url: None
  • paper_authors: Amica De Jager, Vukosi Marivate, Abioudun Modupe
  • for: misinformation detection (MD) on social media, with a focus on transferring model knowledge across different cultural environments
  • methods: a multimodal MD model that uses both textual and visual elements, with bidirectional encoder representations from transformers (BERT) as the textual encoder and a residual network (ResNet) as the visual encoder (a minimal fusion sketch follows below), trained and evaluated on the Fakeddit dataset and a newly introduced South African misinformation dataset
  • results: Including South African samples in training improves performance in the South African context, and the multimodal model retains significantly more transferable knowledge than either the textual or the visual unimodal model.
    Abstract With the constant spread of misinformation on social media networks, a need has arisen to continuously assess the veracity of digital content. This need has inspired numerous research efforts on the development of misinformation detection (MD) models. However, many models do not use all information available to them and existing research contains a lack of relevant datasets to train the models, specifically within the South African social media environment. The aim of this paper is to investigate the transferability of knowledge of a MD model between different contextual environments. This research contributes a multimodal MD model capable of functioning in the South African social media environment, as well as introduces a South African misinformation dataset. The model makes use of multiple sources of information for misinformation detection, namely: textual and visual elements. It uses bidirectional encoder representations from transformers (BERT) as the textual encoder and a residual network (ResNet) as the visual encoder. The model is trained and evaluated on the Fakeddit dataset and a South African misinformation dataset. Results show that using South African samples in the training of the model increases model performance, in a South African contextual environment, and that a multimodal model retains significantly more knowledge than both the textual and visual unimodal models. Our study suggests that the performance of a misinformation detection model is influenced by the cultural nuances of its operating environment and multimodal models assist in the transferability of knowledge between different contextual environments. Therefore, local data should be incorporated into the training process of a misinformation detection model in order to optimize model performance.
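A minimal sketch of the BERT + ResNet late-fusion architecture the abstract describes. The fusion head sizes are assumptions, and stand-in random tensors replace the actual encoder outputs (a [CLS] vector from BERT, pooled features from a ResNet).

```python
# Late-fusion misinformation classifier: encode text and image separately,
# then classify the concatenated features.
import torch
import torch.nn as nn

class MultimodalMisinfoClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 2048, hidden: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # misinformation vs. not
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        return self.fusion(torch.cat([text_feats, image_feats], dim=-1))

if __name__ == "__main__":
    torch.manual_seed(0)
    model = MultimodalMisinfoClassifier()
    text = torch.randn(4, 768)    # stand-in BERT [CLS] embeddings
    image = torch.randn(4, 2048)  # stand-in ResNet pooled features
    print(model(text, image).shape)  # torch.Size([4, 2])
```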

RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training

  • paper_url: http://arxiv.org/abs/2312.04032
  • repo_url: https://github.com/bbuing9/roast
  • paper_authors: Jaehyung Kim, Yuning Mao, Rui Hou, Hanchao Yu, Davis Liang, Pascale Fung, Qifan Wang, Fuli Feng, Lifu Huang, Madian Khabsa
  • for: enhancing the multi-perspective robustness of language models (LMs)
  • methods: RoAST, which strengthens LM robustness by combining adversarial perturbation during fine-tuning with selective training, updating model parameters according to their relative importance (sketched below)
  • results: Under a unified evaluation covering four representative robustness perspectives, RoAST outperforms state-of-the-art fine-tuning methods on six different types of LMs, indicating its practical usefulness.
    Abstract Fine-tuning pre-trained language models (LMs) has become the de facto standard in many NLP tasks. Nevertheless, fine-tuned LMs are still prone to robustness issues, such as adversarial robustness and model calibration. Several perspectives of robustness for LMs have been studied independently, but lacking a unified consideration in multiple perspectives. In this paper, we propose Robustifying LMs via Adversarial perturbation with Selective Training (RoAST), a simple yet effective fine-tuning technique to enhance the multi-perspective robustness of LMs in a unified way. RoAST effectively incorporates two important sources for the model robustness, robustness on the perturbed inputs and generalizable knowledge in pre-trained LMs. To be specific, RoAST introduces adversarial perturbation during fine-tuning while the model parameters are selectively updated upon their relative importance to minimize unnecessary deviation. Under a unified evaluation of fine-tuned LMs by incorporating four representative perspectives of model robustness, we demonstrate the effectiveness of RoAST compared to state-of-the-art fine-tuning methods on six different types of LMs, which indicates its usefulness in practice.
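A hedged sketch combining RoAST's two ingredients: an adversarial perturbation of the input embeddings during fine-tuning, and a selective update that keeps only the relatively most important gradients. The FGSM-style perturbation, the top-k importance criterion, and all hyperparameters are illustrative stand-ins; see the repository for the real recipe.

```python
# One RoAST-style step: perturb input embeddings adversarially, compute the
# loss on the perturbed inputs, then apply a masked (selective) update.
import torch

def roast_step(model, embeds, labels, loss_fn, eps=1e-2, keep_ratio=0.5, lr=1e-4):
    # 1) Build an adversarial perturbation of the input embeddings.
    embeds = embeds.detach().requires_grad_(True)
    loss_fn(model(embeds), labels).backward()
    delta = eps * embeds.grad.sign()  # FGSM-style perturbation
    model.zero_grad()

    # 2) Compute the loss on the perturbed inputs.
    loss = loss_fn(model(embeds.detach() + delta), labels)
    loss.backward()

    # 3) Selective update: keep only the largest-magnitude gradients per tensor.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            g = p.grad.abs().flatten()
            k = max(1, int(keep_ratio * g.numel()))
            threshold = torch.topk(g, k).values.min()
            mask = (p.grad.abs() >= threshold).float()
            p -= lr * p.grad * mask  # masked SGD step
        model.zero_grad()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    toy = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, 2))
    x, y = torch.randn(4, 8), torch.tensor([0, 1, 0, 1])
    print(roast_step(toy, x, y, torch.nn.functional.cross_entropy))
```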