cs.CL - 2023-12-04

New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking

  • paper_url: http://arxiv.org/abs/2312.02382
  • repo_url: None
  • paper_authors: Karanpartap Singh, James Zou
  • for: Evaluating the quality of watermarking for large language models (LLMs).
  • methods: Evaluation by an LLM judger following specific guidelines; binary classification on text embeddings to distinguish watermarked from unwatermarked text.
  • results: Existing watermarking methods are easily detected; watermarking degrades text quality, particularly the coherence and depth of responses; richer metrics are needed to capture the various flaws of watermarking.
    Abstract With the increasing use of large-language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by LLM-judger with specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also found, through the LLM judger, that watermarking impacts text quality, especially in degrading the coherence and depth of the response. Our findings underscore the trade-off between watermark robustness and text quality and highlight the importance of having more informative metrics to assess watermarking quality.
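The paper's second evaluation method, a binary classifier over text embeddings, is simple to reproduce in outline. Below is a minimal sketch assuming a generic sentence-embedding model and logistic regression; both choices are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: detect watermarked text with a binary classifier on embeddings.
# Assumptions: `watermarked` and `unwatermarked` are lists of generated strings;
# the embedding model and classifier stand in for the paper's setup.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def watermark_detectability(watermarked, unwatermarked):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    texts = watermarked + unwatermarked
    labels = [1] * len(watermarked) + [0] * len(unwatermarked)
    X = encoder.encode(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # AUROC near 0.5 means the watermark is statistically subtle;
    # near 1.0 means even a simple classifier can detect it.
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```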

Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings

  • paper_url: http://arxiv.org/abs/2312.02337
  • repo_url: None
  • paper_authors: Gyandev Gupta, Bashir Rastegarpanah, Amalendu Iyer, Joshua Rubin, Krishnaram Kenthapadi
  • for: Quantifying distributional shifts in natural language data.
  • methods: A clustering-based algorithm that exploits text embeddings to measure distributional shifts.
  • results: Experiments show that general-purpose LLM-based embeddings are markedly more sensitive to data drift than other embedding methods; the authors also propose "drift sensitivity" as an important metric for comparing language models.
    Abstract An essential part of monitoring machine learning models in production is measuring input and output data drift. In this paper, we present a system for measuring distributional shifts in natural language data and highlight and investigate the potential advantage of using large language models (LLMs) for this problem. Recent advancements in LLMs and their successful adoption in different domains indicate their effectiveness in capturing semantic relationships for solving various natural language processing problems. The power of LLMs comes largely from the encodings (embeddings) generated in the hidden layers of the corresponding neural network. First we propose a clustering-based algorithm for measuring distributional shifts in text data by exploiting such embeddings. Then we study the effectiveness of our approach when applied to text embeddings generated by both LLMs and classical embedding algorithms. Our experiments show that general-purpose LLM-based embeddings provide a high sensitivity to data drift compared to other embedding methods. We propose drift sensitivity as an important evaluation metric to consider when comparing language models. Finally, we present insights and lessons learned from deploying our framework as part of the Fiddler ML Monitoring platform over a period of 18 months.
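The clustering-based drift measure can be illustrated with a short sketch. The cluster count, the KMeans clusterer, and the Jensen-Shannon comparison below are assumptions standing in for the paper's algorithm, which is not specified here.

```python
# Sketch: a clustering-based drift score over text embeddings.
# One plausible instantiation of the paper's idea; the cluster count and the
# Jensen-Shannon comparison are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def drift_score(ref_embeddings, prod_embeddings, n_clusters=20):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(ref_embeddings)

    def histogram(X):
        counts = np.bincount(km.predict(X), minlength=n_clusters)
        return counts / counts.sum()

    # Distance between cluster-occupancy distributions: 0 = no drift.
    return jensenshannon(histogram(ref_embeddings), histogram(prod_embeddings))
```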

Revisiting Topic-Guided Language Models

  • paper_url: http://arxiv.org/abs/2312.02331
  • repo_url: https://github.com/carolinazheng/revisiting-tglms
  • paper_authors: Carolina Zheng, Keyon Vafa, David M. Blei
  • for: Comparing the effectiveness of methods that combine language models and topic models.
  • methods: Four topic-guided language models and two baselines, each evaluated on held-out predictive performance across four corpora.
  • results: None of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics; a probe trained on the neural language model further shows that the baseline's hidden states already encode topic information.
    Abstract A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.
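The probing experiment, which tests whether the baseline LM's hidden states already encode topic information, amounts to fitting a linear classifier on hidden states. A minimal sketch, assuming `hidden_states` and `topic_labels` have been extracted beforehand:

```python
# Sketch: a linear probe testing whether LM hidden states encode topic labels.
# `hidden_states` (n_docs x d) and `topic_labels` are placeholders for states
# from the trained baseline LM and an external topic assignment.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def topic_probe_accuracy(hidden_states, topic_labels):
    probe = LogisticRegression(max_iter=2000)
    # High held-out accuracy indicates the hidden states already encode
    # topic information, as the paper reports for its LSTM baseline.
    return cross_val_score(probe, hidden_states, topic_labels, cv=5).mean()
```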

When it Rains, it Pours: Modeling Media Storms and the News Ecosystem

  • paper_url: http://arxiv.org/abs/2312.02118
  • repo_url: https://github.com/blitt2018/mediastorms
  • paper_authors: Benjamin Litterer, David Jurgens, Dallas Card
  • for: Studying the evolution and topical distribution of media storms.
  • methods: A pairwise article similarity model is used to identify story clusters in local and national online news, yielding a comprehensive corpus of media storms spanning nearly two years.
  • results: Validates claims about storm evolution and topical distribution, and provides empirical support for hypothesized patterns of storm influence on media coverage and intermedia agenda setting.
    Abstract Most events in the world receive at most brief coverage by the news media. Occasionally, however, an event will trigger a media storm, with voluminous and widespread coverage lasting for weeks instead of days. In this work, we develop and apply a pairwise article similarity model, allowing us to identify story clusters in corpora covering local and national online news, and thereby create a comprehensive corpus of media storms over a nearly two year period. Using this corpus, we investigate media storms at a new level of granularity, allowing us to validate claims about storm evolution and topical distribution, and provide empirical support for previously hypothesized patterns of influence of storms on media coverage and intermedia agenda setting.
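The pipeline's first step, grouping articles into story clusters from pairwise similarity, can be sketched as thresholded similarity edges plus connected components. The paper trains a dedicated pairwise similarity model; the generic encoder and fixed threshold below are illustrative assumptions.

```python
# Sketch: story clustering from pairwise article similarity.
# A generic sentence encoder and threshold stand in for the paper's
# trained pairwise similarity model.
import networkx as nx
from sentence_transformers import SentenceTransformer, util

def story_clusters(articles, threshold=0.8):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode(articles, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)  # pairwise cosine similarities
    g = nx.Graph()
    g.add_nodes_from(range(len(articles)))
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if sim[i, j] >= threshold:
                g.add_edge(i, j)
    # Each connected component is one story cluster.
    return list(nx.connected_components(g))
```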

A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia

  • paper_url: http://arxiv.org/abs/2312.02073
  • repo_url: None
  • paper_authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Hamid Palangi, Barun Patra, Robert West
  • for: Studying how accurately and flexibly large language models (LLMs) ground their answers in novel in-context information.
  • methods: A counterfactual dataset (Fakepedia) is used to evaluate grounding when in-context information contradicts parametric knowledge; various LLMs are benchmarked and compared, followed by causal mediation analysis of model components.
  • results: GPT-4-turbo shows a strong preference for its parametric knowledge, while Mistral-7B most robustly chooses the grounded answer; inspection of the computational graph alone predicts grounding behavior with 92.8% accuracy, notably because a few MLPs in the Transformer predict non-grounded behavior.
    Abstract Large language models (LLMs) have demonstrated impressive capabilities in storing and recalling factual knowledge, but also in adapting to novel in-context information. Yet, the mechanisms underlying their in-context grounding remain unknown, especially in situations where in-context information contradicts factual knowledge embedded in the parameters. This is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify the outdated parametric knowledge. In this study, we introduce Fakepedia, a counterfactual dataset designed to evaluate grounding abilities when the parametric knowledge clashes with the in-context information. We benchmark various LLMs with Fakepedia and discover that GPT-4-turbo has a strong preference for its parametric knowledge. Mistral-7B, on the contrary, is the model that most robustly chooses the grounded answer. Then, we conduct causal mediation analysis on LLM components when answering Fakepedia queries. We demonstrate that inspection of the computational graph alone can predict LLM grounding with 92.8% accuracy, especially because few MLPs in the Transformer can predict non-grounded behavior. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.
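A Fakepedia-style grounding measurement reduces to checking whether the model follows the counterfactual context or its parametric knowledge. A minimal sketch, where `ask_llm` and the dataset fields are hypothetical placeholders rather than the released schema:

```python
# Sketch: measuring grounding on counterfactual (Fakepedia-style) examples.
# `ask_llm` is a hypothetical stand-in for any chat/completions call;
# the example fields are illustrative.
def grounding_rate(examples, ask_llm):
    grounded = 0
    for ex in examples:
        prompt = (f"{ex['counterfactual_passage']}\n\n"
                  f"Question: {ex['question']}\nAnswer:")
        answer = ask_llm(prompt)
        # Grounded: the model follows the in-context (counterfactual) fact
        # rather than its parametric knowledge.
        if ex["counterfactual_answer"].lower() in answer.lower():
            grounded += 1
    return grounded / len(examples)
```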

Recursive Visual Programming

  • paper_url: http://arxiv.org/abs/2312.02249
  • repo_url: https://github.com/amrutabuge/recursive
  • paper_authors: Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell
  • for: Improving the effectiveness and interpretability of Visual Question Answering (VQA) by generating and executing question-specific code.
  • methods: Inspired by human coding practices, an iterative, recursive code-generation approach decomposes complex questions into smaller parts, improving problem-solving efficiency and code readability.
  • results: Extensive experiments on several benchmark datasets show that RVP solves VQA tasks more effectively and handles more complex data structures.
    Abstract Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP is inspired by human coding practices and approaches VQA tasks with an iterative recursive code generation approach, allowing decomposition of complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment, i.e., as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to generate that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
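The core of RVP, recursive question-specific code generation, can be outlined in a few lines. This is a heavily simplified sketch: `generate_code` is a hypothetical LLM call, and the exec-based execution and `answer(image)` convention are assumptions, not the paper's interface.

```python
# Sketch: the recursive code-generation loop behind RVP, heavily simplified.
# `generate_code` is a hypothetical LLM call returning Python source that
# defines answer(image); the solve() helper exposed to generated code is
# illustrative.
def recursive_solve(question, image, generate_code, depth=0, max_depth=3):
    scope = {}
    if depth < max_depth:
        # Generated code may call solve() on simpler sub-questions,
        # recursing with the same machinery (dynamic return types: each
        # level decides whether it yields a str, int, bool, ...).
        scope["solve"] = lambda q: recursive_solve(
            q, image, generate_code, depth + 1, max_depth)
    src = generate_code(question)   # LLM writes code for this sub-question
    exec(src, scope)                # generated code must define answer(image)
    return scope["answer"](image)
```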

Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective

  • paper_url: http://arxiv.org/abs/2312.01957
  • repo_url: https://github.com/vicgalle/distilled-self-critique
  • paper_authors: Victor Gallego
  • for: Interpreting RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines an LLM's outputs through a Gibbs sampler that is later distilled into a fine-tuned model.
  • methods: Training requires only synthetic data; the critique loop is distilled into the model to achieve alignment.
  • results: Experiments on safety, sentiment, and privacy control show that dSC can be a viable and cheap alternative for aligning LLMs.
    Abstract This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of a LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Only requiring synthetic data, dSC is exercised in experiments regarding safety, sentiment, and privacy control, showing it can be a viable and cheap alternative to align LLMs. Code released at https://github.com/vicgalle/distilled-self-critique.
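The self-critique loop that dSC distills can be sketched as iterated generate-critique-revise rounds whose outputs become synthetic fine-tuning data. The `llm` callable, round count, and prompt wording below are assumptions, not the released implementation.

```python
# Sketch: the self-critique loop, read as a Gibbs-style sampler.
# `llm` is a hypothetical generate(prompt) call; rounds and prompt
# wording are illustrative.
def self_critique_sample(prompt, llm, rounds=3):
    response = llm(prompt)
    for _ in range(rounds):
        critique = llm(f"Critique this response for safety and quality:\n"
                       f"{response}")
        response = llm(f"Prompt: {prompt}\nCritique: {critique}\n"
                       f"Rewrite the response addressing the critique:")
    # (prompt, refined response) pairs become synthetic data for
    # fine-tuning, distilling the critique loop into a single model.
    return prompt, response
```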

Zero- and Few-Shots Knowledge Graph Triplet Extraction with Large Language Models

  • paper_url: http://arxiv.org/abs/2312.01954
  • repo_url: None
  • paper_authors: Andrea Papaluca, Daniel Krefl, Sergio Mendez Rodriguez, Artem Lensky, Hanna Suominen
  • for: Testing the Triplet Extraction (TE) capabilities of large language models (LLMs) of different sizes in zero- and few-shot settings.
  • methods: A pipeline that dynamically gathers contextual information from a Knowledge Base (KB), both as context triplets and as (sentence, triplets) example pairs, and provides it to the LLM through a prompt.
  • results: The added KB context markedly improves TE, in some cases making the LLMs competitive with fully trained baselines based on the BiLSTM network architecture; model size appears to improve TE capabilities only logarithmically.
    Abstract In this work, we tested the Triplet Extraction (TE) capabilities of a variety of Large Language Models (LLMs) of different sizes in the Zero- and Few-Shots settings. In detail, we proposed a pipeline that dynamically gathers contextual information from a Knowledge Base (KB), both in the form of context triplets and of (sentence, triplets) pairs as examples, and provides it to the LLM through a prompt. The additional context allowed the LLMs to be competitive with all the older fully trained baselines based on the Bidirectional Long Short-Term Memory (BiLSTM) Network architecture. We further conducted a detailed analysis of the quality of the gathered KB context, finding it to be strongly correlated with the final TE performance of the model. In contrast, the size of the model appeared to only logarithmically improve the TE capabilities of the LLMs.
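The prompt-construction step of the pipeline can be sketched directly: retrieved context triplets and (sentence, triplets) examples are serialized into the prompt. The retrieval hooks below are hypothetical stand-ins for the paper's KB queries.

```python
# Sketch: building the TE prompt from dynamically gathered KB context.
# `retrieve_triplets` and `retrieve_examples` are hypothetical hooks into
# the paper's KB retrieval; formats are illustrative.
def build_te_prompt(sentence, retrieve_triplets, retrieve_examples, k=5):
    context = retrieve_triplets(sentence, k)    # [(head, rel, tail), ...]
    examples = retrieve_examples(sentence, k)   # [(sent, triplets), ...]
    lines = ["Known facts from the knowledge base:"]
    lines += [f"({h}, {r}, {t})" for h, r, t in context]
    lines.append("\nExamples:")
    for ex_sent, ex_triplets in examples:
        lines.append(f"Sentence: {ex_sent}\nTriplets: {ex_triplets}")
    lines.append(f"\nSentence: {sentence}\nTriplets:")
    return "\n".join(lines)
```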

A Machine Learning Approach Towards SKILL Code Autocompletion

  • paper_url: http://arxiv.org/abs/2312.01921
  • repo_url: None
  • paper_authors: Enrique Dehaerne, Bappaditya Dey, Wannes Meert
  • for: Advancing Electronic Design Automation (EDA) to meet global demand by improving the productivity of hardware design engineers.
  • methods: Transformer-based code-generation models applied to SKILL code autocompletion.
  • results: A data-efficient methodology covering the creation of a high-quality SKILL dataset, fine-tuning T5 models (pre-trained on general programming-language code) with unsupervised and supervised learning, and evaluating the synthesized SKILL code. Models trained this way outperform baselines on human-judgment and BLEU scores, although the extremely small amount of available SKILL code remains a major limitation for training.
    Abstract As Moore's Law continues to increase the complexity of electronic systems, Electronic Design Automation (EDA) must advance to meet global demand. An important example of an EDA technology is SKILL, a scripting language used to customize and extend EDA software. Recently, code generation models using the transformer architecture have achieved impressive results in academic settings and have even been used in commercial developer tools to improve developer productivity. To the best of our knowledge, this study is the first to apply transformers to SKILL code autocompletion towards improving the productivity of hardware design engineers. In this study, a novel, data-efficient methodology for generating SKILL code is proposed and experimentally validated. More specifically, we propose a novel methodology for (i) creating a high-quality SKILL dataset with both unlabeled and labeled data, (ii) a training strategy where T5 models pre-trained on general programming language code are fine-tuned on our custom SKILL dataset using unsupervised and supervised learning, and (iii) evaluating synthesized SKILL code. We show that models trained using the proposed methodology outperform baselines in terms of human-judgment score and BLEU score. A major challenge faced was the extremely small amount of available SKILL code data that can be used to train a transformer model to generate SKILL code. Despite our validated improvements, the extremely small dataset available to us was still not enough to train a model that can reliably autocomplete SKILL code. We discuss this and other limitations as well as future work that could address these limitations.
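The supervised fine-tuning stage is standard seq2seq training on (code prefix, completion) pairs. A minimal sketch with Hugging Face transformers; the t5-small checkpoint and the exact pair framing are assumptions, since the paper starts from T5 models pre-trained on general programming-language code.

```python
# Sketch: one supervised fine-tuning step for SKILL autocompletion with T5.
# Checkpoint name and the (prefix -> completion) framing are assumptions.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train_step(code_prefix: str, completion: str):
    inputs = tokenizer(code_prefix, return_tensors="pt", truncation=True)
    labels = tokenizer(completion, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```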

Evaluating Dependencies in Fact Editing for Language Models: Specificity and Implication Awareness

  • paper_url: http://arxiv.org/abs/2312.01858
  • repo_url: None
  • paper_authors: Zichao Li, Ines Arous, Siva Reddy, Jackie C. K. Cheung
  • for: Investigating the use of large language models (LLMs) as knowledge bases (KBs), and ensuring that edits to learned facts respect internal logical constraints (the dependency of knowledge).
  • methods: An evaluation protocol with an accompanying question-answering dataset, DepEdit, that comprehensively assesses the editing process with respect to such dependencies; facts are edited in a controlled environment while monitoring their impact and their implications under If-Then rules.
  • results: Existing knowledge-editing methods are sensitive to the surface form of knowledge and have limited ability to infer the implications of edited facts.
    Abstract The potential of using a large language model (LLM) as a knowledge base (KB) has sparked significant interest. To manage the knowledge acquired by LLMs, we need to ensure that the editing of learned facts respects internal logical constraints, which are known as dependency of knowledge. Existing work on editing LLMs has partially addressed the issue of dependency, when the editing of a fact should apply to its lexical variations without disrupting irrelevant ones. However, they neglect the dependency between a fact and its logical implications. We propose an evaluation protocol with an accompanying question-answering dataset, DepEdit, that provides a comprehensive assessment of the editing process considering the above notions of dependency. Our protocol involves setting up a controlled environment in which we edit facts and monitor their impact on LLMs, along with their implications based on If-Then rules. Extensive experiments on DepEdit show that existing knowledge editing methods are sensitive to the surface form of knowledge, and that they have limited performance in inferring the implications of edited facts.
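The protocol's implication check can be sketched as: apply an edit, then query the edited model on questions derived from If-Then rules. `apply_edit` and `answer` below are hypothetical hooks into a knowledge-editing method; the rule format is illustrative.

```python
# Sketch: checking that an edit's If-Then implications hold, DepEdit-style.
# `apply_edit` and `answer` are hypothetical hooks; the rule format
# is an assumption.
def implication_accuracy(model, edit, rules, apply_edit, answer):
    edited = apply_edit(model, edit)  # e.g. edit = ("France", "capital", "Lyon")
    correct = 0
    for rule in rules:
        # rule: {"question": ..., "expected": ...} derived from an
        # If-Then implication of the edited fact.
        if rule["expected"].lower() in answer(edited, rule["question"]).lower():
            correct += 1
    return correct / len(rules)
```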

Prompting Disentangled Embeddings for Knowledge Graph Completion with Pre-trained Language Model

  • paper_url: http://arxiv.org/abs/2312.01837
  • repo_url: https://github.com/genggengcss/pdkgc
  • paper_authors: Yuxia Geng, Jiaoyan Chen, Yuhang Zeng, Zhuo Chen, Wen Zhang, Jeff Z. Pan, Yuxiang Wang, Xiaoliang Xu
  • for: Applying pre-trained language models (PLMs) to knowledge graph completion (KGC).
  • methods: A new KGC method named PDKGC with two prompts: a hard task prompt that adapts KGC to the PLM's token-prediction pre-training task, and a disentangled structure prompt that learns disentangled graph representations. Both prompts are trained against a frozen PLM, building a textual predictor and a structural predictor whose combination yields more comprehensive entity prediction.
  • results: Solid evaluation on two widely used KGC datasets shows that PDKGC often outperforms the baselines, including the state of the art, and that all of its components are effective. Code and data are available at https://github.com/genggengcss/PDKGC.
    Abstract Both graph structures and textual information play a critical role in Knowledge Graph Completion (KGC). With the success of Pre-trained Language Models (PLMs) such as BERT, they have been applied for text encoding for KGC. However, the current methods mostly prefer to fine-tune PLMs, leading to huge training costs and limited scalability to larger PLMs. In contrast, we propose to utilize prompts and perform KGC on a frozen PLM with only the prompts trained. Accordingly, we propose a new KGC method named PDKGC with two prompts -- a hard task prompt which is to adapt the KGC task to the PLM pre-training task of token prediction, and a disentangled structure prompt which learns disentangled graph representation so as to enable the PLM to combine more relevant structure knowledge with the text information. With the two prompts, PDKGC builds a textual predictor and a structural predictor, respectively, and their combination leads to more comprehensive entity prediction. Solid evaluation on two widely used KGC datasets has shown that PDKGC often outperforms the baselines including the state-of-the-art, and its components are all effective. Our codes and data are available at https://github.com/genggengcss/PDKGC.
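PDKGC's key design, training only prompts against a frozen PLM, builds on standard soft-prompt tuning. A minimal sketch of that mechanism for a BERT-style encoder; the prompt length, initialization, and hidden_size attribute are assumptions, and PDKGC's actual hard task prompt and disentangled structure prompt are more elaborate.

```python
# Sketch: trainable soft prompts prepended to a frozen PLM, the general
# mechanism prompt-based KGC builds on. Dimensions are illustrative.
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    def __init__(self, plm, prompt_len=16):
        super().__init__()
        self.plm = plm
        for p in self.plm.parameters():        # freeze the PLM
            p.requires_grad = False
        d = plm.config.hidden_size             # BERT-style config assumed
        self.prompt = nn.Parameter(torch.randn(prompt_len, d) * 0.02)

    def forward(self, input_ids):
        tok_emb = self.plm.get_input_embeddings()(input_ids)
        prompt = self.prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        return self.plm(inputs_embeds=inputs_embeds)
```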

Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication

  • paper_url: http://arxiv.org/abs/2312.01823
  • repo_url: https://github.com/yinzhangyue/eot
  • paper_authors: Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, Xipeng Qiu
  • for: Improving LLM performance on complex reasoning tasks by supplementing their intrinsic understanding with external insights.
  • methods: A novel framework, Exchange-of-Thought (EoT), that enables cross-model communication during problem solving.
  • results: Experiments show that EoT significantly surpasses established baselines on a variety of complex reasoning tasks while remaining cost-effective.
    Abstract Large Language Models (LLMs) have recently made significant strides in complex reasoning tasks through the Chain-of-Thought technique. Despite this progress, their reasoning is often constrained by their intrinsic understanding, lacking external insights. To address this, we propose Exchange-of-Thought (EoT), a novel framework that enables cross-model communication during problem-solving. Drawing inspiration from network topology, EoT integrates four unique communication paradigms: Memory, Report, Relay, and Debate. This paper delves into the communication dynamics and volume associated with each paradigm. To counterbalance the risks of incorrect reasoning chains, we implement a robust confidence evaluation mechanism within these communications. Our experiments across diverse complex reasoning tasks demonstrate that EoT significantly surpasses established baselines, underscoring the value of external insights in enhancing LLM performance. Furthermore, we show that EoT achieves these superior results in a cost-effective manner, marking a promising advancement for efficient and collaborative AI problem-solving.
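One communication round of an EoT-style exchange, with a confidence filter to limit the spread of incorrect reasoning chains, can be sketched as below. The `ask` callable and confidence heuristic are hypothetical; the paper defines four distinct topologies (Memory, Report, Relay, Debate) rather than this single generic round.

```python
# Sketch: one generic EoT-style exchange round with a confidence filter.
# `ask(model_name, prompt)` and `confidence(answer)` are hypothetical hooks.
def exchange_round(question, models, ask, confidence):
    answers = {name: ask(name, question) for name in models}
    revised = {}
    for name in models:
        # Share only peers' answers that pass the confidence check,
        # limiting propagation of wrong reasoning chains.
        peer_msgs = [f"{p}: {a}" for p, a in answers.items()
                     if p != name and confidence(a) > 0.5]
        prompt = (f"{question}\n\nOther models answered:\n"
                  + "\n".join(peer_msgs)
                  + "\n\nReconsider and give your final answer.")
        revised[name] = ask(name, prompt)
    return revised
```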

Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

  • paper_url: http://arxiv.org/abs/2312.01714
  • repo_url: None
  • paper_authors: Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su, Longyue Wang
  • for: This paper aims to improve the performance of large language models (LLMs) in multi-modal question answering tasks by addressing the challenge of selecting optimal chain of thought (CoT) demonstration examples.
  • methods: The proposed approach uses retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal similarities, and employs a stratified sampling method to promote the diversity of demonstration examples.
  • results: The proposed approach significantly improves the performance of LLMs in multi-modal reasoning tasks, achieving state-of-the-art results on the ScienceQA dataset. Specifically, the ChatGPT-based approach outperforms Chameleon (ChatGPT) by 2.74% and the GPT-4-based approach surpasses Chameleon (GPT-4) by 0.89%. The best performing model shows a 6.05% increase over Chameleon for ChatGPT-based models and a 4.57% increase for GPT-4-based models.
    Abstract The advancement of Large Language Models (LLMs) has brought substantial attention to the Chain of Thought (CoT) approach, primarily due to its ability to enhance the capability of LLMs on tasks requiring complex reasoning. Moreover, the significance of CoT approaches extends to the application of LLMs for multi-modal tasks, such as multi-modal question answering. However, the selection of optimal CoT demonstration examples in multi-modal reasoning for LLMs remains less explored due to the inherent complexity of multi-modal examples. In this paper, we introduce a novel approach that addresses this challenge by using retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal similarities. This method aims to refine the CoT reasoning process in multi-modal scenarios via informing LLMs with more relevant and informative examples. Furthermore, we employ a stratified sampling method categorising demonstration examples into groups based on their types and retrieving examples from different groups respectively to promote the diversity of demonstration examples. Through a series of experiments, we demonstrate that our approach significantly improves the performance of LLMs, achieving state-of-the-art results in multi-modal reasoning tasks. Specifically, our methods demonstrate significant advancements on the ScienceQA dataset. While our method based on ChatGPT outperforms the Chameleon (ChatGPT) by 2.74% with an accuracy of 82.67%, the GPT4-based approach surpasses the Chameleon (GPT-4) by 0.89%, achieving 87.43% on accuracy under the same setting. Moreover, our best-performing models show a 6.05% increase over Chameleon for ChatGPT-based models and a 4.57% increase for GPT-4-based models.
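The demonstration-selection step, retrieval by cross-modal similarity plus stratified sampling over example types, can be sketched as below. The embedding field, type grouping, and per-group quota are illustrative assumptions.

```python
# Sketch: retrieval-based CoT demonstration selection with stratified
# sampling. Pool schema and per-group quota are illustrative assumptions.
import numpy as np

def select_demos(query_emb, pool, per_group=2):
    # pool: list of {"embedding": np.ndarray, "type": str, "demo": str}
    by_type = {}
    for item in pool:
        by_type.setdefault(item["type"], []).append(item)
    selected = []
    for group in by_type.values():
        sims = [float(np.dot(query_emb, it["embedding"])) for it in group]
        order = np.argsort(sims)[::-1][:per_group]  # most similar per stratum
        selected += [group[i]["demo"] for i in order]
    return selected  # diverse, similarity-ranked CoT demonstrations
```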

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites

  • paper_url: http://arxiv.org/abs/2312.01701
  • repo_url: https://github.com/anonymousanoy/fohe
  • paper_authors: Lei Wang, Jiabang He, Shenshen Li, Ning Liu, Ee-Peng Lim
  • for: Reducing fine-grained object hallucination (OH) in large vision-language models (LVLMs).
  • methods: A framework, ReCaption, with two components: rewriting captions using ChatGPT, and fine-tuning instruction-tuned LVLMs on the rewritten captions.
  • results: Experiments show that ReCaption effectively reduces fine-grained object hallucination across different LVLM options and improves their text generation quality.
    Abstract Large language models (LLMs) have shown remarkable performance in natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucinations. Nevertheless, LVLMs are evaluated for coarse-grained object hallucinations only (i.e., generated objects non-existent in the input image). The fine-grained object attributes and behaviors non-existent in the image may still be generated but not measured by the current evaluation methods. In this paper, we thus focus on reducing fine-grained hallucinations of LVLMs. We propose ReCaption, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named Fine-Grained Object Hallucination Evaluation (FGHE). Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality. The code can be found at https://github.com/Anonymousanoy/FOHE.
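The first ReCaption component, caption rewriting, is a single instruction to a ChatGPT-style model. A minimal sketch where `chat` is a hypothetical API call and the instruction wording is an assumption:

```python
# Sketch: the caption-rewriting half of ReCaption. `chat` is a hypothetical
# call to a ChatGPT-style API; the instruction wording is an assumption.
def rewrite_caption(caption, chat):
    prompt = ("Rewrite this image caption so every object, attribute, and "
              "behavior it mentions is explicit and internally consistent. "
              "Do not add objects that are not mentioned.\n\n"
              f"Caption: {caption}\nRewritten caption:")
    return chat(prompt)

# The rewritten (image, caption) pairs then feed standard supervised
# fine-tuning of the instruction-tuned LVLM.
```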

Voice-Based Smart Assistant System for Vehicles using RASA

  • paper_url: http://arxiv.org/abs/2312.01642
  • repo_url: None
  • paper_authors: Aditya Paranjape, Yash Patwardhan, Vedant Deshpande, Aniket Darp, Jayashree Jagdale
  • for: Developing a voice-based smart assistant application for vehicles on the RASA framework, automating in-vehicle tasks to improve road safety and the driving experience.
  • methods: The application is built on the RASA framework with speech recognition and natural language processing, letting users complete tasks such as navigation, calls, weather forecasts, news updates, and music entirely by voice.
  • results: Replacing manual interactions with voice commands reduces driver distraction, improving road safety and the driving experience.
    Abstract Conversational AIs, or chatbots, mimic human speech when conversing. Smart assistants facilitate the automation of several tasks that needed human intervention earlier. Because of their accuracy, absence of dependence on human resources, and accessibility around the clock, chatbots can be employed in vehicles too. Due to people's propensity to divert their attention away from the task of driving while engaging in other activities like calling, playing music, navigation, and getting updates on the weather forecast and latest news, road safety has declined and accidents have increased as a result. It would be advantageous to automate these tasks using voice commands rather than carrying them out manually. This paper focuses on the development of a voice-based smart assistance application for vehicles based on the RASA framework. The smart assistant provides functionalities like navigation, communication via calls, getting weather forecasts and the latest news updates, and music that are completely voice-based in nature.
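A minimal voice-command loop against a running RASA server might look as follows: transcribe speech, post it to RASA's standard REST channel, and read back the replies. The recognizer choice and server address are deployment assumptions, not details from the paper.

```python
# Sketch: forwarding transcribed voice commands to a RASA server over its
# REST channel. Recognizer and server address are deployment assumptions.
import requests
import speech_recognition as sr

RASA_URL = "http://localhost:5005/webhooks/rest/webhook"

def voice_command_loop():
    recognizer = sr.Recognizer()
    with sr.Microphone() as mic:
        audio = recognizer.listen(mic)
    text = recognizer.recognize_google(audio)   # speech -> text
    reply = requests.post(RASA_URL, json={"sender": "driver", "message": text})
    for msg in reply.json():
        print(msg.get("text", ""))              # assistant's reply messages
```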

Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment

  • paper_url: http://arxiv.org/abs/2312.01592
  • repo_url: None
  • paper_authors: Cong-Duy Nguyen, The-Anh Vu-Le, Thong Nguyen, Tho Quan, Luu Anh Tuan
  • for: Enhancing BERT representations through visually grounded language learning.
  • methods: GroundedBERT, a grounded language-learning method that combines contextual representations learned from language corpora with visual information learned from visually grounded datasets; Optimal Transport (OT), specifically its partial variant, solves the fractional alignment problem between the two modalities.
  • results: The proposed method significantly outperforms baseline language models on various language tasks from the GLUE and SQuAD datasets.
    Abstract Language models have been supervised with both language-only objective and visual grounding in existing studies of visual-grounded language learning. However, due to differences in the distribution and scale of visual-grounded datasets and language corpora, the language model tends to mix up the context of the tokens that occurred in the grounded data with those that do not. As a result, during representation learning, there is a mismatch between the visual information and the contextual meaning of the sentence. To overcome this limitation, we propose GroundedBERT - a grounded language learning method that enhances the BERT representation with visually grounded information. GroundedBERT comprises two components: (i) the original BERT which captures the contextual representation of words learned from the language corpora, and (ii) a visual grounding module which captures visual information learned from visual-grounded datasets. Moreover, we employ Optimal Transport (OT), specifically its partial variant, to solve the fractional alignment problem between the two modalities. Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
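The partial optimal transport step can be sketched with the POT library: transporting only a fraction of the total mass lets tokens without a visual counterpart stay unaligned. The mass fraction m, uniform marginals, and squared-Euclidean cost below are illustrative assumptions.

```python
# Sketch: partial optimal transport between token and visual-region features,
# the alignment tool GroundedBERT builds on. Inputs are NumPy feature
# matrices; m, marginals, and cost choice are assumptions.
import numpy as np
import ot

def partial_alignment(text_feats, visual_feats, m=0.5):
    n, k = len(text_feats), len(visual_feats)
    a = np.full(n, 1.0 / n)                 # uniform mass on tokens
    b = np.full(k, 1.0 / k)                 # uniform mass on regions
    M = ot.dist(text_feats, visual_feats)   # squared-Euclidean cost matrix
    # Transport only a fraction m of the mass: tokens without a visual
    # counterpart are allowed to stay unaligned.
    return ot.partial.partial_wasserstein(a, b, M, m=m)
```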

Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation

  • paper_url: http://arxiv.org/abs/2312.02227
  • repo_url: None
  • paper_authors: Cong-Duy Nguyen, Thong Nguyen, Duc Anh Vu, Luu Anh Tuan
  • for: Multimodal sentiment analysis, addressing the limitations of previous methods in capturing the variation in sentiment scores within the same class and the significance of unimodal representations in the fusion vector.
  • methods: A framework called Supervised Angular Margin-based Contrastive Learning for Multimodal Sentiment Analysis, which enhances the discrimination and generalizability of the multimodal representation and overcomes biases in the fusion vector's modality.
  • results: Experimental results, along with visualizations on two widely used datasets, demonstrate the effectiveness of the proposed approach.
    Abstract The effectiveness of a model is heavily reliant on the quality of the fusion representation of multiple modalities in multimodal sentiment analysis. Moreover, each modality is extracted from raw input and integrated with the rest to construct a multimodal representation. Although previous methods have proposed multimodal representations and achieved promising results, most of them focus on forming positive and negative pairs, neglecting the variation in sentiment scores within the same class. Additionally, they fail to capture the significance of unimodal representations in the fusion vector. To address these limitations, we introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis. This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality. Our experimental results, along with visualizations on two widely used datasets, demonstrate the effectiveness of our approach.
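One plausible form of a supervised angular margin-based contrastive loss is sketched below, following ArcFace-style practice of adding a margin to the angle of positive pairs; the paper's exact formulation may differ, so treat the margin placement and temperature as assumptions.

```python
# Sketch: a supervised angular-margin contrastive loss over fusion vectors.
# One plausible form; margin placement and temperature are assumptions.
import torch
import torch.nn.functional as F

def angular_margin_contrastive(z, labels, margin=0.2, temp=0.1):
    z = F.normalize(z, dim=1)
    cos = (z @ z.t()).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)
    # Add the angular margin only to positive pairs: same-class fusion
    # vectors must be tighter than the margin to score well.
    logits = torch.cos(theta + margin * pos_mask) / temp
    logits.fill_diagonal_(-1e9)             # exclude self-pairs from softmax
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```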

Explaining with Contrastive Phrasal Highlighting: A Case Study in Assisting Humans to Detect Translation Differences

  • paper_url: http://arxiv.org/abs/2312.01582
  • repo_url: https://github.com/elbria/ex-semdiv
  • paper_authors: Eleftheria Briakou, Navita Goyal, Marine Carpuat
  • for: Explaining how NLP models predict semantic divergence between two input texts.
  • methods: A new technique, phrase-alignment-guided erasure, generates contrastive highlights that explain the predictions of a semantic divergence model.
  • results: The highlights match human rationales of cross-lingual semantic differences better than popular post-hoc saliency techniques, and help people detect fine-grained meaning differences in human translations and critical machine translation errors.
    Abstract Explainable NLP techniques primarily explain by answering "Which tokens in the input are responsible for this prediction?''. We argue that for NLP models that make predictions by comparing two input texts, it is more useful to explain by answering "What differences between the two inputs explain this prediction?''. We introduce a technique to generate contrastive highlights that explain the predictions of a semantic divergence model via phrase-alignment-guided erasure. We show that the resulting highlights match human rationales of cross-lingual semantic differences better than popular post-hoc saliency techniques and that they successfully help people detect fine-grained meaning differences in human translations and critical machine translation errors.
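Phrase-alignment-guided erasure can be sketched as jointly erasing aligned phrase pairs and ranking them by how much the predicted divergence drops. `aligned_phrases` and `divergence_score` are hypothetical hooks into a phrase aligner and the divergence model.

```python
# Sketch: contrastive highlights via phrase-alignment-guided erasure.
# `aligned_phrases` and `divergence_score` are hypothetical hooks.
def contrastive_highlights(src, tgt, aligned_phrases, divergence_score,
                           top_k=3):
    base = divergence_score(src, tgt)
    effects = []
    for src_phrase, tgt_phrase in aligned_phrases(src, tgt):
        # Erase the aligned pair jointly and measure how much the predicted
        # divergence drops: large drops mark the contrastive difference.
        erased = divergence_score(src.replace(src_phrase, ""),
                                  tgt.replace(tgt_phrase, ""))
        effects.append((base - erased, src_phrase, tgt_phrase))
    return sorted(effects, reverse=True)[:top_k]
```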

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

  • paper_url: http://arxiv.org/abs/2312.01575
  • repo_url: https://github.com/keitokudo/multi-vidsum
  • paper_authors: Keito Kudo, Haruki Nagasawa, Jun Suzuki, Nobuyuki Shimizu
  • for: A practical task setting and dataset for multimodal video summarization, enabling quick comprehension of video content.
  • methods: The target task summarizes a video into a predefined number of keyframe-caption pairs, extracting crucial scenes as keyframes and generating captions that explain each one; an evaluation framework is proposed, and jointly optimizing keyframe selection and caption quality requires modeling the mutual dependence between keyframes and captions.
  • results: A dataset is constructed by expanding existing datasets, and two baseline systems are developed and their performance reported.
    Abstract This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly. This task aims to extract crucial scenes from the video in the form of images (keyframes) and generate corresponding captions explaining each keyframe's situation. This task is useful as a practical application and presents a highly challenging problem worthy of study. Specifically, achieving simultaneous optimization of the keyframe selection performance and caption quality necessitates careful consideration of the mutual dependence on both preceding and subsequent keyframes and captions. To facilitate subsequent research in this field, we also construct a dataset by expanding upon existing datasets and propose an evaluation framework. Furthermore, we develop two baseline systems and report their respective performance.
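A naive baseline for the task, evenly spaced keyframes each paired with a generated caption, can be sketched with OpenCV and a stub captioner. Both the sampling scheme and the `captioner` callable are illustrative; the paper stresses that keyframe selection and caption quality should be optimized jointly.

```python
# Sketch: a naive keyframe-caption baseline for this task setting.
# Frame sampling via OpenCV; `captioner` is a hypothetical image captioner.
import cv2

def keyframe_caption_pairs(video_path, captioner, n_pairs=4, stride=30):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:   # subsample to keep memory bounded
            frames.append(frame)
        idx += 1
    cap.release()
    # Evenly spaced keyframes; a real system would optimize selection
    # jointly with caption quality, as the paper emphasizes.
    step = max(1, len(frames) // n_pairs)
    keyframes = frames[::step][:n_pairs]
    return [(f, captioner(f)) for f in keyframes]
```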