cs.CL - 2023-11-29

Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis

  • paper_url: http://arxiv.org/abs/2311.17898
  • repo_url: None
  • paper_authors: Jinqi Luo, Kwan Ho Ryan Chan, Dimitris Dimos, René Vidal
  • for: Improve the quality and faithfulness of multimodal generative models without requiring large amounts of paired text-image annotations.
  • methods: A zero-shot framework that iteratively retrieves external knowledge and has a language model compress it into refined prompts, helping text-driven generators produce reliable visual content (see the sketch below).
  • results: Across multiple text-driven generation tasks (image, 3D rendering, and video), KPP produces faithful, semantically rich visual content and adapts to different visual domains and foundation-model bases.
    Abstract Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models. A prevalent strategy to align multiple modalities is to fine-tune the generator with a large number of annotated text-image pairs. However, such a procedure is labor-consuming and resource-draining. The key question we ask is: can we enhance the quality and faithfulness of text-driven generative models beyond extensive text-image pair annotations? To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content. Instead of training generators to handle generic prompts, KPP employs a recursive knowledge query process to gather informative external facts from the knowledge base, instructs a language model to compress the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis. The entire process is zero-shot, without accessing the architectures and parameters of generative models. We evaluate the framework across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of different domains. We further demonstrate the extensibility and adaptability of KPP through varying foundation model bases and instructions. Our results show that KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.
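A minimal sketch of the knowledge-pursuit loop described in the abstract, assuming abstract `retrieve` and `compress` callables; the function names and loop parameters are illustrative placeholders, not the authors' code or prompts.

```python
# Sketch of the iterative knowledge-pursuit loop (placeholder callables, not the authors' code).
from typing import Callable, List

def knowledge_pursuit_prompt(
    user_prompt: str,
    retrieve: Callable[[str, int], List[str]],   # knowledge-base query, e.g. a dense retriever
    compress: Callable[[str, List[str]], str],   # instruction-following LLM that rewrites the prompt
    n_rounds: int = 3,
    k: int = 5,
) -> str:
    """Recursively gather external facts and fold them into a refined prompt."""
    facts: List[str] = []
    query = user_prompt
    for _ in range(n_rounds):
        for fact in retrieve(query, k):          # pull top-k facts for the current query
            if fact not in facts:
                facts.append(fact)
        query = compress(user_prompt, facts)     # LLM condenses the facts into a refined prompt
    return query

# The refined prompt is then passed zero-shot to any frozen text-driven generator:
#   image = text_to_image_model(knowledge_pursuit_prompt(prompt, retrieve, compress))
```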

Higher-Order DisCoCat (Peirce-Lambek-Montague semantics)

  • paper_url: http://arxiv.org/abs/2311.17813
  • repo_url: None
  • paper_authors: Alexis Toumi, Giovanni de Felice
  • for: Propose a new definition of higher-order DisCoCat models in which the meaning of a word is not a diagram but a diagram-valued higher-order function.
  • methods: A variant of Montague semantics based on a lambda calculus whose primitives act on string diagrams rather than logical formulae, enabling higher-order and non-linear natural language semantics (see the toy sketch below).
  • results: A translation from the Lambek calculus into Peirce's system beta gives a purely diagrammatic treatment of adverbs, prepositions, negation and quantifiers, with a proof-of-concept implementation in DisCoPy.
    Abstract We propose a new definition of higher-order DisCoCat (categorical compositional distributional) models where the meaning of a word is not a diagram, but a diagram-valued higher-order function. Our models can be seen as a variant of Montague semantics based on a lambda calculus where the primitives act on string diagrams rather than logical formulae. As a special case, we show how to translate from the Lambek calculus into Peirce's system beta for first-order logic. This allows us to give a purely diagrammatic treatment of higher-order and non-linear processes in natural language semantics: adverbs, prepositions, negation and quantifiers. The theoretical definition presented in this article comes with a proof-of-concept implementation in DisCoPy, the Python library for string diagrams.
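A toy Python illustration of what a "diagram-valued higher-order function" means; it deliberately avoids the real DisCoPy API and uses a stand-in `Diagram` class, so it is a conceptual sketch only, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Diagram:
    label: str                          # stand-in for a string diagram

def tensor(d1: Diagram, d2: Diagram) -> Diagram:
    return Diagram(f"({d1.label} (x) {d2.label})")

# First-order words are ordinary diagrams.
alice = Diagram("Alice")
runs = Diagram("runs")

# A higher-order word (an adverb) is a diagram-valued function: it takes the
# verb's diagram and returns a new function from subject diagrams to diagrams.
def quickly(verb: Diagram) -> Callable[[Diagram], Diagram]:
    def modified_verb(subject: Diagram) -> Diagram:
        return Diagram(f"quickly{tensor(subject, verb).label}")
    return modified_verb

sentence = quickly(runs)(alice)         # "Alice runs quickly" as a composed diagram
print(sentence.label)                   # quickly(Alice (x) runs)
```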

DSS: Synthesizing long Digital Ink using Data augmentation, Style encoding and Split generation

  • paper_url: http://arxiv.org/abs/2311.17786
  • repo_url: None
  • paper_authors: Aleksandr Timofeev, Anastasiia Fadeeva, Andrei Afonin, Claudiu Musat, Andrii Maksai
  • for: Synthesize long text in digital ink by improving the generalization of handwriting generation models to long-form data.
  • methods: Data augmentation, changes to the encoder-decoder model architecture, and a modified inference procedure, using a contrastive-learning technique tailored to the handwriting domain; applicable to any encoder-decoder digital-ink model (see the sketch below).
  • results: Halves the character error rate on long-form English data relative to a baseline RNN and reduces it by 16% relative to the previous approach targeting the same problem; all three components improve the recognizability of the generated ink.
    Abstract As text generative models can give increasingly long answers, we tackle the problem of synthesizing long text in digital ink. We show that the commonly used models for this task fail to generalize to long-form data and how this problem can be solved by augmenting the training data, changing the model architecture and the inference procedure. These methods use contrastive learning technique and are tailored specifically for the handwriting domain. They can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method reduces the character error rate on long-form English data by half compared to baseline RNN and by 16% compared to the previous approach that aims at addressing the same problem. We show that all three parts of the method improve recognizability of generated inks. In addition, we evaluate synthesized data in a human study and find that people perceive most of generated data as real.
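A hedged sketch of one plausible reading of the split-generation idea: synthesize long ink chunk by chunk, re-encoding the writer style from the strokes already produced. The chunking rule and the `encode_style`/`generate_chunk` interfaces are assumptions, not the paper's implementation.

```python
from typing import Callable, List, Optional, Sequence

def synthesize_long_ink(
    text: str,
    encode_style: Callable[[Sequence], object],                # style encoder over previously generated strokes
    generate_chunk: Callable[[str, Optional[object]], List],   # ink generator for one short text chunk
    chunk_len: int = 10,
) -> List:
    """Generate long-form ink piecewise so each chunk stays in-distribution."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_len]) for i in range(0, len(words), chunk_len)]
    strokes: List = []
    style = None                                        # no style code before the first chunk
    for chunk in chunks:
        strokes.extend(generate_chunk(chunk, style))    # short segments resemble the training data
        style = encode_style(strokes)                   # re-estimate the writer style from the output so far
    return strokes
```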

Supervising the Centroid Baseline for Extractive Multi-Document Summarization

  • paper_url: http://arxiv.org/abs/2311.17771
  • repo_url: None
  • paper_authors: Simão Gonçalves, Gonçalo Correia, Diogo Pernes, Afonso Mendes
  • for: Improve the centroid baseline for extractive multi-document summarization, evaluated on several datasets including a multilingual scenario.
  • methods: A beam search process added to the sentence selection and a centroid-estimation attention model (see the sketch below).
  • results: Improved results on several multi-document summarization datasets, including in a multilingual scenario.
    Abstract The centroid method is a simple approach for extractive multi-document summarization and many improvements to its pipeline have been proposed. We further refine it by adding a beam search process to the sentence selection and also a centroid estimation attention model that leads to improved results. We demonstrate this in several multi-document summarization datasets, including in a multilingual scenario.
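A compact sketch of centroid-based sentence selection with beam search, as described in the abstract; the paper's centroid-estimation attention model is replaced here by a plain TF-IDF centroid, so this is an approximation rather than the authors' system.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def centroid_beam_summary(sentences, max_sents=3, beam_size=4):
    S = TfidfVectorizer().fit_transform(sentences).toarray()
    centroid = S.mean(axis=0, keepdims=True)             # centroid of the document cluster
    beams = [((), np.zeros_like(centroid))]              # (chosen indices, running summary vector)
    for _ in range(max_sents):
        candidates = []
        for chosen, summary_vec in beams:
            for i in range(len(sentences)):
                if i in chosen:
                    continue
                new_vec = summary_vec + S[i]
                score = cosine_similarity(new_vec, centroid)[0, 0]
                candidates.append((score, chosen + (i,), new_vec))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(chosen, vec) for _, chosen, vec in candidates[:beam_size]]
    return [sentences[i] for i in sorted(beams[0][0])]
```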

End-to-end Joint Rich and Normalized ASR with a limited amount of rich training data

  • paper_url: http://arxiv.org/abs/2311.17741
  • repo_url: None
  • paper_authors: Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent
  • for: Build a streaming-ready end-to-end ASR system that jointly produces rich (punctuated, capitalized) and normalized transcriptions.
  • methods: Two approaches to train a stateless Transducer-based E2E joint rich and normalized ASR system with a limited amount of rich labeled data: (1) a language model generates pseudo-rich transcriptions of normalized training data; (2) a single decoder is conditioned on the output type (see the sketch below).
  • results: The first approach yields E2E rich ASR that performs better on out-of-domain data, with up to 9% relative error reduction. The second approach shows that an E2E joint rich and normalized ASR system is feasible with as little as 5% rich training data, at a moderate (2.42% absolute) increase in errors.
    Abstract Joint rich and normalized automatic speech recognition (ASR), that produces transcriptions both with and without punctuation and capitalization, remains a challenge. End-to-end (E2E) ASR models offer both convenience and the ability to perform such joint transcription of speech. Training such models requires paired speech and rich text data, which is not widely available. In this paper, we compare two different approaches to train a stateless Transducer-based E2E joint rich and normalized ASR system, ready for streaming applications, with a limited amount of rich labeled data. The first approach uses a language model to generate pseudo-rich transcriptions of normalized training data. The second approach uses a single decoder conditioned on the type of the output. The first approach leads to E2E rich ASR which perform better on out-of-domain data, with up to 9% relative reduction in errors. The second approach demonstrates the feasibility of an E2E joint rich and normalized ASR system using as low as 5% rich training data with moderate (2.42% absolute) increase in errors.
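An illustrative sketch of the second approach (a single decoder conditioned on the output type), realized here as a tag token prepended to the target sequence; the tag names and the surrounding Transducer training loop are assumptions, not the paper's setup.

```python
RICH_TAG, NORM_TAG = "<rich>", "<norm>"

def make_training_pairs(audio_features, norm_text, rich_text=None):
    """Yield (audio, tagged target) pairs; rich targets exist for only ~5% of the data."""
    yield audio_features, [NORM_TAG] + norm_text.split()
    if rich_text is not None:                 # punctuation and capitalization preserved
        yield audio_features, [RICH_TAG] + rich_text.split()

# At inference, the same decoder produces either style depending on the tag it is primed with:
#   hyp_rich = decode(model, audio, prefix=[RICH_TAG])
#   hyp_norm = decode(model, audio, prefix=[NORM_TAG])
```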

SenTest: Evaluating Robustness of Sentence Encoders

  • paper_url: http://arxiv.org/abs/2311.17722
  • repo_url: None
  • paper_authors: Tanmay Chavan, Shantanu Patankar, Aditya Kane, Omkar Gokhale, Geetanjali Kale, Raviraj Joshi
  • for: Evaluate the robustness of sentence encoders (sentence transformers).
  • methods: Several adversarial attacks: character-level random character substitution, word-level synonym replacement, and sentence-level intra-sentence word-order shuffling (see the sketch below).
  • results: The experiments strongly undermine the robustness of sentence encoders: accuracy can fall by up to 15% on perturbed datasets. The embeddings do capture the semantic and syntactic structure of sentences, but existing supervised classification strategies fail to exploit this information and largely act as n-gram detectors.
    Abstract Contrastive learning has proven to be an effective method for pre-training models using weakly labeled data in the vision domain. Sentence transformers are the NLP counterparts to this architecture, and have been growing in popularity due to their rich and effective sentence representations. Having effective sentence representations is paramount in multiple tasks, such as information retrieval, retrieval augmented generation (RAG), and sentence comparison. Keeping in mind the deployability factor of transformers, evaluating the robustness of sentence transformers is of utmost importance. This work focuses on evaluating the robustness of the sentence encoders. We employ several adversarial attacks to evaluate its robustness. This system uses character-level attacks in the form of random character substitution, word-level attacks in the form of synonym replacement, and sentence-level attacks in the form of intra-sentence word order shuffling. The results of the experiments strongly undermine the robustness of sentence encoders. The models produce significantly different predictions as well as embeddings on perturbed datasets. The accuracy of the models can fall up to 15 percent on perturbed datasets as compared to unperturbed datasets. Furthermore, the experiments demonstrate that these embeddings does capture the semantic and syntactic structure (sentence order) of sentences. However, existing supervised classification strategies fail to leverage this information, and merely function as n-gram detectors.
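A sketch of the three perturbation levels named above; the WordNet-based synonym lookup and the substitution probabilities are illustrative choices, not necessarily those used in the paper.

```python
import random
import string

def char_attack(sentence: str, p: float = 0.1) -> str:
    """Character level: randomly substitute letters."""
    return "".join(
        random.choice(string.ascii_lowercase) if c.isalpha() and random.random() < p else c
        for c in sentence
    )

def synonym_attack(sentence: str, p: float = 0.2) -> str:
    """Word level: replace some words with WordNet synonyms (requires nltk.download('wordnet'))."""
    from nltk.corpus import wordnet
    words = sentence.split()
    for i, w in enumerate(words):
        if random.random() < p:
            lemmas = {l.name().replace("_", " ") for s in wordnet.synsets(w) for l in s.lemmas()}
            lemmas.discard(w)
            if lemmas:
                words[i] = random.choice(sorted(lemmas))
    return " ".join(words)

def shuffle_attack(sentence: str) -> str:
    """Sentence level: shuffle the word order within the sentence."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)
```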

How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation

  • paper_url: http://arxiv.org/abs/2311.17696
  • repo_url: None
  • paper_authors: Chenxi Dong
  • for: Provide personalized teaching support for any course
  • methods: Ingest course materials into an adaptive knowledge base and answer student questions using a cutting-edge Large Language Model (LLM) with Retrieval-Augmented Generation (RAG) (see the sketch below)
  • results: A fully-functional web interface and video demonstration showing versatility across diverse subjects and pedagogically cogent, evidence-citing responses
    Abstract Artificial intelligence is transforming education through data-driven, personalized learning solutions. This paper introduces AI Tutor, an innovative web application that provides personalized tutoring in any subject using state-of-the-art Large Language Model (LLM). AI Tutor ingests course materials to construct an adaptive knowledge base tailored to the course. When students pose questions, it retrieves the most relevant information and generates detailed, conversational responses citing supporting evidence. The system is powered by advanced large language models and Retrieval-Augmented Generation (RAG) techniques for accurate, natural question answering. We present a fully-functional web interface and video demonstration that showcase AI Tutor's versatility across diverse subjects and its ability to produce pedagogically cogent responses. While an initial prototype, this work represents a pioneering step toward AI-enabled tutoring systems that can democratize access to high-quality, customized educational support.
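A hedged sketch of the retrieve-then-generate loop the abstract describes, with the embedding model and LLM left as abstract callables; the cosine-similarity retrieval and the prompt template are assumptions rather than the system's actual implementation.

```python
import numpy as np
from typing import Callable, Sequence

def build_index(chunks: Sequence[str], embed: Callable[[str], np.ndarray]) -> np.ndarray:
    """Embed every chunk of the course material once."""
    return np.stack([embed(c) for c in chunks])

def answer_question(question: str, chunks: Sequence[str], index: np.ndarray,
                    embed: Callable[[str], np.ndarray], llm: Callable[[str], str],
                    k: int = 3) -> str:
    q = embed(question)
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-scores)[:k]                         # most relevant course passages
    context = "\n\n".join(f"[{i}] {chunks[i]}" for i in top)
    prompt = (f"Answer the student's question using only the course excerpts below, "
              f"citing them as [i].\n\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm(prompt)
```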

Enhancing Answer Selection in Community Question Answering with Pre-trained and Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17502
  • repo_url: None
  • paper_authors: Xinghang Hu
  • for: Answer selection in Community Question Answering (CQA), improving the accuracy with which relevant answers are identified.
  • methods: A Question-Answer cross attention network (QAN) built on pre-trained models, plus a large language model (LLM) for knowledge augmentation. A BERT encoder represents question subjects, question bodies and answers, and a cross-attention mechanism selects the most relevant answer for each question (see the sketch below).
  • results: QAN achieves state-of-the-art performance on the SemEval2015 and SemEval2017 datasets. Augmenting the LLM with external knowledge generated from questions and correct answers raises its correct-answer selection rate on both datasets, and optimized prompts let the LLM select the correct answer for more questions.
    Abstract Community Question Answering (CQA) becomes increasingly prevalent in recent years. However, there are a large number of answers, which is difficult for users to select the relevant answers. Therefore, answer selection is a very significant subtask of CQA. In this paper, we first propose the Question-Answer cross attention networks (QAN) with pre-trained models for answer selection and utilize large language model (LLM) to perform answer selection with knowledge augmentation. Specifically, we apply the BERT model as the encoder layer to do pre-training for question subjects, question bodies and answers, respectively, then the cross attention mechanism selects the most relevant answer for different questions. Experiments show that the QAN model achieves state-of-the-art performance on two datasets, SemEval2015 and SemEval2017. Moreover, we use the LLM to generate external knowledge from questions and correct answers to achieve knowledge augmentation for the answer selection task by LLM, while optimizing the prompt of LLM in different aspects. The results show that the introduction of external knowledge can improve the correct answer selection rate of LLM on datasets SemEval2015 and SemEval2017. Meanwhile, LLM can also select the correct answer on more questions by optimized prompt.
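A hedged PyTorch sketch of a cross-attention answer scorer in the spirit of QAN; the pooling scheme, layer sizes and the use of a single shared BERT encoder are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class QuestionAnswerCrossAttention(nn.Module):
    """Scores a candidate answer against a question with cross-attention."""
    def __init__(self, name="bert-base-uncased", dim=768, heads=8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)           # shared BERT encoder (an assumption)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, question_inputs, answer_inputs):
        q = self.encoder(**question_inputs).last_hidden_state    # (B, Lq, D) question subject + body
        a = self.encoder(**answer_inputs).last_hidden_state      # (B, La, D) candidate answer
        attended, _ = self.cross_attn(query=a, key=q, value=q)   # answer tokens attend to the question
        return self.scorer(attended.mean(dim=1)).squeeze(-1)     # one relevance score per candidate
```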

Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

  • paper_url: http://arxiv.org/abs/2311.17492
  • repo_url: None
  • paper_authors: Jean Seo, Sungjoo Byun, Minha Kang, Sangah Lee
  • for: Preserve the endangered Manchu language through Mergen, the first Manchu-Korean machine translation (MT) model.
  • methods: Resources such as the Manwen Laodang (a historical text) and a Manchu-Korean dictionary; data expansion via word replacement guided by GloVe embeddings trained on monolingual and parallel texts; an encoder-decoder neural MT model with a bi-directional GRU layer (see the sketch below).
  • results: A significant improvement in Manchu-Korean translation, with a 20-30 point increase in BLEU score.
    Abstract The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang(a historical book) and a Manchu-Korean dictionary. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings, trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score.
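A sketch of GloVe-guided word replacement for data augmentation; the replacement probability and the nearest-neighbour rule are assumptions about how such augmentation might look, not the authors' procedure.

```python
import random
import numpy as np
from typing import Dict, List

def nearest_neighbor(word: str, emb: Dict[str, np.ndarray]) -> str:
    """Closest other word in the GloVe space by cosine similarity."""
    v = emb[word]
    best, best_sim = word, -1.0
    for w, u in emb.items():
        if w == word:
            continue
        sim = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-9))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

def augment_sentence(tokens: List[str], emb: Dict[str, np.ndarray], p: float = 0.15) -> List[str]:
    """Create a synthetic variant by swapping some words for embedding-space neighbours."""
    return [nearest_neighbor(t, emb) if t in emb and random.random() < p else t for t in tokens]
```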

Improving the Robustness of Transformer-based Large Language Models with Dynamic Attention

  • paper_url: http://arxiv.org/abs/2311.17400
  • repo_url: None
  • paper_authors: Lujia Shen, Yuwen Pu, Shouling Ji, Changjiang Li, Xuhong Zhang, Chunpeng Ge, Ting Wang
  • for: This paper aims to enhance the inherent robustness of transformer-based models, such as BERT and GPT, against various textual adversarial attacks.
  • methods: The proposed method, called dynamic attention, consists of two modules: (i) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens (see the sketch below).
  • results: The proposed dynamic attention significantly mitigates the impact of adversarial attacks, improving up to 33% better performance than previous methods against widely-used adversarial attacks.
    Abstract Transformer-based models, such as BERT and GPT, have been widely adopted in natural language processing (NLP) due to their exceptional performance. However, recent studies show their vulnerability to textual adversarial attacks where the model's output can be misled by intentionally manipulating the text inputs. Despite various methods that have been proposed to enhance the model's robustness and mitigate this vulnerability, many require heavy consumption resources (e.g., adversarial training) or only provide limited protection (e.g., defensive dropout). In this paper, we propose a novel method called dynamic attention, tailored for the transformer architecture, to enhance the inherent robustness of the model itself against various adversarial attacks. Our method requires no downstream task knowledge and does not incur additional costs. The proposed dynamic attention consists of two modules: (I) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens. Extensive experiments demonstrate that dynamic attention significantly mitigates the impact of adversarial attacks, improving up to 33\% better performance than previous methods against widely-used adversarial attacks. The model-level design of dynamic attention enables it to be easily combined with other defense methods (e.g., adversarial training) to further enhance the model's robustness. Furthermore, we demonstrate that dynamic attention preserves the state-of-the-art robustness space of the original model compared to other dynamic modeling methods.
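A hedged sketch of the attention-rectification module: attention paid to a chosen set of tokens is masked or weakened and each row is renormalized. How the candidate tokens are chosen (the dynamic-modeling module) is not implemented here, and the tensor layout is an assumption.

```python
import torch

def rectify_attention(attn: torch.Tensor, candidate_idx: torch.Tensor,
                      weaken: float = 0.0) -> torch.Tensor:
    """attn: (batch, heads, L, L) post-softmax weights; candidate_idx: (batch, k) token positions."""
    scale = torch.ones_like(attn)
    for b in range(attn.size(0)):
        scale[b, :, :, candidate_idx[b]] = weaken        # mask (0.0) or weaken attention *to* chosen tokens
    rectified = attn * scale
    return rectified / rectified.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize each row
```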

Unveiling the Implicit Toxicity in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17391
  • repo_url: https://github.com/thu-coai/implicit-toxicity
  • paper_authors: Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang
  • for: Study the safety issues arising when large language models (LLMs) are exploited for malicious use.
  • methods: A reinforcement learning (RL) based attacking method that optimizes the language model with a reward preferring implicit toxic outputs over explicit toxic and non-toxic ones.
  • results: RL fine-tuning significantly improves attack success rates against five widely adopted toxicity classifiers; for example, the RL-finetuned LLaMA-13B model reaches 90.04% on BAD and 62.85% on Davinci003. Fine-tuning toxicity classifiers on annotated examples from the attack improves their ability to detect LLM-generated implicit toxic language.
    Abstract The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simply zero-shot prompting. Moreover, we propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs. Specifically, we optimize the language model with a reward that prefers implicit toxic outputs to explicit toxic and non-toxic ones. Experiments on five widely-adopted toxicity classifiers demonstrate that the attack success rate can be significantly improved through RL fine-tuning. For instance, the RL-finetuned LLaMA-13B model achieves an attack success rate of 90.04% on BAD and 62.85% on Davinci003. Our findings suggest that LLMs pose a significant threat in generating undetectable implicit toxic outputs. We further show that fine-tuning toxicity classifiers on the annotated examples from our attacking method can effectively enhance their ability to detect LLM-generated implicit toxic language. The code is publicly available at https://github.com/thu-coai/Implicit-Toxicity.

CESAR: Automatic Induction of Compositional Instructions for Multi-turn Dialogs

  • paper_url: http://arxiv.org/abs/2311.17376
  • repo_url: None
  • paper_authors: Taha Aksu, Devamanyu Hazarika, Shikib Mehri, Seokhwan Kim, Dilek Hakkani-Tür, Yang Liu, Mahdi Namazifar
  • for: Improve the performance of large language models (LLMs) in multi-turn dialog applications, particularly on complex instructions with multiple constraints.
  • methods: A new framework, CESAR, that unifies a large number of dialog tasks in the same format and programmatically induces complex, compositional instructions without manual effort.
  • results: Applied to InstructDial and extended with new datasets and tasks, CESAR yields the InstructDial++ benchmark (63 datasets, 86 basic tasks, 68 composite tasks). Models trained on it can follow compositional prompts, such as prompts with multiple stylistic constraints.
    Abstract Instruction-based multitasking has played a critical role in the success of large language models (LLMs) in multi-turn dialog applications. While publicly available LLMs have shown promising performance, when exposed to complex instructions with multiple constraints, they lag against state-of-the-art models like ChatGPT. In this work, we hypothesize that the availability of large-scale complex demonstrations is crucial in bridging this gap. Focusing on dialog applications, we propose a novel framework, CESAR, that unifies a large number of dialog tasks in the same format and allows programmatic induction of complex instructions without any manual effort. We apply CESAR on InstructDial, a benchmark for instruction-based dialog tasks. We further enhance InstructDial with new datasets and tasks and utilize CESAR to induce complex tasks with compositional instructions. This results in a new benchmark called InstructDial++, which includes 63 datasets with 86 basic tasks and 68 composite tasks. Through rigorous experiments, we demonstrate the scalability of CESAR in providing rich instructions. Models trained on InstructDial++ can follow compositional prompts, such as prompts that ask for multiple stylistic constraints.

Are Large Language Models Good Fact Checkers: A Preliminary Study

  • paper_url: http://arxiv.org/abs/2311.17355
  • repo_url: None
  • paper_authors: Han Cao, Lingwei Wei, Mengyang Chen, Wei Zhou, Songlin Hu
  • for: Evaluate the potential of large language models (LLMs) for fact-checking and their performance on specific fact-checking subtasks.
  • methods: A systematic evaluation and comparative analysis of various LLMs against pre-trained and state-of-the-art low-parameter models.
  • results: LLMs achieve performance competitive with smaller models in most scenarios, but struggle with Chinese fact verification and the full fact-checking pipeline due to language inconsistencies and hallucinations, underscoring the need for further research before LLMs can serve as reliable fact-checkers.
    Abstract Recently, Large Language Models (LLMs) have drawn significant attention due to their outstanding reasoning capabilities and extensive knowledge repository, positioning them as superior in handling various natural language processing tasks compared to other language models. In this paper, we present a preliminary investigation into the potential of LLMs in fact-checking. This study aims to comprehensively evaluate various LLMs in tackling specific fact-checking subtasks, systematically evaluating their capabilities, and conducting a comparative analysis of their performance against pre-trained and state-of-the-art low-parameter models. Experiments demonstrate that LLMs achieve competitive performance compared to other small models in most scenarios. However, they encounter challenges in effectively handling Chinese fact verification and the entirety of the fact-checking pipeline due to language inconsistencies and hallucinations. These findings underscore the need for further exploration and research to enhance the proficiency of LLMs as reliable fact-checkers, unveiling the potential capability of LLMs and the possible challenges in fact-checking tasks.

Efficient Stitchable Task Adaptation

  • paper_url: http://arxiv.org/abs/2311.17352
  • repo_url: None
  • paper_authors: Haoyu He, Zizheng Pan, Jing Liu, Jianfei Cai, Bohan Zhuang
  • for: Quickly produce many new networks ("stitches") from pre-trained models via model stitching while adhering to diverse resource constraints.
  • methods: Parameter-efficient fine-tuning that shares low-rank updates among the stitches while keeping independent bias terms, greatly reducing fine-tuning memory and mitigating interference during task adaptation, plus a simple yet effective one-stage deployment pipeline that uses training-time gradient statistics to estimate which stitches to deploy (see the sketch below).
  • results: On 25 downstream visual recognition tasks, ESTA generates stitches with smooth accuracy-efficiency trade-offs and surpasses direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters; the framework also stitches LLMs from the LLaMA family, producing chatbot stitches of assorted sizes.
    Abstract The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. Recently, considering diverse deployment scenarios with various resource budgets, stitchable neural network (SN-Net) is introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net confronts new challenges when adapting it to new target domains, including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches, we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of our ESTA framework by stitching LLMs from LLaMA family, obtaining chatbot stitches of assorted sizes.
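A hedged sketch of the parameter-efficient design described in the abstract: stitches share one low-rank (LoRA-style) update to a frozen linear layer while each stitch keeps its own independent bias. The rank, initialization and module name are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SharedLoRALinear(nn.Module):
    """A frozen linear layer with one low-rank update shared by all stitches and per-stitch biases."""
    def __init__(self, base: nn.Linear, num_stitches: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.stitch_bias = nn.Parameter(torch.zeros(num_stitches, base.out_features))

    def forward(self, x: torch.Tensor, stitch_id: int) -> torch.Tensor:
        update = x @ self.lora_a.T @ self.lora_b.T       # shared low-rank update
        return self.base(x) + update + self.stitch_bias[stitch_id]
```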

Biomedical knowledge graph-enhanced prompt generation for large language models

  • paper_url: http://arxiv.org/abs/2311.17330
  • repo_url: None
  • paper_authors: Karthik Soman, Peter W Rose, John H Morris, Rabia E Akbas, Brett Smith, Braian Peetoom, Catalina Villouta-Reyes, Gabriel Cerono, Yongmei Shi, Angela Rizk-Jackson, Sharat Israni, Charlotte A Nelson, Sui Huang, Sergio E Baranzini
  • for: Advance AI in the biomedical domain by addressing the knowledge bottleneck of general-purpose LLMs.
  • methods: A task-agnostic Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that pairs the massive biomedical knowledge graph SPOKE with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4 to generate biomedical text grounded in established knowledge, across prompt types including one-hop and two-hop prompts, drug repurposing queries, biomedical true/false questions and multiple-choice questions (MCQ) (see the sketch below).
  • results: KG-RAG consistently improves LLM performance, most notably a 71% boost for Llama-2 on the challenging MCQ dataset; it also improves proprietary GPT models (GPT-3.5 exceeded GPT-4 in context utilization on MCQ data) and returns meaningful drug repurposing suggestions.
    Abstract Large Language Models (LLMs) have been driving progress in AI at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, and the latter require domain-expertise. External knowledge infusion is task-specific and requires model training. Here, we introduce a task-agnostic Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging the massive biomedical KG SPOKE with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to generate meaningful biomedical text rooted in established knowledge. KG-RAG consistently enhanced the performance of LLMs across various prompt types, including one-hop and two-hop prompts, drug repurposing queries, biomedical true/false questions, and multiple-choice questions (MCQ). Notably, KG-RAG provides a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 which exhibited improvement over GPT-4 in context utilization on MCQ data. Our approach was also able to address drug repurposing questions, returning meaningful repurposing suggestions. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM, respectively, in an optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a unified framework.
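A hedged sketch of the KG-RAG prompting pattern: extract entities from the question, pull related triples from a biomedical knowledge graph (SPOKE in the paper, abstracted here), verbalize them and prepend them to the LLM prompt. The entity extraction, KG access and prompt template are placeholders, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]                            # (subject, relation, object)

def kg_rag_prompt(question: str,
                  extract_entities: Callable[[str], List[str]],
                  query_kg: Callable[[str], List[Triple]],
                  max_facts_per_entity: int = 10) -> str:
    facts: List[str] = []
    for entity in extract_entities(question):
        for s, r, o in query_kg(entity)[:max_facts_per_entity]:
            facts.append(f"{s} {r.replace('_', ' ')} {o}.")
    context = "\n".join(dict.fromkeys(facts))            # de-duplicate while keeping order
    return (f"Answer the question using the biomedical facts below.\n\n"
            f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:")
```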