cs.CL - 2023-07-29

Towards Codable Text Watermarking for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.15992
  • repo_url: https://github.com/lancopku/codable-watermarking-for-llm
  • paper_authors: Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, Xu Sun
  • for: Preventing the abuse of large language models (LLMs) by injecting hidden patterns into generated text so that the source of a text can be identified.
  • methods: Text watermarking: hidden patterns are embedded in LLM-generated text to identify its source.
  • results: The first systematic study of Codable Text Watermarking for LLMs (CTWL), which lets watermarks carry more customizable information to meet the diverse encoding needs of different LLM application scenarios.
    Abstract As large language models (LLMs) generate texts with increasing fluency and realism, there is a growing need to identify the source of texts to prevent the abuse of LLMs. Text watermarking techniques have proven reliable in distinguishing whether a text is generated by LLMs by injecting hidden patterns into the generated texts. However, we argue that existing watermarking methods for LLMs are encoding-inefficient (only contain one bit of information - whether it is generated from an LLM or not) and cannot flexibly meet the diverse information encoding needs (such as encoding model version, generation time, user id, etc.) in different LLMs application scenarios. In this work, we conduct the first systematic study on the topic of Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry more customizable information. First of all, we study the taxonomy of LLM watermarking technology and give a mathematical formulation for CTWL. Additionally, we provide a comprehensive evaluation system for CTWL: (1) watermarking success rate, (2) robustness against various corruptions, (3) coding rate of payload information, (4) encoding and decoding efficiency, (5) impacts on the quality of the generated text. To meet the requirements of these non-Pareto-improving metrics, we devise a CTWL method named Balance-Marking, based on the motivation of ensuring that available and unavailable vocabularies for encoding information have approximately equivalent probabilities. Compared to the random vocabulary partitioning extended from the existing work, a probability-balanced vocabulary partition can significantly improve the quality of the generated text. Extensive experimental results have shown that our method outperforms a direct baseline under comprehensive evaluation.
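Below is a minimal Python sketch of the probability-balanced vocabulary partitioning idea behind Balance-Marking, assuming one payload bit is embedded per generation step. The greedy partition heuristic and all function names are our own illustration, not the authors' released code (see the linked repository for that).

```python
import torch

def balanced_partition(probs: torch.Tensor):
    """Greedily split token ids into two sets whose total probability
    mass is approximately equal (the "balance" in Balance-Marking)."""
    order = torch.argsort(probs, descending=True)
    mass, sets = [0.0, 0.0], [[], []]
    for tok in order.tolist():
        side = 0 if mass[0] <= mass[1] else 1  # always add to the lighter side
        sets[side].append(tok)
        mass[side] += probs[tok].item()
    return sets

def watermarked_step(logits: torch.Tensor, bit: int) -> int:
    """Sample the next token only from the vocabulary half encoding `bit`,
    so a decoder can later recover the bit from the emitted token."""
    probs = torch.softmax(logits, dim=-1)
    allowed = balanced_partition(probs)[bit]
    masked = torch.zeros_like(probs)
    masked[allowed] = probs[allowed]
    return torch.multinomial(masked / masked.sum(), 1).item()
```

Because both halves carry roughly half the probability mass, restricting sampling to one half distorts the output distribution far less than a random 50/50 vocabulary split, which is the paper's motivation for better text quality.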

GeneMask: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2307.15933
  • repo_url: https://github.com/roysoumya/genemask
  • paper_authors: Soumyadeep Roy, Jonas Wallat, Sowmya S Sundaram, Wolfgang Nejdl, Niloy Ganguly
  • for: Learning better gene representations for gene sequence classification.
  • methods: A novel masking algorithm, GeneMask, for Masked Language Modeling (MLM) training on gene sequences.
  • results: On four gene sequence classification datasets, GeneMask-based models substantially outperform the SOTA models (DNABert and LOGO) across five few-shot settings, while training for less than one-tenth of the original model's epochs. A strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs suggests the masking captures latent genomic information.
    Abstract Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GeneMask, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GeneMask-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GeneMask-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
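A rough sketch of the span-selection step, assuming precomputed corpus-level unigram and adjacent-bigram counts over k-mer tokens. Scoring a candidate window by the sum of adjacent-pair NPMI values is our simplification; the paper's exact NPMI span criterion may differ.

```python
import math
import random
from collections import Counter

def npmi(x: str, y: str, uni: Counter, bi: Counter, total: int) -> float:
    """Normalized PMI of an adjacent token pair, estimated from corpus counts."""
    p_xy = bi[(x, y)] / total
    if p_xy == 0:
        return -1.0  # NPMI lower bound for never-co-occurring pairs
    p_x, p_y = uni[x] / total, uni[y] / total
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def genemask_spans(tokens, uni, bi, total, n_centers=3, span=5):
    """Pick random mask centers, then around each center choose the
    fixed-width window whose adjacent-pair NPMI sum is highest."""
    masked = set()
    for _ in range(n_centers):
        c = random.randrange(len(tokens))
        best, best_score = None, float("-inf")
        # candidate windows of width `span` that cover the center position
        for start in range(max(0, c - span + 1), min(c, len(tokens) - span) + 1):
            window = tokens[start:start + span]
            score = sum(npmi(a, b, uni, bi, total)
                        for a, b in zip(window, window[1:]))
            if score > best_score:
                best, best_score = start, score
        if best is not None:
            masked.update(range(best, best + span))
    return sorted(masked)  # token positions to replace with [MASK]
```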

Analysing the Resourcefulness of the Paragraph for Precedence Retrieval

  • paper_url: http://arxiv.org/abs/2308.01203
  • repo_url: https://github.com/bhoomeendra/paragraph_resourcefulness
  • paper_authors: Bhoomeendra Singh Sisodiya, Narendra Babu Unnam, P. Krishna Reddy, Apala Das, K. V. K. Santhy, V. Balakista Reddy
  • for: aid legal practitioners in retrieving relevant legal information
  • methods: analyzed the resourcefulness of paragraph-level information in capturing similarity among judgments
  • results: found that paragraph-level methods could capture similarity among judgments with only a few paragraph interactions and exhibit more discriminating power than the baseline document-level method, with performance comparable to state-of-the-art methods.
    Abstract Developing methods for extracting relevant legal information to aid legal practitioners is an active research area. In this regard, research efforts are being made by leveraging different kinds of information, such as meta-data, citations, keywords, sentences, paragraphs, etc. Similar to any text document, legal documents are composed of paragraphs. In this paper, we have analyzed the resourcefulness of paragraph-level information in capturing similarity among judgments for improving the performance of precedence retrieval. We found that the paragraph-level methods could capture the similarity among the judgments with only a few paragraph interactions and exhibit more discriminating power over the baseline document-level method. Moreover, the comparison results on two benchmark datasets for the precedence retrieval on the Indian Supreme Court judgments task show that the paragraph-level methods exhibit comparable performance with the state-of-the-art methods.
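A hedged sketch of the paragraph-interaction idea: embed each judgment's paragraphs, compute pairwise similarities between the two sets, and aggregate only the few strongest interactions. TF-IDF vectors and a mean-of-top-k aggregation are stand-ins chosen for illustration; the paper evaluates its own paragraph-level representations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def precedence_score(paras_a, paras_b, top_k=3):
    """Score similarity between two judgments from paragraph-level
    interactions: a few strong paragraph matches dominate the score."""
    vec = TfidfVectorizer().fit(paras_a + paras_b)
    sims = cosine_similarity(vec.transform(paras_a), vec.transform(paras_b))
    top = np.sort(sims.ravel())[-top_k:]  # the strongest paragraph pairs
    return float(top.mean())
```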

Dialogue Shaping: Empowering Agents through NPC Interaction

  • paper_url: http://arxiv.org/abs/2307.15833
  • repo_url: None
  • paper_authors: Wei Zhou, Xiangyu Peng, Mark Riedl
  • for: Using large language models to help RL agents converge to optimal policies faster in text-based game environments, where training otherwise requires a very large number of steps.
  • methods: Large language models (LLMs) converse with NPCs to obtain key game information, which is then integrated into the RL agent's training through knowledge graphs (KGs) and Story Shaping.
  • results: Extracting key information through NPC interaction helps the RL agent converge to the optimal policy faster, improving training efficiency.
    Abstract One major challenge in reinforcement learning (RL) is the large number of steps the RL agent needs to converge during training and learn the optimal policy, especially in text-based game environments where the action space is extensive. However, non-player characters (NPCs) sometimes hold some key information about the game, which can potentially help to train RL agents faster. Thus, this paper explores how to interact and converse with NPC agents to get the key information using large language models (LLMs), as well as how to incorporate this information to speed up the RL agent's training using knowledge graphs (KGs) and Story Shaping.
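One plausible shape of the LLM-to-knowledge-graph step, assuming an abstract `llm(prompt) -> str` callable and a pipe-separated triple format; both are illustrative choices on our part, not the paper's actual protocol.

```python
import networkx as nx

def extract_triples(llm, npc_dialogue: str):
    """Ask an LLM to distill an NPC conversation into (subject, relation,
    object) triples, one per line, e.g. 'golden key|unlocks|tower door'."""
    prompt = (
        "Extract game facts from this dialogue as 'subject|relation|object' "
        f"lines:\n{npc_dialogue}"
    )
    lines = llm(prompt).strip().splitlines()
    return [tuple(l.split("|")) for l in lines if l.count("|") == 2]

def update_knowledge_graph(kg: nx.DiGraph, triples):
    """Fold the extracted facts into the knowledge graph that is then
    used to shape the RL agent's rewards and action pruning."""
    for subj, rel, obj in triples:
        kg.add_edge(subj.strip(), obj.strip(), relation=rel.strip())
    return kg
```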

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

  • paper_url: http://arxiv.org/abs/2307.15818
  • repo_url: None
  • paper_authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich
  • for: Integrating vision-language models directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning.
  • methods: A simple, general recipe: express robot actions as text tokens and incorporate them directly into the model's training set, in the same way as natural language tokens.
  • results: The approach yields performant robotic policies with strong generalization and a range of emergent capabilities from Internet-scale training, such as generalizing to novel objects, interpreting commands not present in the robot training data, and performing multi-stage semantic reasoning.
    Abstract We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
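A sketch of the actions-as-text idea: each action dimension is discretized into an integer bin and written out as plain text, so action strings can sit in the same training stream as natural-language tokens. The uniform 256-bin scheme and value range below are assumptions for illustration, not RT-2's exact tokenization.

```python
import numpy as np

def action_to_text(action: np.ndarray, low=-1.0, high=1.0, bins=256) -> str:
    """Discretize each action dimension into an integer bin and render the
    result as a plain-text token string a language model can emit."""
    clipped = np.clip(action, low, high)
    ids = np.round((clipped - low) / (high - low) * (bins - 1)).astype(int)
    return " ".join(str(i) for i in ids)

def text_to_action(text: str, low=-1.0, high=1.0, bins=256) -> np.ndarray:
    """Invert the mapping: parse integer tokens back into continuous values."""
    ids = np.array([int(t) for t in text.split()])
    return low + ids / (bins - 1) * (high - low)

# e.g. a 7-DoF end-effector command might round-trip as "132 87 255 0 64 199 128"
```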

Resume Evaluation through Latent Dirichlet Allocation and Natural Language Processing for Effective Candidate Selection

  • paper_url: http://arxiv.org/abs/2307.15752
  • repo_url: None
  • paper_authors: Vidhita Jagwani, Smit Meghani, Krishna Pai, Sudhir Dhage
  • for: A resume-rating method based on Latent Dirichlet Allocation (LDA) and entity detection with SpaCy, aiming for scores that are content-driven rather than driven by structure and keyword matching.
  • methods: SpaCy's Named Entity Recognition (NER) first extracts relevant entities such as education, experience, and skills from the resume; the LDA model then rates the resume by assigning topic probabilities to each entity.
  • results: By breaking resumes down into latent topics and extracting meaningful semantic representations, the proposed system achieves 77% accuracy when only skills are considered and 82% overall accuracy when all attributes (college name, work experience, degree, and skills) are considered.
    Abstract In this paper, we propose a method for resume rating using Latent Dirichlet Allocation (LDA) and entity detection with SpaCy. The proposed method first extracts relevant entities such as education, experience, and skills from the resume using SpaCy's Named Entity Recognition (NER). The LDA model then uses these entities to rate the resume by assigning topic probabilities to each entity. Furthermore, we conduct a detailed analysis of the entity detection using SpaCy's NER and report its evaluation metrics. Using LDA, our proposed system breaks down resumes into latent topics and extracts meaningful semantic representations. With a vision to define our resume score to be more content-driven rather than a structure and keyword match driven, our model has achieved 77% accuracy with respect to only skills in consideration and an overall 82% accuracy with all attributes in consideration. (like college name, work experience, degree and skills)
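A minimal pipeline sketch combining SpaCy NER and gensim LDA in the spirit of the paper. The generic `en_core_web_sm` model and the topic count are placeholders, and the paper's final scoring step (mapping topic probabilities to a resume score) is omitted.

```python
import spacy
from gensim import corpora
from gensim.models import LdaModel

nlp = spacy.load("en_core_web_sm")  # generic NER; the paper likely tunes this

def resume_topics(resume_texts, num_topics=5):
    """Extract entity tokens from each resume with SpaCy NER, then fit an
    LDA model over the entity 'documents' to get per-resume topic mixtures."""
    docs = [[ent.text.lower() for ent in nlp(text).ents] for text in resume_texts]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return [lda.get_document_topics(bow) for bow in corpus]
```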

Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.15745
  • repo_url: None
  • paper_authors: Nandita Naik, Christopher Potts, Elisa Kreiss
  • for: Making the Internet more accessible in an interactive way by letting people who cannot see images ask questions about them.
  • methods: Introduces the Context-VQA dataset, which pairs images with the contexts in which they appear, specifically types of websites (e.g., a shopping website).
  • results: Question types vary systematically across contexts: images in a travel context draw 2 times more "Where?" questions, and images on social media and news draw 2.8 and 1.8 times more "Who?" questions than average. Context effects matter most when participants cannot see the image, so VQA models should be context-sensitive, especially in accessibility settings.
    Abstract Visual question answering (VQA) has the potential to make the Internet more accessible in an interactive way, allowing people who cannot see images to ask questions about them. However, multiple studies have shown that people who are blind or have low-vision prefer image explanations that incorporate the context in which an image appears, yet current VQA datasets focus on images in isolation. We argue that VQA models will not fully succeed at meeting people's needs unless they take context into account. To further motivate and analyze the distinction between different contexts, we introduce Context-VQA, a VQA dataset that pairs images with contexts, specifically types of websites (e.g., a shopping website). We find that the types of questions vary systematically across contexts. For example, images presented in a travel context garner 2 times more "Where?" questions, and images on social media and news garner 2.8 and 1.8 times more "Who?" questions than the average. We also find that context effects are especially important when participants can't see the image. These results demonstrate that context affects the types of questions asked and that VQA models should be context-sensitive to better meet people's needs, especially in accessibility settings.
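For concreteness, a hypothetical record layout for a Context-VQA style example; the field names and the set of context labels are our assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative context labels; the dataset's actual label set may differ.
Context = Literal["shopping", "travel", "social media", "news", "science", "health"]

@dataclass
class ContextVQAExample:
    image_path: str   # the image being asked about
    context: Context  # type of website the image appears on
    question: str     # question asked given that context
    answer: str

ex = ContextVQAExample("img_001.jpg", "travel",
                       "Where was this photo taken?", "Venice")
```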