2023-10-25

cs.CL

cs.CL - 2023-10-25

BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs’ Generation

paper_url: http://arxiv.org/abs/2310.17054
repo_url: https://github.com/PlusLabNLP/BOOST_EMNLP23
paper_authors: Yufei Tian, Felix Zhang, Nanyun Peng
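for: This work aims to make the text generated by large language models (LLMs) more commonsensical.
methods: We propose a computation-efficient framework that uses a frozen pre-trained language model (PTLM) to generate more commonsensical output. We first build a reference-free evaluator that scores how commonsensical a sentence is. We then use the evaluator as an oracle for commonsense knowledge and extend the NADO method to train an auxiliary head that guides the PTLM to better satisfy the oracle.
results: We test the framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs), and the results show that our method consistently produces the most commonsensical outputs.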

Abstract
Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.

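Summary
Large language models (LLMs) such as GPT-3 have shown strong text generation ability, yet a key problem remains: their outputs sometimes lack commonsense. Fine-tuning the entire LLM towards more commonsensical output is computationally expensive, if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical text. Specifically, we first build a reference-free evaluator that assigns a sentence a commonsense score by grounding it in a dynamic commonsense knowledge base along four relational aspects. We then use this evaluator as the oracle and extend the NADO controllable generation method to train an auxiliary head that guides the fixed PTLM to better satisfy the oracle. We test the framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs). Human evaluation shows that our method consistently leads to the most commonsensical outputs.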

Follow-on Question Suggestion via Voice Hints for Voice Assistants

paper_url: http://arxiv.org/abs/2310.17034
repo_url: None
paper_authors: Besnik Fetahu, Pedro Faustini, Giuseppe Castellucci, Anjie Fang, Oleg Rokhlenko, Shervin Malmasi
for: This paper aims to provide a solution for suggesting questions with compact and natural voice hints to allow users to ask follow-up questions in voice-based search settings.
methods: The authors propose an approach using sequence-to-sequence Transformers to generate spoken hints from a list of questions, and also define a linguistically-motivated pretraining task to improve the quality of the hints.
results: The authors evaluate their approach using a new dataset of 6681 input questions and human written hints, and find that their approach is strongly preferred by humans for producing the most natural hints, as compared to a naive approach of concatenating suggested questions.

Abstract
The adoption of voice assistants like Alexa or Siri has grown rapidly, allowing users to instantly access information via voice search. Query suggestion is a standard feature of screen-based search experiences, allowing users to explore additional topics. However, this is not trivial to implement in voice-based settings. To enable this, we tackle the novel task of suggesting questions with compact and natural voice hints to allow users to ask follow-up questions. We define the task, ground it in syntactic theory and outline linguistic desiderata for spoken hints. We propose baselines and an approach using sequence-to-sequence Transformers to generate spoken hints from a list of questions. Using a new dataset of 6681 input questions and human written hints, we evaluated the models with automatic metrics and human evaluation. Results show that a naive approach of concatenating suggested questions creates poor voice hints. Our approach, which applies a linguistically-motivated pretraining task, was strongly preferred by humans for producing the most natural hints.

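Summary
The adoption of voice assistants such as Alexa or Siri has grown rapidly, letting users access information through voice search. Query suggestion is a standard feature of screen-based search that lets users explore related topics, but it is not easy to implement in voice-based settings. To enable it, we tackle a new task: suggesting questions through compact, natural voice hints so that users can ask follow-up questions. We define the task, ground it in syntactic theory, and outline linguistic desiderata that keep hints natural and easy to understand. We propose an approach based on sequence-to-sequence Transformers that turns a list of questions into a spoken hint. Using a dataset of 6681 input questions and human-written hints, we evaluate the models. Results show that naively concatenating the suggested questions produces poor voice hints, whereas our approach, which applies a linguistically motivated pretraining task, was judged by humans to produce the most natural hints.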

Conditionally Combining Robot Skills using Large Language Models

paper_url: http://arxiv.org/abs/2310.17019
repo_url: https://github.com/krzentner/language-world
paper_authors: K. R. Zentner, Ryan Julian, Brian Ichter, Gaurav S. Sukhatme
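for: This paper combines two contributions, the first being "Language-World," an extension that lets a large language model operate in a simulated robotic environment using semi-structured natural language queries and scripted skills described in natural language.
methods: The method is Plan Conditioned Behavioral Cloning (PCBC), which finetunes the behavior of high-level plans using end-to-end demonstrations.
results: On Language-World, PCBC achieves strong performance in a variety of few-shot regimes, often achieving task generalization with as little as a single demonstration.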

Abstract
This paper combines two contributions. First, we introduce an extension of the Meta-World benchmark, which we call "Language-World," which allows a large language model to operate in a simulated robotic environment using semi-structured natural language queries and scripted skills described using natural language. By using the same set of tasks as Meta-World, Language-World results can be easily compared to Meta-World results, allowing for a point of comparison between recent methods using Large Language Models (LLMs) and those using Deep Reinforcement Learning. Second, we introduce a method we call Plan Conditioned Behavioral Cloning (PCBC), that allows finetuning the behavior of high-level plans using end-to-end demonstrations. Using Language-World, we show that PCBC is able to achieve strong performance in a variety of few-shot regimes, often achieving task generalization with as little as a single demonstration. We have made Language-World available as open-source software at https://github.com/krzentner/language-world/.

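Summary
This paper combines two contributions. First, we introduce an extension of the Meta-World benchmark, which we call "Language-World," that lets a large language model operate in a simulated robotic environment using semi-structured natural language queries and scripted skills described in natural language. Because it uses the same task set as Meta-World, Language-World results can be compared directly with Meta-World results, providing a point of comparison between recent methods using Large Language Models (LLMs) and those using Deep Reinforcement Learning. Second, we introduce a method called Plan Conditioned Behavioral Cloning (PCBC), which finetunes the behavior of high-level plans using end-to-end demonstrations. Using Language-World, we show that PCBC achieves strong performance in a variety of few-shot regimes, often achieving task generalization from as little as a single demonstration. Language-World is available as open-source software on GitHub.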

Data Augmentation for Emotion Detection in Small Imbalanced Text Data

paper_url: http://arxiv.org/abs/2310.17015
repo_url: https://github.com/a-koufakou/augemotiondetection
paper_authors: Anna Koufakou, Diego Grisales, Ragy Costa de jesus, Oscar Fox
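for: This study examines the impact of data augmentation on small, imbalanced datasets, with the goal of improving the performance of NLP models on emotion detection.
methods: The study applies four data augmentation methods (EDA, static and contextual embedding-based, and ProtAugment) to three different datasets.
results: Training the model with augmented data significantly improves emotion detection performance. The study also includes two case studies: using the popular ChatGPT API to paraphrase text, and using external data to augment the training set. Both show promising potential.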

Abstract
Emotion recognition in text, the task of identifying emotions such as joy or anger, is a challenging problem in NLP with many applications. One of the challenges is the shortage of available datasets that have been annotated with emotions. Certain existing datasets are small, follow different emotion taxonomies and display imbalance in their emotion distribution. In this work, we studied the impact of data augmentation techniques precisely when applied to small imbalanced datasets, for which current state-of-the-art models (such as RoBERTa) under-perform. Specifically, we utilized four data augmentation methods (Easy Data Augmentation EDA, static and contextual Embedding-based, and ProtAugment) on three datasets that come from different sources and vary in size, emotion categories and distributions. Our experimental results show that using the augmented data when training the classifier model leads to significant improvements. Finally, we conducted two case studies: a) directly using the popular chat-GPT API to paraphrase text using different prompts, and b) using external data to augment the training set. Results show the promising potential of these methods.

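Summary
Emotion recognition in text is a challenging task in natural language processing (NLP) with many applications. One challenge is the shortage of annotated data. Certain existing datasets are small, follow different emotion taxonomies, and have imbalanced emotion distributions. In this work, we study the impact of data augmentation techniques applied to small imbalanced datasets. Specifically, we apply four data augmentation methods (Easy Data Augmentation EDA, static and contextual embedding-based, and ProtAugment) to three datasets from different sources. Our experimental results show that using the augmented data when training the classifier leads to significant improvements. Finally, we conduct two case studies: a) directly using the popular ChatGPT API to paraphrase text with different prompts, and b) using external data to augment the training set. Results show the promising potential of these methods.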
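
Easy Data Augmentation (EDA) perturbs a sentence with four simple operations: synonym replacement, random insertion, random swap, and random deletion. The sketch below illustrates only the two operations that need no thesaurus. It is not the authors' code: the function names, the whitespace tokenization, the seed, and the parameter values are all assumptions chosen for illustration.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with `n_swaps` random pairs of positions swapped."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability `p`, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    # Never return an empty sentence: fall back to one randomly chosen word.
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
words = "I am so happy about the news today".split()
swapped = random_swap(words, n_swaps=2, rng=rng)
shortened = random_deletion(words, p=0.2, rng=rng)
```

Each augmented sentence keeps the emotion label of its source sentence, which is how such methods can add examples to under-represented classes in a small dataset.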

Quality > Quantity: Synthetic Corpora from Foundation Models for Closed-Domain Extractive Question Answering

paper_url: http://arxiv.org/abs/2310.16995
repo_url: https://github.com/saptarshi059/cdqa-v1-targetted-pretraining
paper_authors: Saptarshi Sengupta, Connor Heaton, Shreya Ghosh, Preslav Nakov, Prasenjit Mitra
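for: This work aims to improve closed-domain question answering by pre-training models in a targeted way so that they adapt to a specific domain.
methods: We propose "targeted pre-training": further pre-training a model on data relevant to a specific domain to improve its performance in that domain. We use Galactica to generate synthetic "targeted" corpora that fit the domain.
results: We run experiments on two biomedical extractive question answering datasets, achieving a new benchmark on COVID-QA and overall improvements on RadQA.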

Abstract
Domain adaptation, the process of training a model in one domain and applying it to another, has been extensively explored in machine learning. While training a domain-specific foundation model (FM) from scratch is an option, recent methods have focused on adapting pre-trained FMs for domain-specific tasks. However, our experiments reveal that either approach does not consistently achieve state-of-the-art (SOTA) results in the target domain. In this work, we study extractive question answering within closed domains and introduce the concept of targeted pre-training. This involves determining and generating relevant data to further pre-train our models, as opposed to the conventional philosophy of utilizing domain-specific FMs trained on a wide range of data. Our proposed framework uses Galactica to generate synthetic, "targeted" corpora that align with specific writing styles and topics, such as research papers and radiology reports. This process can be viewed as a form of knowledge distillation. We apply our method to two biomedical extractive question answering datasets, COVID-QA and RadQA, achieving a new benchmark on the former and demonstrating overall improvements on the latter. Code available at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.

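Summary
Domain adaptation, the process of training a model in one domain and applying it to another, has been explored extensively in machine learning. Training a domain-specific foundation model (FM) from scratch is one option, and recent methods adapt pre-trained FMs instead, but our experiments show that neither approach consistently achieves state-of-the-art (SOTA) results in the target domain. In this work, we study extractive question answering in closed domains and introduce the concept of targeted pre-training. This involves determining and generating relevant data to further pre-train our models, rather than relying on domain-specific FMs trained on a wide range of data. Our framework uses Galactica to generate synthetic "targeted" corpora that match specific writing styles and topics, such as research papers and radiology reports. This process can be viewed as a form of knowledge distillation. We apply our method to two biomedical extractive question answering datasets, COVID-QA and RadQA, achieving a new benchmark on the former and overall improvements on the latter. Code is available at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.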

How well can machine-generated texts be identified and can language models be trained to avoid identification?

paper_url: http://arxiv.org/abs/2310.16992
repo_url: None
paper_authors: Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek
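for: This study addresses the problem of distinguishing human-generated text from machine-generated text.
methods: We use five separate language models to generate synthetic tweets and find that shallow learning classifiers such as Naive Bayes reach a detection accuracy of 0.6 to 0.8.
results: We find that human detection and shallow classifiers differ markedly when text is generated at higher temperature values, while transformer-based classifiers reach a detection accuracy of 0.9 and above. In addition, refining our generative models with a reinforcement learning approach lets them evade BERT-based classifiers, with a detection accuracy of 0.15 or less.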

Abstract
With the rise of generative pre-trained transformer models such as GPT-3, GPT-NeoX, or OPT, distinguishing human-generated texts from machine-generated ones has become important. We refined five separate language models to generate synthetic tweets, uncovering that shallow learning classification algorithms, like Naive Bayes, achieve detection accuracy between 0.6 and 0.8. Shallow learning classifiers differ from human-based detection, especially when using higher temperature values during text generation, resulting in a lower detection rate. Humans prioritize linguistic acceptability, which tends to be higher at lower temperature values. In contrast, transformer-based classifiers have an accuracy of 0.9 and above. We found that using a reinforcement learning approach to refine our generative models can successfully evade BERT-based classifiers with a detection accuracy of 0.15 or less.

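Summary
With the rise of generative pre-trained transformer models such as GPT-3, GPT-NeoX, or OPT, distinguishing human-generated text from machine-generated text has become important. We refined five language models to generate synthetic tweets and found that shallow learning classifiers such as Naive Bayes achieve a detection accuracy between 0.6 and 0.8. Shallow learning classifiers differ from human detection, especially at higher temperature values during generation, which result in a lower detection rate. Humans prioritize linguistic acceptability, which tends to be higher at lower temperature values. In contrast, transformer-based classifiers reach an accuracy of 0.9 and above. We found that refining our generative models with a reinforcement learning approach can successfully evade BERT-based classifiers, with a detection accuracy of 0.15 or less.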

paper_url: http://arxiv.org/abs/2310.16968
repo_url: None
paper_authors: Nafis Irtiza Tripto, Mohammed Eunus Ali
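for: This study examines how social structures and real-world incidents influence contemporary literary fiction, using text analysis to explain these real-world phenomena.
methods: The study combines natural language processing (NLP) methods, including sentiment analysis, narrative summarization, and topic modeling, with visualization techniques to analyze and retrieve character interactions in contemporary fiction.
results: Character interaction graphs (or networks) help answer societal questions about contemporary literary fiction and are effective for specific assessments and information retrieval on influential Bengali fiction.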

Abstract
Social structures and real-world incidents often influence contemporary literary fiction. Existing research in literary fiction analysis explains these real-world phenomena through the manual critical analysis of stories. Conventional Natural Language Processing (NLP) methodologies, including sentiment analysis, narrative summarization, and topic modeling, have demonstrated substantial efficacy in analyzing and identifying similarities within fictional works. However, the intricate dynamics of character interactions within fiction necessitate a more nuanced approach that incorporates visualization techniques. Character interaction graphs (or networks) emerge as a highly suitable means for visualization and information retrieval from the realm of fiction. Therefore, we leverage character interaction graphs with NLP-derived features to explore a diverse spectrum of societal inquiries about contemporary culture's impact on the landscape of literary fiction. Our study involves constructing character interaction graphs from fiction, extracting relevant graph features, and exploiting these features to resolve various real-life queries. Experimental evaluation of influential Bengali fiction over half a century demonstrates that character interaction graphs can be highly effective in specific assessments and information retrieval from literary fiction. Our data and codebase are available at https://cutt.ly/fbMgGEM

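Summary
Social structures and real-world incidents often influence contemporary literary fiction. Existing research in literary analysis explains these real-world phenomena through manual critical analysis of stories. Conventional natural language processing (NLP) methods, including sentiment analysis, narrative summarization, and topic modeling, have proven effective on fictional works. However, the intricate relationships between characters in fiction call for a more nuanced approach that incorporates visualization techniques. We therefore use character interaction graphs (or networks) to visualize and retrieve information from fiction. Our study involves constructing character interaction graphs from fiction, extracting relevant graph features, and using these features to answer various real-life societal questions. Experimental evaluation on influential Bengali fiction spanning half a century shows that character interaction graphs can be highly effective in specific assessments and information retrieval. Our data and codebase are available at https://cutt.ly/fbMgGEM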
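
The abstract does not say how interactions between characters are detected. As a rough illustration only, the sketch below assumes that two characters interact whenever their names appear in the same sentence, and uses the weighted degree of each character as one possible graph feature. The character names, the sentences, and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_interaction_graph(sentences, characters):
    """Count, for every pair of characters, the sentences in which both are named."""
    edges = Counter()
    for sentence in sentences:
        # Sort the names so each undirected edge has a single canonical key.
        present = sorted(c for c in characters if c in sentence)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the edge weights incident to each character."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


sentences = [
    "Amal met Bina at the river.",
    "Bina wrote a letter to Charu.",
    "Amal and Bina argued about the harvest.",
]
characters = {"Amal", "Bina", "Charu"}

edges = build_interaction_graph(sentences, characters)
degree = weighted_degree(edges)
```

Plain substring matching is used here for brevity; a real pipeline would more likely rely on named-entity recognition and coreference resolution to find character mentions.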

Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text Generation

paper_url: http://arxiv.org/abs/2310.16964
repo_url: https://github.com/langus0/critic-aware-decoding
paper_authors: Mateusz Lango, Ondřej Dušek
for: Mitigating hallucinations in neural data-to-text generation
methods: Combining probabilistic output of a generator language model with output of a special “text critic” classifier
results: Improved performance on WebNLG and OpenDialKG benchmarks

Abstract
Hallucination of text ungrounded in the input is a well-known problem in neural data-to-text generation. Many methods have been proposed to mitigate it, but they typically require altering model architecture or collecting additional data, and thus cannot be easily applied to an existing model. In this paper, we explore a new way to mitigate hallucinations by combining the probabilistic output of a generator language model (LM) with the output of a special "text critic" classifier, which guides the generation by assessing the match between the input data and the text generated so far. Our method does not need any changes to the underlying LM's architecture or training procedure and can thus be combined with any model and decoding operating on word probabilities. The critic does not need any additional training data, using the base LM's training data and synthetic negative examples. Our experimental results show that our method improves over the baseline on the WebNLG and OpenDialKG benchmarks.

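Summary
Hallucination of text that is not grounded in the input is a well-known problem in neural data-to-text generation. Many methods have been proposed to mitigate it, but they typically require altering the model architecture or collecting additional data, so they cannot easily be applied to an existing model. In this paper, we explore a new way to mitigate hallucinations by combining the probabilistic output of a generator language model (LM) with the output of a special "text critic" classifier, which assesses the match between the input data and the text generated so far. Our method needs no changes to the underlying LM's architecture or training procedure, so it can be combined with any model and any decoding that operates on word probabilities. The critic needs no additional training data either: it is trained on the base LM's training data and synthetic negative examples. Our experimental results show that the method improves over the baseline on the WebNLG and OpenDialKG benchmarks.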
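
The abstract does not give the exact rule for combining the generator and the critic. One plausible reading, shown here purely as a sketch, multiplies each candidate token's LM probability by the critic's probability that the text remains consistent with the input data, then renormalizes. The function name, the `weight` exponent, and the toy probabilities are assumptions and are not taken from the paper or its repository.

```python
def rescore_with_critic(lm_probs, critic_probs, weight=1.0):
    """Combine next-token LM probabilities with critic probabilities.

    lm_probs:     token -> P_LM(token | prefix)
    critic_probs: token -> P_critic(prefix + token matches the input data)
    Returns a distribution proportional to P_LM * P_critic ** weight.
    """
    unnormalized = {
        token: p * (critic_probs[token] ** weight) for token, p in lm_probs.items()
    }
    total = sum(unnormalized.values())
    return {token: value / total for token, value in unnormalized.items()}


# Toy example: the LM prefers "Paris", but the critic judges that it
# contradicts the input record, so the combined score prefers "London".
lm_probs = {"Paris": 0.5, "London": 0.4, "Berlin": 0.1}
critic_probs = {"Paris": 0.1, "London": 0.9, "Berlin": 0.5}

combined = rescore_with_critic(lm_probs, critic_probs)
best = max(combined, key=combined.get)
```

Setting `weight=0` recovers the unmodified LM distribution, which makes the critic's influence easy to tune or switch off without touching the generator.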

Muslim-Violence Bias Persists in Debiased GPT Models

paper_url: http://arxiv.org/abs/2310.18368
repo_url: None
paper_authors: Babak Hemmatian, Razan Baltaji, Lav R. Varshney
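for: This study investigates whether GPT-3 models are biased against Muslims and how the prompt format affects that bias.
methods: The study runs pre-registered replication experiments on the Instruct version of GPT-3, further pre-registered experiments that use common names associated with each religion in the prompts, and replications with ChatGPT.
results: In the Instruct version of GPT-3, direct prompts yield few violent completions and only weak anti-Muslim bias, but prompts containing common names associated with the religions raise the rate of violent completions several-fold and reveal a second-order bias against Muslims. ChatGPT shows a similar, strong bias. The study also finds religion-specific violent themes containing highly offensive ideas.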

Abstract
Abid et al. (2021) showed a tendency in GPT-3 to generate violent completions when prompted about Muslims, compared with other religions. Two pre-registered replication attempts found few violent completions and only the weakest anti-Muslim bias in the Instruct version, fine-tuned to eliminate biased and toxic outputs. However, more pre-registered experiments showed that using common names associated with the religions in prompts increases several-fold the rate of violent completions, revealing a highly significant second-order bias against Muslims. Our content analysis revealed religion-specific violent themes containing highly offensive ideas regardless of prompt format. Replications with ChatGPT suggest that any effects of GPT-3's de-biasing have disappeared with continued model development, as this newer model showed both a strong Muslim-violence bias and rates of violent completions closer to Abid et al. (2021). Our results show the need for continual de-biasing of models in ways that address higher-order associations.

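Summary
Abid et al. (2021) showed that GPT-3 tends to generate violent completions when prompted about Muslims, compared with other religions. Two pre-registered replication attempts found few violent completions and only the weakest anti-Muslim bias in the Instruct version, which was fine-tuned to eliminate biased and toxic outputs. However, further pre-registered experiments showed that using common names associated with the religions in prompts increases the rate of violent completions several-fold, revealing a highly significant second-order bias against Muslims. Our content analysis found religion-specific violent themes containing highly offensive ideas regardless of prompt format. Replications with ChatGPT suggest that any effects of GPT-3's de-biasing have disappeared, as this newer model showed both a strong Muslim-violence bias and rates of violent completions close to those of Abid et al. (2021). Our results show the need for continual de-biasing of models in ways that address higher-order associations.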

Zephyr: Direct Distillation of LM Alignment

paper_url: http://arxiv.org/abs/2310.16944
repo_url: https://github.com/huggingface/alignment-handbook
paper_authors: Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf
for: This paper aims to improve the intent alignment of chat models.
methods: The paper uses distilled supervised fine-tuning (dSFT) and preference data from AI Feedback (AIF) to learn a chat model efficiently.
results: The final result is Zephyr-7B, a 7B-parameter chat model that sets a new state of the art on chat benchmarks and requires no human annotation.

Abstract
We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.

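Summary
We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) to larger models significantly improves task accuracy, but these models do not respond well to natural prompts. To distill this property, we use preference data from AI Feedback (AIF) and apply distilled direct preference optimization (dDPO) to learn a chat model with better intent alignment. The approach needs only a few hours of training, no additional sampling during fine-tuning, and no human annotation. The final result, Zephyr-7B, sets the state of the art on chat benchmarks for 7B parameter models and surpasses Llama2-Chat-70B, the best open-access RLHF-based model, on MT-Bench. Code, models, data, and tutorials are available at https://github.com/huggingface/alignment-handbook.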
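
The abstract names distilled direct preference optimization (dDPO) without stating its objective. As a rough sketch, assuming dDPO applies the standard DPO loss to response pairs ranked by the teacher model, the loss for a single pair could be computed as below. The function name, the `beta` value, and the example log-probabilities are illustrative assumptions, not values from the paper or the alignment-handbook repository.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    loss = -log(sigmoid(beta * margin)), where
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected).
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The policy has moved towards the preferred response: lower loss.
aligned = dpo_loss(policy_chosen=-5.0, policy_rejected=-9.0,
                   ref_chosen=-7.0, ref_rejected=-7.0)
# The policy has moved towards the rejected response: higher loss.
misaligned = dpo_loss(policy_chosen=-9.0, policy_rejected=-5.0,
                      ref_chosen=-7.0, ref_rejected=-7.0)
```

Because the loss depends only on log-probabilities of responses already in the dataset, training under this objective needs no sampling from the policy, which is consistent with the abstract's remark that no additional sampling is required during fine-tuning.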

Learning Transfers over Several Programming Languages

paper_url: http://arxiv.org/abs/2310.16937
repo_url: https://github.com/Sfedfcv/redesigned-pancake
paper_authors: Razan Baltaji, Saurabh Pujar, Louis Mandel, Martin Hirzel, Luca Buratti, Lav Varshney
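for: This paper examines the feasibility and effectiveness of cross-lingual transfer learning for programming languages.
methods: The paper uses a transformer-based large language model and runs extensive experiments across 11 to 41 programming languages to explore four questions: how well cross-lingual transfer works across different language pairs; how to best choose a source language for a given task and target language; which characteristics of a language pair predict transfer performance; and how these depend on the task.
results: The study finds that cross-lingual transfer yields gains on multiple tasks and that choosing a suitable source language improves the outcome. It also identifies characteristics of language pairs that predict transfer performance, which can inform task and language-pair selection.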

Abstract
Large language models (LLMs) have recently become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pretraining and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or new languages) miss out on the benefits of LLMs. Cross-lingual transfer learning uses data from a source language to improve model performance on a target language. It has been well-studied for natural languages, but has received little attention for programming languages. This paper reports extensive experiments on four tasks using a transformer-based LLM and 11 to 41 programming languages to explore the following questions. First, how well cross-lingual transfer works for a given task across different language pairs. Second, given a task and target language, how to best choose a source language. Third, the characteristics of a language pair that are predictive of transfer performance, and fourth, how that depends on the given task.

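Summary
Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pretraining and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. However, many programming languages are low-resource, lacking labeled samples for most tasks and often even unlabeled samples. Users of low-resource languages (e.g., legacy or new languages) therefore miss out on the benefits of LLMs. Cross-lingual transfer learning uses data from a source language to improve model performance on a target language. It has been well studied for natural languages but has received little attention for programming languages. This paper reports extensive experiments using a transformer-based LLM and 11 to 41 programming languages to explore the following questions: 1. How well does cross-lingual transfer work for a given task across different language pairs? 2. Given a task and target language, how should a source language be chosen? 3. Which characteristics of a language pair are predictive of transfer performance? 4. How does that depend on the given task?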

Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors

paper_url: http://arxiv.org/abs/2310.16924
repo_url: https://github.com/n-mehandru/physicianqe
paper_authors: Nikita Mehandru, Sweta Agrawal, Yimin Xiao, Elaine C Khoong, Ge Gao, Marine Carpuat, Niloufar Salehi
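for: This paper aims to make machine translation (MT) more dependable in practical use, in particular by helping users make informed decisions about when to rely on its output.
methods: The paper uses quality estimation techniques to assess MT quality automatically and runs a human study in a realistic usage scenario to evaluate how effective these techniques are.
results: Interventions based on quality estimation improve appropriate reliance on MT output, but backtranslation helps physicians detect more critical errors that quality estimation alone misses.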

Abstract
A major challenge in the practical use of Machine Translation (MT) is that users lack guidance to make informed decisions about when to rely on outputs. Progress in quality estimation research provides techniques to automatically assess MT quality, but these techniques have primarily been evaluated in vitro by comparison against human judgments outside of a specific context of use. This paper evaluates quality estimation feedback in vivo with a human study simulating decision-making in high-stakes medical settings. Using Emergency Department discharge instructions, we study how interventions based on quality estimation versus backtranslation assist physicians in deciding whether to show MT outputs to a patient. We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.

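Summary
A major challenge in the practical use of Machine Translation (MT) is that users lack guidance for making informed decisions about when to rely on its outputs. Progress in quality estimation research provides techniques for assessing MT quality automatically, but these techniques have mostly been evaluated in vitro against human judgments, outside a specific context of use. This paper evaluates them in vivo with a human study that simulates physician decision-making in high-stakes medical settings, comparing interventions based on quality estimation with interventions based on backtranslation. We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.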

Divide et Impera: Multi-Transformer Architectures for Complex NLP-Tasks

paper_url: http://arxiv.org/abs/2310.16897
repo_url: None
paper_authors: Solveig Helland, Elena Gavagnin, Alexandre de Spindler
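for: Solving complex natural language processing tasks, such as reducing gender bias.
methods: Dividing the complex task into simpler subtasks and fine-tuning multiple transformer models, one per subtask, for better controllability.
results: For reducing gender bias, fine-tuning multiple models performs better than using a single model.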

Abstract
The growing capabilities of transformer models pave the way for solving increasingly complex NLP tasks. A key to supporting application-specific requirements is the ability to fine-tune. However, compiling a fine-tuning dataset tailored to complex tasks is tedious and results in large datasets, limiting the ability to control transformer output. We present an approach in which complex tasks are divided into simpler subtasks. Multiple transformer models are fine-tuned to one subtask each, and lined up to accomplish the complex task. This simplifies the compilation of fine-tuning datasets and increases overall controllability. Using the example of reducing gender bias as a complex task, we demonstrate our approach and show that it performs better than using a single model.

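Summary
The growing capabilities of transformer models pave the way for solving increasingly complex natural language processing tasks. Fine-tuning is key, but compiling a fine-tuning dataset for a complex task is tedious and yields large datasets, which limits control over transformer output. We present an approach that divides a complex task into simpler subtasks. Multiple transformer models are each fine-tuned on one subtask and lined up to accomplish the complex task. This simplifies the compilation of fine-tuning datasets and increases overall controllability. Using gender bias reduction as an example of a complex task, we demonstrate the approach and show that it performs better than a single model.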

Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution

paper_url: http://arxiv.org/abs/2310.16834
repo_url: None
paper_authors: Aaron Lou, Chenlin Meng, Stefano Ermon
for: This paper aims to improve the performance of diffusion models on discrete data domains, such as natural language, by proposing a novel discrete score matching loss called score entropy.
methods: The proposed method, called Score Entropy Discrete Diffusion (SEDD), uses a denoising variant of the score entropy loss to efficiently optimize the model for maximum likelihood training.
results: The SEDD model achieves highly competitive likelihoods compared to the baseline GPT-2 model, and has several algorithmic advantages such as learning a more faithful sequence distribution, trading off compute for generation quality, and enabling arbitrary infilling beyond the standard left to right prompting.

Abstract
Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel discrete score matching loss that is more stable than existing methods, forms an ELBO for maximum likelihood training, and can be efficiently optimized with a denoising variant. We scale our Score Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2, achieving highly competitive likelihoods while also introducing distinct algorithmic advantages. In particular, when comparing similarly sized SEDD and GPT-2 models, SEDD attains comparable perplexities (normally within $+10\%$ of and sometimes outperforming the baseline). Furthermore, SEDD models learn a more faithful sequence distribution (around $4\times$ better compared to GPT-2 models with ancestral sampling as measured by large models), can trade off compute for generation quality (needing only $16\times$ fewer network evaluations to match GPT-2), and enables arbitrary infilling beyond the standard left to right prompting.

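Summary
Despite their groundbreaking performance on many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Standard diffusion models rely on the well-established theory of score matching, but efforts to generalize it to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel discrete score matching loss that is more stable than existing methods, forms an ELBO for maximum likelihood training, and can be optimized efficiently with a denoising variant. We scale our Score Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2, achieving highly competitive likelihoods while also introducing distinct algorithmic advantages. In particular, compared with similarly sized GPT-2 models, SEDD attains comparable perplexities (normally within $+10\%$ of the baseline and sometimes outperforming it). Furthermore, SEDD models learn a more faithful sequence distribution (around $4\times$ better than GPT-2 models with ancestral sampling), can trade off compute for generation quality (needing only $16\times$ fewer network evaluations to match GPT-2), and enable arbitrary infilling beyond standard left-to-right prompting.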

Language Agnostic Code Embeddings

paper_url: http://arxiv.org/abs/2310.16803
repo_url: https://github.com/snknitin/Multilingual-Embeddings-using-ACS-for-Cross-lingual-NLP
paper_authors: Saiteja Utpala, Alex Gu, Pin Yu Chen
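for: This study examines the code embeddings of multilingual code models, focusing on their cross-lingual capabilities across programming languages.
methods: Through probing experiments, the study finds that code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and another that focuses mainly on semantics and is agnostic to language details.
results: Isolating and eliminating the language-specific component yields significant improvements on downstream code retrieval tasks, with an increase of up to +17 in MRR.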

Abstract
Recently, code language models have achieved notable advancements in addressing a diverse array of essential code comprehension and generation tasks. Yet, the field lacks a comprehensive deep dive and understanding of the code embeddings of multilingual code models. In this paper, we present a comprehensive study on multilingual code embeddings, focusing on the cross-lingual capabilities of these embeddings across different programming languages. Through probing experiments, we demonstrate that code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details, primarily focusing on semantics. Further, we show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks, leading to an absolute increase of up to +17 in the Mean Reciprocal Rank (MRR).
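
For readers unfamiliar with the metric, Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the correct item first appears. The sketch below is a generic implementation with made-up query and snippet identifiers; it is not the paper's evaluation code, and scoring a missing target as 0 is an assumption.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average 1/rank of the relevant item over all queries.

    ranked_results: query id -> list of retrieved item ids, best first
    relevant:       query id -> the single correct item id
    A query whose correct item is not retrieved contributes 0.
    """
    total = 0.0
    for query_id, ranking in ranked_results.items():
        target = relevant[query_id]
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)  # ranks start at 1
    return total / len(ranked_results)


ranked_results = {
    "q1": ["snippet_a", "snippet_b", "snippet_c"],  # correct item at rank 1
    "q2": ["snippet_x", "snippet_y", "snippet_z"],  # correct item at rank 2
    "q3": ["snippet_m", "snippet_n", "snippet_o"],  # correct item not retrieved
}
relevant = {"q1": "snippet_a", "q2": "snippet_y", "q3": "snippet_p"}

mrr = mean_reciprocal_rank(ranked_results, relevant)  # (1 + 0.5 + 0) / 3
```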

Detecting Pretraining Data from Large Language Models

paper_url: http://arxiv.org/abs/2310.16789
repo_url: https://github.com/swj0419/detect-pretrain-code
paper_authors: Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
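for: This paper studies the problem of detecting the pretraining data of large language models (LLMs).
methods: The paper proposes a new detection method, Min-K% Prob, based on a simple hypothesis: an unseen example is likely to contain a few low-probability words, while a seen example is less likely to contain such words. The method requires no knowledge of the pretraining corpus and no additional training, unlike previous detection methods.
results: Experiments show that Min-K% Prob achieves a 7.4% improvement over previous methods on WIKIMIA. It is also effective in real-world scenarios such as copyrighted book detection and contaminated downstream example detection.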

Abstract
Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book detection, and contaminated downstream example detection, and find it a consistently effective solution.

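Summary
Although large language models (LLMs) are widely deployed, the data used to train them is almost never disclosed. This data can reach trillions of tokens, so it very likely contains problematic text such as copyrighted material, personally identifiable information, and test data from widely reported reference benchmarks. At present we have no way to know which data of these types is included, or in what proportion. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM, without knowledge of the pretraining data, can we determine whether the model was trained on that text? To support this study, we introduce WIKIMIA, a dynamic benchmark that uses data created before and after model training to provide gold-truth labels for detection. We also propose a new detection method, Min-K% Prob, based on a simple hypothesis: an unseen example is likely to contain a few low-probability outlier words, whereas a seen example is less likely to contain such words. Unlike previous detection methods, Min-K% Prob needs no knowledge of the pretraining corpus and no additional training. In our experiments, Min-K% Prob improves on these earlier methods by 7.4% on WIKIMIA. We apply it to two real-world scenarios, copyrighted book detection and contaminated downstream example detection, and find it a consistently effective solution.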
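The hypothesis behind Min-K% Prob can be made concrete with a small sketch. The abstract does not give the exact scoring rule, so the code below assumes one plausible reading: average the log-probabilities of the k% least likely tokens in the text. The function name, the default `k`, and the log-probability values are all hypothetical; real values would come from the LLM under test.

```python
import math


def min_k_percent_prob(token_log_probs, k=20.0):
    """Average log-probability of the k% least likely tokens in a text.

    Assumed scoring rule for illustration; a higher (less negative) score
    suggests the text looks more like something the model has seen.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Keep at least one token so short texts still get a score.
    n = max(1, math.ceil(len(token_log_probs) * k / 100.0))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


# Made-up per-token log-probabilities, for illustration only.
seen_like = [-0.2, -0.5, -0.1, -0.9, -0.3, -0.4, -0.6, -0.2, -0.7, -0.3]
unseen_like = [-0.2, -0.5, -0.1, -6.5, -0.3, -0.4, -7.2, -0.2, -0.7, -0.3]

score_seen = min_k_percent_prob(seen_like, k=20)
score_unseen = min_k_percent_prob(unseen_like, k=20)
```

In this toy input the two outlier tokens pull `score_unseen` well below `score_seen`. A detector would then compare the score against a threshold, and choosing that threshold is left open here.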

Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

paper_url: http://arxiv.org/abs/2310.16781
repo_url: None
paper_authors: Morris Alper, Hadar Averbuch-Elor
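for: This study examines whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion.
methods: It uses zero-shot knowledge probing to investigate the knowledge inherent in these models.
results: The study finds strong evidence that these models do show sound symbolism, paralleling the well-known kiki-bouba effect in psycholinguistics, and it offers a computational method for demonstrating sound symbolism and understanding its nature.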

Abstract
Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regards to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.

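Summary
Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has found non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is especially salient in cross-modal associations between language and the visual domain. We therefore ask whether vision-and-language models such as CLIP and Stable Diffusion reflect sound symbolism. Using zero-shot knowledge probing, we find that these models do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a method for demonstrating sound symbolism and understanding its nature using computational tools. The code will be made publicly available.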

IntenDD: A Unified Contrastive Learning Approach for Intent Detection and Discovery

paper_url: http://arxiv.org/abs/2310.16761
repo_url: None
paper_authors: Bhavuk Singhal, Ashim Gupta, Shivasankaran V P, Amrith Krishna
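for: This paper proposes a single approach to intent detection and intent discovery in task-oriented dialogue systems, covering both multiclass and multilabel classification setups.
methods: IntenDD is a unified approach built on a shared utterance encoding backbone. It learns representations with an entirely unsupervised contrastive learning strategy in which pseudo-labels for unlabeled utterances are generated from their lexical features. For the classification tasks it adds a two-step post-processing setup based on modified adsorption.
results: In extensive evaluations on several benchmark datasets, IntenDD consistently outperforms competitive baselines. On average it improves by 2.32%, 1.26%, and 1.52% on few-shot multiclass classification, few-shot multilabel classification, and intent discovery, respectively.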

Abstract
Identifying intents from dialogue utterances forms an integral component of task-oriented dialogue systems. Intent-related tasks are typically formulated either as a classification task, where the utterances are classified into predefined categories or as a clustering task when new and previously unknown intent categories need to be discovered from these utterances. Further, the intent classification may be modeled in a multiclass (MC) or multilabel (ML) setup. While typically these tasks are modeled as separate tasks, we propose IntenDD, a unified approach leveraging a shared utterance encoding backbone. IntenDD uses an entirely unsupervised contrastive learning strategy for representation learning, where pseudo-labels for the unlabeled utterances are generated based on their lexical features. Additionally, we introduce a two-step post-processing setup for the classification tasks using modified adsorption. Here, first, the residuals in the training data are propagated followed by smoothing the labels both modeled in a transductive setting. Through extensive evaluations on various benchmark datasets, we find that our approach consistently outperforms competitive baselines across all three tasks. On average, IntenDD reports percentage improvements of 2.32%, 1.26%, and 1.52% in their respective metrics for few-shot MC, few-shot ML, and the intent discovery tasks respectively.

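Summary
Identifying intents from dialogue utterances is an integral component of task-oriented dialogue systems. Intent-related tasks are typically formulated either as classification, where utterances are assigned to predefined categories, or as clustering, when new intent categories must be discovered from the utterances. Intent classification may in turn be modeled in a multiclass (MC) or multilabel (ML) setup. These tasks are usually modeled separately; we instead propose IntenDD, a unified approach built on a shared utterance encoding backbone. IntenDD learns representations with an entirely unsupervised contrastive learning strategy in which pseudo-labels for unlabeled utterances are generated from their lexical features. We also introduce a two-step post-processing setup for the classification tasks using modified adsorption: the residuals in the training data are propagated first, and the labels are then smoothed, both in a transductive setting. In extensive evaluations on several benchmark datasets, our approach consistently outperforms competitive baselines on all three tasks. On average, IntenDD improves by 2.32%, 1.26%, and 1.52% on few-shot MC, few-shot ML, and intent discovery, respectively.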

DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages

paper_url: http://arxiv.org/abs/2310.16749
repo_url: https://github.com/vineet2104/disco
paper_authors: Vineet Bhat, Preethi Jyothi, Pushpak Bhattacharyya
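for: This study provides a high-quality human-annotated dataset for multilingual disfluency correction (DC) research.
methods: It analyzes state-of-the-art DC models on four important Indo-European languages: English, Hindi, German, and French.
results: The DC models reach F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German), and 92.97 (French). The study also shows that DC raises BLEU scores on a downstream machine translation task by 5.65 points on average.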

Abstract
Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs, before subsequent processing by downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French. We provide extensive analysis of results of state-of-the-art DC models across all four languages obtaining F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to 5.65 points increase in BLEU scores on average when used in conjunction with a state-of-the-art Machine Translation (MT) system. We release code to run our experiments along with our annotated dataset here.

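Summary
Disfluency correction (DC) removes disfluent elements such as fillers, repetitions, and corrections from spoken utterances to produce readable and interpretable text. It is an important post-processing step applied to Automatic Speech Recognition (ASR) output before downstream language understanding tasks. Existing DC research has focused mainly on English because large-scale open-source datasets are lacking. Toward multilingual disfluency correction, we release a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German, and French. We analyze state-of-the-art DC models across all four languages and obtain F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German), and 92.97 (French). To demonstrate the benefit of DC for downstream tasks, we show that it raises BLEU scores by 5.65 points on average when combined with a state-of-the-art Machine Translation (MT) system. We release our experiment code and annotated dataset.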

HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis

paper_url: http://arxiv.org/abs/2310.16746
repo_url: None
paper_authors: Nafis Irtiza Tripto, Adaku Uchendu, Thai Le, Mattia Setzu, Fosca Giannotti, Dongwon Lee
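for: This paper introduces HANSEN, a benchmark for authorship analysis of spoken texts, both human and AI-generated.
methods: It curates existing speech datasets that come with transcripts and creates AI-generated spoken text datasets using three prominent large language models: ChatGPT, PaLM2, and Vicuna13B.
results: The paper runs authorship attribution and author verification on the human-spoken datasets, along with human vs. AI spoken text detection. Attribution and verification perform about as well on spoken data as on written data, while AI-generated spoken text detection leaves much room for improvement.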

Abstract
Authorship Analysis, also known as stylometry, has been an essential aspect of Natural Language Processing (NLP) for a long time. Likewise, the recent advancement of Large Language Models (LLMs) has made authorship analysis increasingly crucial for distinguishing between human-written and AI-generated texts. However, these authorship analysis tasks have primarily been focused on written texts, not considering spoken texts. Thus, we introduce the largest benchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark). HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets. Together, it comprises 17 human datasets, and AI-generated spoken texts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate and demonstrate the utility of HANSEN, we perform Authorship Attribution (AA) & Author Verification (AV) on human-spoken datasets and conducted Human vs. AI spoken text detection using state-of-the-art (SOTA) models. While SOTA methods, such as, character ngram or Transformer-based model, exhibit similar AA & AV performance in human-spoken datasets compared to written ones, there is much room for improvement in AI-generated spoken text detection. The HANSEN benchmark is available at: https://huggingface.co/datasets/HANSEN-REPO/HANSEN.

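Summary
Authorship analysis, also known as stylometry, has long been an essential part of Natural Language Processing (NLP). The recent progress of large language models (LLMs) has made it increasingly important for distinguishing human-written from AI-generated text. However, authorship analysis tasks have focused mainly on written texts and have overlooked spoken texts. We therefore introduce HANSEN (Human ANd ai Spoken tExt beNchmark), the largest benchmark for spoken texts. HANSEN combines carefully curated existing speech datasets that come with transcripts and newly created AI-generated spoken text datasets. In total it comprises 17 human datasets plus AI-generated spoken texts produced with three prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate HANSEN and demonstrate its utility, we perform Authorship Attribution (AA) and Author Verification (AV) on the human-spoken datasets and run human vs. AI spoken text detection with state-of-the-art (SOTA) models. SOTA methods such as character n-gram or Transformer-based models reach AA and AV performance on human-spoken datasets similar to that on written ones, but AI-generated spoken text detection leaves much room for improvement. The HANSEN benchmark is available at https://huggingface.co/datasets/HANSEN-REPO/HANSEN.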
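The abstract names character n-gram models as one family of authorship attribution methods. The sketch below shows the general idea only and is not the benchmark's evaluation code: each speaker gets a character trigram profile, and an unseen transcript is assigned to the speaker with the most similar profile under cosine similarity. The speaker names and transcripts are invented for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(sample, profiles, n=3):
    """Return the candidate whose n-gram profile is closest to the sample."""
    grams = char_ngrams(sample, n)
    return max(profiles, key=lambda name: cosine(grams, profiles[name]))


# Hypothetical transcripts, for illustration only.
known = {
    "speaker_a": "well you know I think that we should you know wait and see",
    "speaker_b": "the results indicate that the proposed measure is statistically robust",
}
profiles = {name: char_ngrams(text) for name, text in known.items()}
predicted = attribute("you know I think we should wait", profiles)
```

Spoken transcripts tend to carry fillers and repetitions, which is one reason surface-level features such as character n-grams can remain informative for this task.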

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation

paper_url: http://arxiv.org/abs/2310.16738
repo_url: https://github.com/wangxieric/bias-crs
paper_authors: Xi Wang, Hossein A. Rahmani, Jiqun Liu, Emine Yilmaz
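for: This paper proposes two new data augmentation strategies to address several bias problems in conversational recommendation.
methods: Drawing inspiration from the success of generating data with language models and data augmentation techniques, it introduces two strategies, 'Once-Aug' and 'PopNudge'.
results: Extensive experiments on the ReDial and TG-ReDial datasets show that the data augmentation approaches consistently improve CRS techniques, and the paper offers additional insights on addressing multiple newly formulated biases.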

Abstract
Conversational Recommendation System (CRS) is a rapidly growing research area that has gained significant attention alongside advancements in language modelling techniques. However, the current state of conversational recommendation faces numerous challenges due to its relative novelty and limited existing contributions. In this study, we delve into benchmark datasets for developing CRS models and address potential biases arising from the feedback loop inherent in multi-turn interactions, including selection bias and multiple popularity bias variants. Drawing inspiration from the success of generative data via using language models and data augmentation techniques, we present two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model performance while mitigating biases. Through extensive experiments on ReDial and TG-ReDial benchmark datasets, we show a consistent improvement of CRS techniques with our data augmentation approaches and offer additional insights on addressing multiple newly formulated biases.

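Summary
Conversational Recommendation System (CRS) is a rapidly growing research area that has gained attention alongside advances in language modelling techniques. However, conversational recommendation still faces many challenges because the field is relatively new and existing contributions are limited. In this study, we examine benchmark datasets for developing CRS models and address potential biases that arise from the feedback loop inherent in multi-turn interactions, including selection bias and multiple variants of popularity bias. Inspired by the success of generating data with language models and data augmentation techniques, we present two new strategies, 'Once-Aug' and 'PopNudge', to improve model performance while mitigating biases. Through extensive experiments on the ReDial and TG-ReDial benchmark datasets, we show that our data augmentation approaches consistently improve CRS techniques, and we offer additional insights on addressing multiple newly formulated biases.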

Disentangling Extraction and Reasoning in Multi-hop Spatial Reasoning

paper_url: http://arxiv.org/abs/2310.16731
repo_url: None
paper_authors: Roshanak Mirzaee, Parisa Kordjamshidi
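for: This paper examines the challenge of spatial reasoning over text and whether disentangling information extraction from reasoning helps to address it.
methods: The authors design several models that disentangle extraction and reasoning (either symbolic or neural) and compare them with state-of-the-art baselines that have no explicit design for these parts.
results: Experiments show that disentangling extraction and reasoning improves models' generalizability within realistic data domains.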

Abstract
Spatial reasoning over text is challenging as the models not only need to extract the direct spatial information from the text but also reason over those and infer implicit spatial relations. Recent studies highlight the struggles even large language models encounter when it comes to performing spatial reasoning over text. In this paper, we explore the potential benefits of disentangling the processes of information extraction and reasoning in models to address this challenge. To explore this, we design various models that disentangle extraction and reasoning (either symbolic or neural) and compare them with state-of-the-art (SOTA) baselines with no explicit design for these parts. Our experimental results consistently demonstrate the efficacy of disentangling, showcasing its ability to enhance models' generalizability within realistic data domains.

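Summary
Spatial reasoning over text is challenging because models must not only extract the direct spatial information from the text but also reason over it and infer implicit spatial relations. Recent studies show that even large language models struggle with spatial reasoning over text. In this paper, we explore the potential benefits of disentangling information extraction from reasoning in models. To do so, we design several models that disentangle extraction and reasoning, with either a symbolic or a neural reasoner, and compare them with existing baselines. Our experimental results consistently demonstrate the efficacy of disentangling, which improves models' generalizability within realistic data domains.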
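A toy pipeline can illustrate what disentangling extraction from reasoning means in the symbolic case. This is a sketch under strong assumptions, not the authors' models: the extractor is a hypothetical regular expression that only understands sentences of the form "X is left/right of Y", and the reasoner is a plain transitive closure over left-of facts.

```python
import re

# Hypothetical sentence pattern; real extractors are learned models.
PATTERN = re.compile(r"(\w+) is (left|right) of (?:the )?(\w+)")


def extract_relations(text):
    """Extraction step: collect explicit (left_item, right_item) facts."""
    facts = set()
    for head, relation, tail in PATTERN.findall(text):
        # Normalise 'right of' into 'left of' so the reasoner sees one relation.
        facts.add((head, tail) if relation == "left" else (tail, head))
    return facts


def infer_left_of(facts):
    """Reasoning step: transitive closure over the extracted facts."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


story = "The box is left of the lamp. The chair is right of the lamp."
explicit = extract_relations(story)
inferred = infer_left_of(explicit)
```

The text never states that the box is left of the chair; that fact appears only after the reasoning step. Keeping the two steps separate also means either one can be swapped out, for example a neural extractor with the same symbolic reasoner.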

LLM Performance Predictors are good initializers for Architecture Search

paper_url: http://arxiv.org/abs/2310.16712
repo_url: None
paper_authors: Ganesh Jawahar, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Dujian Ding
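for: This paper uses large language models (LLMs) to build performance predictors (PP) that predict how a specific deep neural network architecture will perform on a downstream task.
methods: The authors design PP prompts for LLMs consisting of a role description, a set of instructions, hyperparameter definitions, and demonstration architectures. They further distill the predictions of the prompted LLM (LLM-PP) into a small regression model (LLM-Distill-PP) for cost-effective performance prediction, and propose a Hybrid-Search algorithm for NAS (HS-NAS) that uses LLM-Distill-PP for the initial part of the search and the baseline predictor for the rest.
results: On machine translation (MT) tasks, LLM-PP predicts architecture performance with a mean absolute error matching the state of the art and only a marginal degradation in rank correlation. HS-NAS performs very similarly to state-of-the-art NAS across benchmarks while reducing search hours by roughly 50%.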

Abstract
Large language models (LLMs) have become an integral component in solving a wide range of NLP tasks. In this work, we explore a novel use case of using LLMs to build performance predictors (PP): models that, given a specific deep neural network architecture, predict its performance on a downstream task. We design PP prompts for LLMs consisting of: (i) role: description of the role assigned to the LLM, (ii) instructions: set of instructions to be followed by the LLM to carry out performance prediction, (iii) hyperparameters: a definition of each architecture-specific hyperparameter and (iv) demonstrations: sample architectures along with their efficiency metrics and 'training from scratch' performance. For machine translation (MT) tasks, we discover that GPT-4 with our PP prompts (LLM-PP) can predict the performance of architecture with a mean absolute error matching the SOTA and a marginal degradation in rank correlation coefficient compared to SOTA performance predictors. Further, we show that the predictions from LLM-PP can be distilled to a small regression model (LLM-Distill-PP). LLM-Distill-PP models surprisingly retain the performance of LLM-PP largely and can be a cost-effective alternative for heavy use cases of performance estimation. Specifically, for neural architecture search (NAS), we propose a Hybrid-Search algorithm for NAS (HS-NAS), which uses LLM-Distill-PP for the initial part of search, resorting to the baseline predictor for rest of the search. We show that HS-NAS performs very similar to SOTA NAS across benchmarks, reduces search hours by 50% roughly, and in some cases, improves latency, GFLOPs, and model size.

Summary
Large language models (LLMs) can be used to build performance predictors (PP): models that, given a specific deep neural network architecture, predict its performance on a downstream task. We design PP prompts for LLMs with four parts: (i) role: a description of the role assigned to the LLM; (ii) instructions: a set of instructions to be followed by the LLM to carry out performance prediction; (iii) hyperparameters: a definition of each architecture-specific hyperparameter; (iv) demonstrations: sample architectures along with their efficiency metrics and 'training from scratch' performance. For machine translation (MT) tasks, we discover that GPT-4 with our PP prompts (LLM-PP) can predict the performance of an architecture with a mean absolute error matching the state of the art (SOTA) and a marginal degradation in rank correlation coefficient compared to SOTA performance predictors. Furthermore, we show that the predictions from LLM-PP can be distilled to a small regression model (LLM-Distill-PP). LLM-Distill-PP models surprisingly retain most of the performance of LLM-PP and can be a cost-effective alternative for heavy use cases of performance estimation. For neural architecture search (NAS), we propose a Hybrid-Search algorithm (HS-NAS) that uses LLM-Distill-PP for the initial part of the search and resorts to the baseline predictor for the rest of the search. We show that HS-NAS performs very similarly to SOTA NAS across benchmarks, reduces search hours by approximately 50%, and in some cases improves latency, GFLOPs, and model size.
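The four prompt parts and the hybrid search schedule can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the prompt layout, the hyperparameter names, the demonstration scores, and the `switch_at` fraction are invented, and no LLM is actually called.

```python
def build_pp_prompt(role, instructions, hyperparameters, demonstrations, query):
    """Assemble the four prompt parts, then append the architecture to score."""
    lines = [f"Role: {role}", "", "Instructions:"]
    lines += [f"- {step}" for step in instructions]
    lines += ["", "Hyperparameters:"]
    lines += [f"- {name}: {meaning}" for name, meaning in hyperparameters.items()]
    lines += ["", "Demonstrations:"]
    lines += [f"{arch} -> {score}" for arch, score in demonstrations]
    # The prompt ends with the query so the model completes the score.
    lines += ["", f"{query} ->"]
    return "\n".join(lines)


def pick_predictor(step, total_steps, distilled, baseline, switch_at=0.5):
    """Hybrid schedule: distilled predictor first, baseline predictor afterwards."""
    return distilled if step < switch_at * total_steps else baseline


# Hypothetical prompt content; the scores are made up.
prompt = build_pp_prompt(
    role="You are a performance estimator for machine translation models.",
    instructions=["Read the demonstrations.", "Predict the score of the last architecture."],
    hyperparameters={"layers": "number of decoder layers", "heads": "attention heads per layer"},
    demonstrations=[("layers=6, heads=8", 27.1), ("layers=3, heads=4", 25.4)],
    query="layers=4, heads=8",
)
```

The schedule function reflects only the high-level description of HS-NAS: a cheap distilled predictor early in the search and the baseline predictor later. Where the switch happens is a free parameter in this sketch.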

BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?

paper_url: http://arxiv.org/abs/2310.16681
repo_url: https://github.com/zephyr1022/babystories-utsa
paper_authors: Xingmeng Zhao, Tongnian Wang, Sheri Osborn, Anthony Rios
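for: This study explores whether reinforcement learning from human feedback (RLHF) improves language models pretrained from scratch on a small, human-like corpus.
methods: It compares two GPT-2 variants on storytelling tasks after RLHF fine-tuning.
results: The larger model performs better on storytelling tasks after RLHF fine-tuning. This suggests that RLHF may be more advantageous for larger models because of their higher learning and adaptation capacity, although more experiments are needed to confirm the finding.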

Abstract
Language models have seen significant growth in the size of their corpus, leading to notable performance improvements. Yet, there has been limited progress in developing models that handle smaller, more human-like datasets. As part of the BabyLM shared task, this study explores the impact of reinforcement learning from human feedback (RLHF) on language models pretrained from scratch with a limited training corpus. Comparing two GPT-2 variants, the larger model performs better in storytelling tasks after RLHF fine-tuning. These findings suggest that RLHF techniques may be more advantageous for larger models due to their higher learning and adaptation capacity, though more experiments are needed to confirm this finding. These insights highlight the potential benefits of RLHF fine-tuning for language models within limited data, enhancing their ability to maintain narrative focus and coherence while adhering better to initial instructions in storytelling tasks. The code for this work is publicly at https://github.com/Zephyr1022/BabyStories-UTSA.

Summary
Language models have experienced significant growth in their corpus size, leading to notable improvements in performance. However, there has been limited progress in developing models that can handle smaller, more human-like datasets. This study explores the impact of reinforcement learning from human feedback (RLHF) on language models pretrained from scratch with a limited training corpus. Comparing two GPT-2 variants, the larger model performs better in storytelling tasks after RLHF fine-tuning. These findings suggest that RLHF techniques may be more advantageous for larger models due to their higher learning and adaptation capacity, although more experiments are needed to confirm this finding. These insights highlight the potential benefits of RLHF fine-tuning for language models within limited data, enabling them to better maintain narrative focus and coherence while adhering to initial instructions in storytelling tasks. The code for this work is publicly available at the repository listed above.

SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning Framework for Emotion Recognition in Conversations

paper_url: http://arxiv.org/abs/2310.16676
repo_url: https://github.com/taoshi1998/sslcl
paper_authors: Tao Shi, Xiao Liang, Yaoyuan Liang, Xinyi Tong, Shao-Lun Huang
for: This paper addresses emotion recognition in conversations (ERC), which aims to detect the emotions expressed by speakers during a conversation.
methods: We propose an efficient, model-agnostic Supervised Sample-Label Contrastive Learning (SSLCL) framework, which eliminates the need for a large batch size and can be seamlessly integrated with existing ERC models without introducing any model-specific assumptions.
results: On two ERC benchmark datasets, IEMOCAP and MELD, the SSLCL framework demonstrates compatibility with existing ERC models and superior performance compared to existing state-of-the-art SCL methods.

Abstract
Emotion recognition in conversations (ERC) is a rapidly evolving task within the natural language processing community, which aims to detect the emotions expressed by speakers during a conversation. Recently, a growing number of ERC methods have focused on leveraging supervised contrastive learning (SCL) to enhance the robustness and generalizability of learned features. However, current SCL-based approaches in ERC are impeded by the constraint of large batch sizes and the lack of compatibility with most existing ERC models. To address these challenges, we propose an efficient and model-agnostic SCL framework named Supervised Sample-Label Contrastive Learning with Soft-HGR Maximal Correlation (SSLCL), which eliminates the need for a large batch size and can be seamlessly integrated with existing ERC models without introducing any model-specific assumptions. Specifically, we introduce a novel perspective on utilizing label representations by projecting discrete labels into dense embeddings through a shallow multilayer perceptron, and formulate the training objective to maximize the similarity between sample features and their corresponding ground-truth label embeddings, while minimizing the similarity between sample features and label embeddings of disparate classes. Moreover, we innovatively adopt the Soft-HGR maximal correlation as a measure of similarity between sample features and label embeddings, leading to significant performance improvements over conventional similarity measures. Additionally, multimodal cues of utterances are effectively leveraged by SSLCL as data augmentations to boost model performances. Extensive experiments on two ERC benchmark datasets, IEMOCAP and MELD, demonstrate the compatibility and superiority of our proposed SSLCL framework compared to existing state-of-the-art SCL methods. Our code is available at https://github.com/TaoShi1998/SSLCL.

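Summary
Emotion recognition in conversations (ERC) is a rapidly evolving task in natural language processing that aims to detect the emotions speakers express during a conversation. A growing number of recent ERC methods use supervised contrastive learning (SCL) to make learned features more robust and generalizable. However, current SCL-based ERC approaches are held back by their need for large batch sizes and by their incompatibility with most existing ERC models. To address these challenges, we propose an efficient, model-agnostic SCL framework named Supervised Sample-Label Contrastive Learning with Soft-HGR Maximal Correlation (SSLCL). It needs no large batch size and integrates easily with existing ERC models without any model-specific assumptions. Specifically, we project discrete labels into dense embeddings through a shallow multilayer perceptron and set the training objective to maximize the similarity between sample features and their ground-truth label embeddings while minimizing the similarity between sample features and the label embeddings of other classes. We also adopt the Soft-HGR maximal correlation as the similarity measure between sample features and label embeddings, which yields significant performance gains over conventional similarity measures. In addition, SSLCL uses the multimodal cues of utterances as data augmentations to improve model performance. Experiments on two ERC benchmark datasets, IEMOCAP and MELD, demonstrate the compatibility and superiority of SSLCL compared to existing state-of-the-art SCL methods. Our code is available at https://github.com/TaoShi1998/SSLCL.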
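The sample-label objective described above can be sketched as a cross-entropy over similarities between one sample feature and every label embedding. This is a simplified stand-in, not the SSLCL loss: a plain dot product replaces the Soft-HGR maximal correlation, and the label embeddings are fixed vectors rather than the output of a shallow multilayer perceptron. All numbers are hypothetical.

```python
import math


def sample_label_loss(sample_feature, label_embeddings, true_label):
    """Cross-entropy over sample-to-label similarities.

    Dot product is used here as the similarity; the paper uses Soft-HGR.
    Lower loss means the sample is closer to its own label embedding
    than to the embeddings of other classes.
    """
    sims = [
        sum(s * e for s, e in zip(sample_feature, embedding))
        for embedding in label_embeddings
    ]
    # Numerically stable log-sum-exp over all label similarities.
    peak = max(sims)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in sims))
    return log_norm - sims[true_label]


# Hypothetical 2-d embeddings for three emotion labels.
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feature = [2.0, 0.1]

loss_correct = sample_label_loss(feature, label_embeddings, true_label=0)
loss_wrong = sample_label_loss(feature, label_embeddings, true_label=2)
```

Because each sample is contrasted against a fixed number of label embeddings rather than against other samples in the batch, a loss of this shape does not depend on batch size, which matches the motivation given in the abstract.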

ChatGPT is a Potential Zero-Shot Dependency Parser

paper_url: http://arxiv.org/abs/2310.16654
repo_url: None
paper_authors: Boda Lin, Xinyi Zhou, Binghao Tang, Xiaocheng Gong, Si Li
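for: This paper investigates whether pre-trained language models can perform dependency parsing in the zero-shot scenario without any additional parser structure.
methods: It runs experiments and a linguistic analysis with the large language model ChatGPT.
results: ChatGPT shows potential as a zero-shot dependency parser, and the linguistic analysis reveals some unique preferences in its parsing outputs.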

Abstract
Pre-trained language models have been widely used in dependency parsing task and have achieved significant improvements in parser performance. However, it remains an understudied question whether pre-trained language models can spontaneously exhibit the ability of dependency parsing without introducing additional parser structure in the zero-shot scenario. In this paper, we propose to explore the dependency parsing ability of large language models such as ChatGPT and conduct linguistic analysis. The experimental results demonstrate that ChatGPT is a potential zero-shot dependency parser, and the linguistic analysis also shows some unique preferences in parsing outputs.

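Summary
Pre-trained language models are widely used in dependency parsing and have brought significant improvements in parser performance. It remains an understudied question, however, whether pre-trained language models can exhibit dependency parsing ability on their own in the zero-shot scenario, without any additional parser structure. In this paper, we explore the dependency parsing ability of large language models such as ChatGPT and conduct a linguistic analysis. The experimental results show that ChatGPT is a potential zero-shot dependency parser, and the linguistic analysis also reveals some unique preferences in its parsing outputs.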

Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

paper_url: http://arxiv.org/abs/2310.16616
repo_url: None
paper_authors: Yiming Lin, Xiao-Bo Jin, Qiufeng Wang, Kaizhu Huang
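for: This paper aims to improve the matching between text phrases and image pixels in panoptic narrative grounding and to reduce phrase-to-pixel mismatch.
methods: It proposes a new learning framework, the Deformable Attention Refined Matching Network (DRMN), which brings deformable attention into the iterative feature learning process to capture essential context information from pixels at different scales.
results: Experiments show that DRMN achieves new state-of-the-art performance on the PNG benchmark, with an average recall improvement of 3.5%.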

Abstract
Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. The current state-of-the-art methods first refine the representation of phrase by aggregating the most similar $k$ image pixels, and then match the refined text representations with the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores the contextual information, which can lead to phrase-to-pixel mis-match. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention in the iterative process of feature learning to incorporate essential context information of different scales of pixels. DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels. As such, DRMN can lead to accurate yet discriminative pixel representations, purify the top-$k$ most similar pixels, and consequently alleviate the phrase-to-pixel mis-match substantially. Experimental results show that our novel design significantly improves the matching results between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement of 3.5%. The codes are available in: https://github.com/JaMesLiMers/DRMN.

Summary
Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. Current state-of-the-art methods first refine the phrase representation by aggregating the most similar image pixels and then match the refined text representations against the pixels of the image feature map to produce segmentation results. However, simply aggregating sampled image features ignores contextual information, which can lead to phrase-to-pixel mismatch. In this paper, we propose a new learning framework called Deformable Attention Refined Matching Network (DRMN), which uses deformable attention within iterative feature learning to incorporate context information from pixels at different scales. DRMN iteratively re-encodes pixels with the deformable attention network after updating the pixel representations. As a result, DRMN yields accurate yet discriminative pixel representations, purifies the top-$k$ most similar pixels, and substantially reduces phrase-to-pixel mismatch. Experimental results show that our design significantly improves the matching between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement of 3.5%. The code is available at https://github.com/JaMesLiMers/DRMN.
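The baseline refinement step that the abstract criticises, aggregating the top-$k$ most similar pixels into the phrase representation, can be sketched as follows. The details are assumptions for illustration: similarity is a dot product, aggregation is a mean, and the refined phrase is the average of the old phrase vector and the pooled pixels. The deformable attention re-encoding that DRMN adds is not shown.

```python
def refine_phrase(phrase, pixels, k=2):
    """Refine a phrase vector using its k most similar pixel features."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Rank pixels by similarity to the phrase and keep the top k.
    top = sorted(pixels, key=lambda pixel: dot(phrase, pixel), reverse=True)[:k]
    # Mean-pool the selected pixel features, dimension by dimension.
    pooled = [sum(column) / len(top) for column in zip(*top)]
    # Blend the pooled pixels back into the phrase representation.
    return [(a + b) / 2 for a, b in zip(phrase, pooled)]


# Hypothetical 2-d features for one phrase and four pixels.
phrase = [1.0, 0.0]
pixels = [[0.9, 0.1], [0.8, 0.2], [-1.0, 0.0], [0.0, 1.0]]
refined = refine_phrase(phrase, pixels, k=2)
```

Each pixel is scored independently of its neighbours here, so the selection carries no contextual information. That is the gap the abstract says deformable attention is meant to close.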

On the Interplay between Fairness and Explainability

paper_url: http://arxiv.org/abs/2310.16607
repo_url: None
paper_authors: Stephanie Brandl, Emanuele Bugliarello, Ilias Chalkidis
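for: This paper studies how fairness and explainability interact, since reliable and trustworthy NLP applications need models that are both fair across demographics and explainable.
methods: It fine-tunes pre-trained language models with several methods for bias mitigation, which aims to improve fairness, and for rationale extraction, which aims to produce plausible explanations.
results: The study finds that bias mitigation algorithms do not always lead to fairer models, and that empirical fairness and explainability are orthogonal.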

Abstract
In order to build reliable and trustworthy NLP applications, models need to be both fair across different demographics and explainable. Usually these two objectives, fairness and explainability, are optimized and/or examined independently of each other. Instead, we argue that forthcoming, trustworthy NLP systems should consider both. In this work, we perform a first study to understand how they influence each other: do fair(er) models rely on more plausible rationales? and vice versa. To this end, we conduct experiments on two English multi-class text classification datasets, BIOS and ECtHR, that provide information on gender and nationality, respectively, as well as human-annotated rationales. We fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness; (ii) rationale extraction, which aims to produce plausible explanations. We find that bias mitigation algorithms do not always lead to fairer models. Moreover, we discover that empirical fairness and explainability are orthogonal.

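Summary
To build reliable and trustworthy natural language processing (NLP) applications, models need to be both fair across different demographics and explainable. These two objectives are usually optimized and/or examined independently of each other. We argue that forthcoming NLP systems should consider both. In this work, we perform a first study of how the two objectives influence each other: do fairer models rely on more plausible rationales, and vice versa? To this end, we run experiments on two English multi-class text classification datasets, BIOS and ECtHR, which provide information on gender and nationality, respectively, as well as human-annotated rationales. We fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness, and (ii) rationale extraction, which aims to produce plausible explanations. We find that bias mitigation algorithms do not always lead to fairer models. Moreover, we find that empirical fairness and explainability are orthogonal.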

Tailoring Personality Traits in Large Language Models via Unsupervisedly-Built Personalized Lexicons

paper_url: http://arxiv.org/abs/2310.16582
repo_url: None
paper_authors: Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
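for: This study aims to tailor personality traits within large language models (LLMs), allowing any combination of the Big Five factors (i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism) to be incorporated in a pluggable manner.
methods: The method uses Unsupervisedly-Built Personalized Lexicons (UBPL) to adjust the next-token probabilities predicted by the original LLM during decoding, encouraging the model to generate text that reflects the target personality traits.
results: Experiments show that the method can finely control the personality traits of LLMs and can be integrated into other LLMs without updating their parameters.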

Abstract
Personality plays a pivotal role in shaping human expression patterns, and empowering and manipulating large language models (LLMs) with personality traits holds significant promise in enhancing the user experience of LLMs. However, prior approaches either rely on fine-tuning LLMs on a corpus enriched with personalized expressions or necessitate the manual crafting of prompts to induce LLMs to produce personalized responses. The former approaches demand substantial time and resources for collecting sufficient training examples while the latter might fail in enabling the precise manipulation of the personality traits at a fine-grained level (e.g., achieving high agreeableness while reducing openness). In this study, we introduce a novel approach for tailoring personality traits within LLMs, allowing for the incorporation of any combination of the Big Five factors (i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism) in a pluggable manner. This is achieved by employing a set of Unsupervisedly-Built Personalized Lexicons (UBPL) that are utilized to adjust the probability of the next token predicted by the original LLMs during the decoding phase. This adjustment encourages the models to generate words present in the personalized lexicons while preserving the naturalness of the generated texts. Extensive experimentation demonstrates the effectiveness of our approach in finely manipulating LLMs' personality traits. Furthermore, our method can be seamlessly integrated into other LLMs without necessitating updates to their parameters.

Summary
Personality plays a crucial role in shaping human expression patterns, and empowering and manipulating large language models (LLMs) with personality traits holds great promise in enhancing the user experience of LLMs. However, previous approaches either rely on fine-tuning LLMs on a corpus enriched with personalized expressions or require manual crafting of prompts to induce LLMs to produce personalized responses. The former approaches demand substantial time and resources for collecting sufficient training examples, while the latter might fail to achieve precise manipulation of personality traits at a fine-grained level (e.g., achieving high agreeableness while reducing openness). In this study, we propose a novel approach for tailoring personality traits within LLMs, allowing for the incorporation of any combination of the Big Five factors (i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism) in a pluggable manner. This is achieved by employing a set of Unsupervisedly-Built Personalized Lexicons (UBPL) to adjust the probability of the next token predicted by the original LLMs during the decoding phase. This adjustment encourages the models to generate words present in the personalized lexicons while preserving the naturalness of the generated texts. Extensive experimentation demonstrates the effectiveness of our approach in finely manipulating LLMs' personality traits. Furthermore, our method can be seamlessly integrated into other LLMs without requiring updates to their parameters.
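The decoding-time adjustment can be illustrated with a minimal sketch. The exact adjustment rule is not given in the abstract, so this assumes a simple additive bonus in log space for every candidate token that appears in the lexicon, followed by renormalisation. The lexicon, the `strength` parameter, and the probabilities are hypothetical.

```python
import math


def adjust_next_token_probs(probs, lexicon, strength=1.0):
    """Boost lexicon tokens in a next-token distribution and renormalise.

    `probs` maps candidate tokens to strictly positive probabilities.
    """
    scores = {
        token: math.log(p) + (strength if token in lexicon else 0.0)
        for token, p in probs.items()
    }
    total = sum(math.exp(score) for score in scores.values())
    return {token: math.exp(score) / total for token, score in scores.items()}


# Hypothetical next-token distribution and a one-word lexicon
# for a trait such as agreeableness.
probs = {"glad": 0.2, "fine": 0.5, "annoyed": 0.3}
lexicon = {"glad"}
adjusted = adjust_next_token_probs(probs, lexicon, strength=1.0)
```

Tokens outside the lexicon keep their relative order and ratios, which is one way to read the abstract's claim that the generated text stays natural. The base model's parameters are never touched, so the same lexicon could in principle be plugged into a different model.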

paper_url: http://arxiv.org/abs/2310.16579
repo_url: https://github.com/HKBUNLP/WSDMS-EMNLP2023
paper_authors: Ruichao Yang, Wei Gao, Jing Ma, Hongzhan Lin, Zhiwei Yang
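for: This work addresses false and unverified information that spreads rapidly on social media.
methods: The authors propose a weakly supervised fake news debunking method based on Multiple Instance Learning (MIL). It needs only bag-level labels for training, yet infers both sentence-level misinformation and article-level veracity.
results: On three real-world benchmarks, the method outperforms existing state-of-the-art baselines at debunking fake news at both the sentence and article levels.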

Abstract
In recent years, we witness the explosion of false and unconfirmed information (i.e., rumors) that went viral on social media and shocked the public. Rumors can trigger versatile, mostly controversial stance expressions among social media users. Rumor verification and stance detection are different yet relevant tasks. Fake news debunking primarily focuses on determining the truthfulness of news articles, which oversimplifies the issue as fake news often combines elements of both truth and falsehood. Thus, it becomes crucial to identify specific instances of misinformation within the articles. In this research, we investigate a novel task in the field of fake news debunking, which involves detecting sentence-level misinformation. One of the major challenges in this task is the absence of a training dataset with sentence-level annotations regarding veracity. Inspired by the Multiple Instance Learning (MIL) approach, we propose a model called Weakly Supervised Detection of Misinforming Sentences (WSDMS). This model only requires bag-level labels for training but is capable of inferring both sentence-level misinformation and article-level veracity, aided by relevant social media conversations that are attentively contextualized with news sentences. We evaluate WSDMS on three real-world benchmarks and demonstrate that it outperforms existing state-of-the-art baselines in debunking fake news at both the sentence and article levels.

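Summary
Recent years have seen an explosion of false and unconfirmed information (rumors) going viral on social media, where rumors trigger varied and mostly controversial stance expressions among users. Fake news debunking usually judges the truthfulness of a whole article, which oversimplifies the problem because fake news often mixes truth and falsehood, so identifying the specific misinforming passages matters. This work studies a new task, detecting sentence-level misinformation, whose main challenge is the lack of training data with sentence-level veracity annotations. Inspired by Multiple Instance Learning (MIL), the authors propose Weakly Supervised Detection of Misinforming Sentences (WSDMS). The model needs only bag-level labels for training but infers both sentence-level misinformation and article-level veracity, aided by relevant social media conversations that are contextualized with the news sentences. On three real-world benchmarks, WSDMS outperforms existing state-of-the-art baselines at both the sentence and article levels.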
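To make the bag-level supervision idea concrete, here is a minimal sketch of one common MIL aggregation, noisy-OR, in which an article (the bag) is fake if at least one sentence (an instance) misinforms. The aggregation choice and the function names are assumptions for illustration; the abstract does not state which aggregation WSDMS uses.

```python
def article_fake_probability(sentence_probs):
    """Noisy-OR aggregation: P(article fake) = 1 - prod(1 - p_i).

    sentence_probs holds per-sentence misinformation probabilities produced
    by some sentence-level scorer (not shown here).
    """
    p_all_clean = 1.0
    for p in sentence_probs:
        p_all_clean *= (1.0 - p)
    return 1.0 - p_all_clean

def flag_sentences(sentence_probs, threshold=0.5):
    """Return the indices of sentences whose probability reaches the threshold."""
    return [i for i, p in enumerate(sentence_probs) if p >= threshold]

# Only the article-level label is needed for a training loss on
# article_fake_probability; the per-sentence scores come out as a by-product.
probs = [0.05, 0.9, 0.1]
article_p = article_fake_probability(probs)
suspicious = flag_sentences(probs)
```

Training against the aggregated article probability is what lets sentence-level predictions emerge without sentence-level labels.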

Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language Models

paper_url: http://arxiv.org/abs/2310.16570
repo_url: None
paper_authors: Paul Youssef, Osman Alperen Koraş, Meijie Li, Jörg Schlötterer, Christin Seifert
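for: This survey examines how the amount of factual knowledge in PLMs is quantified, since that knowledge helps explain their performance on downstream tasks and potentially justifies using them as knowledge bases.
methods: The paper surveys the methods and datasets used to probe PLMs for factual knowledge, and proposes a categorization of probing methods based on how their inputs, outputs, and the probed PLMs are adapted.
results: The survey synthesizes insights about knowledge retention and prompt optimization in PLMs, analyzes obstacles to adopting PLMs as knowledge bases, and outlines directions for future work.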

Abstract
Pre-trained Language Models (PLMs) are trained on vast unlabeled data, rich in world knowledge. This fact has sparked the interest of the community in quantifying the amount of factual knowledge present in PLMs, as this explains their performance on downstream tasks, and potentially justifies their use as knowledge bases. In this work, we survey methods and datasets that are used to probe PLMs for factual knowledge. Our contributions are: (1) We propose a categorization scheme for factual probing methods that is based on how their inputs, outputs and the probed PLMs are adapted; (2) We provide an overview of the datasets used for factual probing; (3) We synthesize insights about knowledge retention and prompt optimization in PLMs, analyze obstacles to adopting PLMs as knowledge bases and outline directions for future work.

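Summary

The contributions are: (1) a categorization scheme for factual probing methods based on how their inputs, outputs, and the probed PLMs are adapted; (2) an overview of the datasets used for factual probing; (3) an analysis of knowledge retention and prompt optimization in PLMs, together with the obstacles to adopting PLMs as knowledge bases and directions for future work.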

paper_url: http://arxiv.org/abs/2310.16568
repo_url: None
paper_authors: Palak Jain, Livio Baldini Soares, Tom Kwiatkowski
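for: This work proposes a single Transformer-based model and decoding process that both answers a question and retrieves evidence.
methods: 1-Pager uses constrained decoding to incrementally partition the retrieval corpus, selecting a document and an answer string. It is compared with comparable retrieve-and-read alternatives.
results: 1-Pager is competitive with comparable retrieve-and-read alternatives on both retrieval and answer accuracy, and it outperforms the equivalent closed-book question answering model by grounding predictions in an evidence corpus. It is not yet on par with more expensive systems that read many more documents before answering.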

Abstract
We present 1-Pager the first system that answers a question and retrieves evidence using a single Transformer-based model and decoding process. 1-Pager incrementally partitions the retrieval corpus using constrained decoding to select a document and answer string, and we show that this is competitive with comparable retrieve-and-read alternatives according to both retrieval and answer accuracy metrics. 1-Pager also outperforms the equivalent closed-book question answering model, by grounding predictions in an evidence corpus. While 1-Pager is not yet on-par with more expensive systems that read many more documents before generating an answer, we argue that it provides an important step toward attributed generation by folding retrieval into the sequence-to-sequence paradigm that is currently dominant in NLP. We also show that the search paths used to partition the corpus are easy to read and understand, paving a way forward for interpretable neural retrieval.

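Summary
The authors present 1-Pager, the first system that answers a question and retrieves evidence using a single Transformer-based model and decoding process. 1-Pager incrementally partitions the retrieval corpus with constrained decoding to select a document and an answer string, and it is competitive with comparable retrieve-and-read alternatives on both retrieval and answer accuracy. It also outperforms the equivalent closed-book question answering model by grounding predictions in an evidence corpus. Although 1-Pager is not yet on par with more expensive systems that read many more documents, the authors argue that it is an important step toward attributed generation because it folds retrieval into the sequence-to-sequence paradigm that currently dominates NLP. The search paths used to partition the corpus are also easy to read and understand, which points toward interpretable neural retrieval.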
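The idea of partitioning a corpus step by step under a decoding constraint can be sketched as follows. This is a toy illustration, not the 1-Pager implementation: the constraint here is simply that each generated keyword must occur in the documents still under consideration, and `score` is a hypothetical stand-in for the model's preference for a keyword given the question.

```python
def search_path(corpus, score, max_steps=5):
    """Narrow a corpus (doc_id -> text) by repeatedly choosing a keyword.

    At each step only words that occur in the remaining documents, and that
    actually split them, are allowed (the 'constraint'). The chosen keyword
    keeps just the documents that contain it.
    """
    remaining = dict(corpus)
    path = []
    for _ in range(max_steps):
        if len(remaining) <= 1:
            break
        # Map each allowed word to the set of remaining documents containing it.
        vocab = {}
        for doc_id, text in remaining.items():
            for word in set(text.lower().split()):
                vocab.setdefault(word, set()).add(doc_id)
        candidates = [w for w, ids in vocab.items()
                      if len(ids) < len(remaining) and w not in path]
        if not candidates:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        path.append(best)
        remaining = {d: remaining[d] for d in vocab[best]}
    return path, sorted(remaining)

corpus = {
    "d1": "paris is the capital of france",
    "d2": "berlin is the capital of germany",
    "d3": "france borders germany",
}
# Hypothetical scorer for the question "What is the capital of France?".
score = lambda w: 1.0 if w in {"capital", "france"} else 0.0
path, docs = search_path(corpus, score)
```

The returned `path` is a human-readable trace of how the corpus was narrowed, which is the property the abstract highlights as a route to interpretable retrieval.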

An Early Evaluation of GPT-4V(ision)

paper_url: http://arxiv.org/abs/2310.16534
repo_url: https://github.com/albertwy/gpt-4v-evaluation
paper_authors: Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, Bing Qin
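for: This study evaluates GPT-4V's abilities in visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
methods: The authors manually construct 656 test instances and carefully evaluate GPT-4V's outputs.
results: GPT-4V performs impressively on English visual-centric benchmarks but fails to recognize simple Chinese text in images; it shows inconsistent refusal behavior on questions about sensitive traits; it scores worse than GPT-4 (API) on language understanding tasks; few-shot prompting improves both its visual and language understanding; it struggles to find the nuances between two similar images and to solve easy math picture puzzles; and it shows non-trivial performance on modalities similar to images, such as video and thermal.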

Abstract
In this paper, we evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio. To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate the results of GPT-4V. The highlights of our findings are as follows: (1) GPT-4V exhibits impressive performance on English visual-centric benchmarks but fails to recognize simple Chinese texts in the images; (2) GPT-4V shows inconsistent refusal behavior when answering questions related to sensitive traits such as gender, race, and age; (3) GPT-4V obtains worse results than GPT-4 (API) on language understanding tasks including general language understanding benchmarks and visual commonsense knowledge evaluation benchmarks; (4) Few-shot prompting can improve GPT-4V's performance on both visual understanding and language understanding; (5) GPT-4V struggles to find the nuances between two similar images and solve the easy math picture puzzles; (6) GPT-4V shows non-trivial performance on the tasks of similar modalities to image, such as video and thermal. Our experimental results reveal the ability and limitations of GPT-4V and we hope our paper can provide some insights into the application and research of GPT-4V.

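Summary
This paper evaluates several abilities of GPT-4V: visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio. The authors manually construct 656 test instances and carefully evaluate GPT-4V's results. The main findings are: (1) GPT-4V performs impressively on English visual-centric benchmarks but fails to recognize simple Chinese text in images; (2) its refusal behavior is inconsistent on questions about sensitive traits such as gender, race, and age; (3) it obtains worse results than GPT-4 (API) on language understanding tasks, including general language understanding benchmarks and visual commonsense knowledge benchmarks; (4) few-shot prompting improves its performance on both visual and language understanding; (5) it struggles to find the nuances between two similar images and to solve easy math picture puzzles; (6) it shows non-trivial performance on modalities similar to images, such as video and thermal. The results reveal both the abilities and the limitations of GPT-4V, and the authors hope they inform its application and further research.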

CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval

paper_url: http://arxiv.org/abs/2310.16528
repo_url: None
paper_authors: Jindřich Helcl, Jindřich Libovický
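for: This paper describes the Charles University system submitted to the MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval.
methods: Both subtasks use the translate-test approach: unlabeled examples are first translated into English, inference is run with a strong task-specific model, and the labels are projected back into the original language. To keep the inferred tags at the correct positions, candidate positions are scored with a label-sensitive translation model.
results: The authors also experiment with fine-tuning the classification models on the translated data, but because of a domain mismatch between the development data and the shared task validation and test sets, the fine-tuned models could not outperform the baselines.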

Abstract
We present the Charles University system for the MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval. The goal of the shared task was to develop systems for named entity recognition and question answering in several under-represented languages. Our solutions to both subtasks rely on the translate-test approach. We first translate the unlabeled examples into English using a multilingual machine translation model. Then, we run inference on the translated data using a strong task-specific model. Finally, we project the labeled data back into the original language. To keep the inferred tags on the correct positions in the original language, we propose a method based on scoring the candidate positions using a label-sensitive translation model. In both settings, we experiment with finetuning the classification models on the translated data. However, due to a domain mismatch between the development data and the shared task validation and test sets, the finetuned models could not outperform our baselines.

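Summary
The authors present the Charles University system for the MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval, whose goal was to develop named entity recognition and question answering systems for several under-represented languages. Both subtasks are solved with the translate-test approach. Unlabeled examples are first translated into English with a multilingual machine translation model, inference is then run on the translated data with a strong task-specific model, and finally the labeled data is projected back into the original language. To keep the inferred tags at the correct positions in the original language, candidate positions are scored with a label-sensitive translation model. In both settings the authors experiment with fine-tuning the classification models on translated data, but because of a domain mismatch between the development data and the shared task validation and test sets, the fine-tuned models could not outperform the baselines.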

OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models

paper_url: http://arxiv.org/abs/2310.16517
repo_url: None
paper_authors: Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei, Zheng Yuan, Chang Zhou, Jingren Zhou
for: This paper aims to address the issue of occupational bias in instruction-tuning datasets for large language models (LLMs), which hinders the models’ ability to generate helpful responses to professional queries from practitioners in specific fields.
methods: The authors create an instruction-tuning dataset named “OccuQuest” that contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. They systematically request ChatGPT to generate responses to queries organized hierarchically by Occupation, Responsibility, Topic, and Question to ensure comprehensive coverage of occupational specialty inquiries.
results: The authors compare OccuQuest with three commonly used datasets (Dolly, ShareGPT, and WizardLM) and find that OccuQuest exhibits a more balanced distribution across occupations. They fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations, with a high win rate of 86.4% against WizardLM on the occu-quora set.

Abstract
The emergence of large language models (LLMs) has revolutionized natural language processing tasks. However, existing instruction-tuning datasets suffer from occupational bias: the majority of data relates to only a few occupations, which hampers the instruction-tuned LLMs to generate helpful responses to professional queries from practitioners in specific fields. To mitigate this issue and promote occupation-inclusive LLMs, we create an instruction-tuning dataset named OccuQuest, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. We systematically request ChatGPT, organizing queries hierarchically based on Occupation, Responsibility, Topic, and Question, to ensure a comprehensive coverage of occupational specialty inquiries. By comparing with three commonly used datasets (Dolly, ShareGPT, and WizardLM), we observe that OccuQuest exhibits a more balanced distribution across occupations. Furthermore, we assemble three test sets for comprehensive evaluation, an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set containing real-world questions from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on the occu-quora set, OccuLLaMA reaches a high win rate of 86.4% against WizardLM.

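Summary
Large language models (LLMs) have revolutionized natural language processing, but existing instruction-tuning datasets suffer from occupational bias: most of the data relates to only a few occupations, which limits the ability of instruction-tuned LLMs to give helpful answers to professional queries from practitioners in specific fields. To mitigate this and promote occupation-inclusive LLMs, the authors create an instruction-tuning dataset named OccuQuest, with 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. Queries to ChatGPT are organized hierarchically by Occupation, Responsibility, Topic, and Question to ensure comprehensive coverage of occupational specialty inquiries. Compared with three commonly used datasets (Dolly, ShareGPT, and WizardLM), OccuQuest shows a more balanced distribution across occupations. The authors also assemble three test sets: an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set of real-world questions from Quora. Fine-tuning LLaMA on OccuQuest yields OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations, reaching an 86.4% win rate against WizardLM on the occu-quora set.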
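The hierarchical organization of queries (Occupation, Responsibility, Topic, Question) can be sketched as a nested traversal. The data layout, the function name `build_queries`, and the prompt template below are all hypothetical illustrations; the paper's actual prompts are not given in the abstract.

```python
def build_queries(hierarchy):
    """Flatten an Occupation -> Responsibility -> Topic -> [Question] tree.

    Returns one record per question, keeping every level of the hierarchy so
    that coverage per occupation can be counted afterwards.
    """
    queries = []
    for occupation, responsibilities in hierarchy.items():
        for responsibility, topics in responsibilities.items():
            for topic, questions in topics.items():
                for question in questions:
                    queries.append({
                        "occupation": occupation,
                        "responsibility": responsibility,
                        "topic": topic,
                        # Hypothetical template, for illustration only.
                        "prompt": (f"As a {occupation} responsible for "
                                   f"{responsibility}, on the topic of "
                                   f"{topic}: {question}"),
                    })
    return queries

# Toy hierarchy with made-up entries.
hierarchy = {
    "nurse": {
        "patient care": {
            "medication": ["How should dosage errors be reported?"],
        },
    },
    "electrician": {
        "wiring": {
            "safety": ["How is a circuit isolated before work?",
                       "Which checks precede re-energizing?"],
        },
    },
}
queries = build_queries(hierarchy)
```

Counting records per `occupation` in the result is one simple way to inspect how balanced a dataset built this way is.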

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

paper_url: http://arxiv.org/abs/2310.16484
repo_url: None
paper_authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank, Ivan Titov
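for: This study investigates how and when the representational subspaces that underlie Natural Language Processing (NLP) emerge and interact during language model training.
methods: The authors use a novel information-theoretic probing suite that allows direct comparison of both task performance and representational subspaces. They analyze nine tasks covering syntax, semantics, and reasoning across 2M pre-training steps and five seeds.
results: The analysis identifies critical learning phases in which subspaces emerge, share information, and later disentangle to specialize. Syntactic knowledge is acquired rapidly, after 0.5% of full training. Later performance gains stem mainly from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Cross-task similarity shows that linguistically related tasks share information throughout training, most of all during the critical learning phase.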

Abstract
Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP), however there has been limited understanding regarding how and when during training various types of linguistic information emerge and interact. Leveraging a novel information theoretic probing suite, which enables direct comparisons of not just task performance, but their representational subspaces, we analyze nine tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.

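Summary
Representational spaces learned through language modeling are fundamental to Natural Language Processing (NLP), yet little is known about how and when different types of linguistic information emerge and interact during training. Using a novel information-theoretic probing suite that compares representational subspaces directly, and not only task performance, the authors analyze nine tasks covering syntax, semantics, and reasoning across 2M pre-training steps and five seeds. They identify critical learning phases in which subspaces emerge, share information, and later disentangle to specialize. Syntactic knowledge is acquired rapidly, after 0.5% of full training; continued improvements stem mainly from acquiring open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Linguistically related tasks share information throughout training, and more so during the critical phase than before or after. The findings have implications for model interpretability, multi-task learning, and learning from limited data.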

CLEX: Continuous Length Extrapolation for Large Language Models

paper_url: http://arxiv.org/abs/2310.16450
repo_url: https://github.com/damo-nlp-sg/clex
paper_authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
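for: This work extends the context window of LLMs so that they perform well in long-context applications.
methods: Continuous Length EXtrapolation (CLEX) generalizes Position Embedding (PE) scaling by modeling the continuous dynamics over the length scaling factor with ordinary differential equations, which overcomes the limits of existing PE scaling methods designed for specific lengths.
results: CLEX extends the context window to over 4x or almost 8x the training length with no deterioration in performance. On the practical LongBench benchmark, the model trained on a 4k length is competitive with state-of-the-art open-source models trained on context lengths up to 32k.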

Abstract
Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks, however, their exceptional capabilities are restricted within the preset context window of Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, demonstrate either notable limitations in their extrapolation abilities or sacrificing partial performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates the length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.

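Summary
Transformer-based Large Language Models (LLMs) have driven advances in many natural language processing tasks, but their capabilities are confined to the preset context window of the Transformer. Position Embedding (PE) scaling methods extend the context window to a specific length, but they either extrapolate poorly or sacrifice some performance inside the context window. Length extrapolation methods can in theory go beyond the training sequence length, yet often underperform in practical long-context applications. The authors propose Continuous Length EXtrapolation (CLEX), which generalizes PE scaling by modeling the continuous dynamics over the length scaling factor with ordinary differential equations, removing the restriction to specific lengths. By extending these dynamics to context lengths beyond the training sequence length, CLEX supports length extrapolation with strong performance on practical tasks. CLEX can be incorporated into LLMs that use Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experiments show that it extends the context window to over 4x or almost 8x the training length with no deterioration in performance, and on the LongBench benchmark the model trained on a 4k length is competitive with state-of-the-art open-source models trained on context lengths up to 32k.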
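For context, the kind of fixed-length PE scaling that CLEX generalizes can be sketched for Rotary Position Embedding. The sketch below shows plain linear position scaling with a constant factor `scale`; it is not the CLEX method, which instead models the dynamics over the scaling factor with an ODE. Function names and default values are assumptions for illustration.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position in a rotary embedding of size dim.

    scale > 1 compresses positions, so position p is treated like p / scale.
    """
    return [(position / scale) * base ** (-2.0 * i / dim)
            for i in range(dim // 2)]

def rotate(vec, angles):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# With scale=2, position 8192 receives the same angles that position 4096
# receives without scaling, so a model trained up to 4096 never sees
# rotation angles outside its training range.
scaled = rope_angles(8192, dim=8, scale=2.0)
unscaled = rope_angles(4096, dim=8, scale=1.0)
query = rotate([1.0, 0.0, 0.5, 0.5, 0.0, 1.0, 0.3, 0.7], scaled)
```

A constant `scale` ties the model to one target length, which is the limitation the abstract attributes to existing PE scaling methods.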

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

paper_url: http://arxiv.org/abs/2310.16436
repo_url: None
paper_authors: Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, Sibei Yang
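for: This study aims to improve the ability of AI systems to perform complex multimodal reasoning as humans do.
methods: The work builds on large language models (LLMs) that use chain of thought (CoT) to mimic human thinking. It analyzes the challenges that multimodality poses and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs".
results: The authors propose DDCoT prompting, which maintains a critical attitude through negative-space prompting and integrates the visual recognition capability of visual models into the joint reasoning process. The rationales it generates improve the reasoning of both large and small language models in zero-shot prompting and fine-tuning, significantly outperforming state-of-the-art methods, and show strong generalizability and explainability.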

Abstract
A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.

Summary
A long-standing goal of AI systems is to perform complex multimodal reasoning as humans do. Large language models (LLMs) have recently made remarkable progress in multi-step reasoning on the language modality by using chain of thought (CoT) to mimic human thinking. Transferring this progress to multimodal contexts is harder: it requires labor-intensive annotation and faces limits in flexibility, generalizability, and explainability. This work analyzes these challenges and presents two key insights for multimodal CoT reasoning: "keeping critical thinking" and "letting everyone do their jobs". It then proposes DDCoT prompting, which maintains a critical attitude through negative-space prompting and incorporates multimodality by first dividing the responsibility of LLMs into reasoning and recognition, and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT improve the reasoning of both large and small language models in zero-shot prompting and fine-tuning, significantly outperforming state-of-the-art methods, and show strong generalizability and explainability.

PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization

paper_url: http://arxiv.org/abs/2310.16427
repo_url: None
paper_authors: Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, Zhiting Hu
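for: This paper aims to automatically generate high-quality, expert-level prompts in order to improve the performance of large language models (LLMs).
methods: PromptAgent uses a principled planning algorithm rooted in Monte Carlo tree search to navigate the space of expert-level prompts. It also adopts human-like trial-and-error exploration, reflecting on model errors to derive precise expert-level insights and in-depth instructions.
results: On 12 tasks, PromptAgent significantly outperforms strong Chain-of-Thought and recent prompt optimization baselines. Extensive analyses show that it crafts expert-level, detailed, and domain-insightful prompts with high efficiency and generalizability.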

Abstract
Highly effective, task-specific prompts are often heavily engineered by experts to integrate detailed instructions and domain insights based on a deep understanding of both instincts of large language models (LLMs) and the intricacies of the target task. However, automating the generation of such expert-level prompts remains elusive. Existing prompt optimization methods tend to overlook the depth of domain knowledge and struggle to efficiently explore the vast space of expert-level prompts. Addressing this, we present PromptAgent, an optimization method that autonomously crafts prompts equivalent in quality to those handcrafted by experts. At its core, PromptAgent views prompt optimization as a strategic planning problem and employs a principled planning algorithm, rooted in Monte Carlo tree search, to strategically navigate the expert-level prompt space. Inspired by human-like trial-and-error exploration, PromptAgent induces precise expert-level insights and in-depth instructions by reflecting on model errors and generating constructive error feedback. Such a novel framework allows the agent to iteratively examine intermediate prompts (states), refine them based on error feedbacks (actions), simulate future rewards, and search for high-reward paths leading to expert prompts. We apply PromptAgent to 12 tasks spanning three practical domains: BIG-Bench Hard (BBH), as well as domain-specific and general NLP tasks, showing it significantly outperforms strong Chain-of-Thought and recent prompt optimization baselines. Extensive analyses emphasize its capability to craft expert-level, detailed, and domain-insightful prompts with great efficiency and generalizability.

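Summary
Highly effective task-specific prompts are usually engineered by experts, who combine detailed instructions with domain insight based on a deep understanding of both the instincts of large language models (LLMs) and the intricacies of the target task. Automating the generation of such expert-level prompts remains elusive: existing prompt optimization methods tend to overlook the depth of domain knowledge and struggle to explore the vast space of expert-level prompts efficiently. PromptAgent treats prompt optimization as a strategic planning problem and uses a principled planning algorithm, rooted in Monte Carlo tree search, to navigate that space. Inspired by human trial-and-error exploration, it derives precise expert-level insights and in-depth instructions by reflecting on model errors and generating constructive error feedback. The agent iteratively examines intermediate prompts (states), refines them based on error feedback (actions), simulates future rewards, and searches for high-reward paths leading to expert prompts. Applied to 12 tasks spanning BIG-Bench Hard (BBH) as well as domain-specific and general NLP tasks, PromptAgent significantly outperforms strong Chain-of-Thought and recent prompt optimization baselines, and analyses show that it crafts expert-level, detailed, and domain-insightful prompts with high efficiency and generalizability.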
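The selection step of Monte Carlo tree search over prompts (states) can be sketched with the standard UCT rule, which balances a prompt's average reward against how rarely it has been tried. Whether PromptAgent uses exactly this form, and the exploration constant `c`, are assumptions; the dictionary layout is hypothetical.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT: mean reward plus an exploration bonus.

    Unvisited children get +inf so that each is tried at least once.
    """
    if visits == 0:
        return float("inf")
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits, c=1.4):
    """Pick the candidate prompt with the highest UCT score."""
    return max(children,
               key=lambda ch: uct_score(ch["reward"], ch["visits"],
                                        parent_visits, c))

# Each child is a refined prompt with accumulated reward (e.g., accuracy on a
# held-out batch) and a visit count. Values are illustrative only.
children = [
    {"prompt": "A", "reward": 3.0, "visits": 5},
    {"prompt": "B", "reward": 1.6, "visits": 2},
    {"prompt": "C", "reward": 0.0, "visits": 0},
]
chosen = select_child(children, parent_visits=7)
```

In a full search loop, the chosen prompt would be expanded with new refinements generated from error feedback, evaluated, and its reward propagated back up the tree.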

Enhanced Simultaneous Machine Translation with Word-level Policies

paper_url: http://arxiv.org/abs/2310.16417
repo_url: https://github.com/xl8-ai/wordsimt
paper_authors: Kang Kim, Hankyu Cho
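for: This study aims to improve Simultaneous Machine Translation (SiMT) and revisits a common assumption in existing work, namely that READ and WRITE operations happen at the subword level.
methods: The authors propose word-level policies, which process multiple subwords in a single step to form a complete word. They also propose boosting SiMT models with language models (LMs), where the word-level policy addresses the subword disparity between LMs and SiMT models.
results: Policies operating at the word level surpass policies devised and validated at the subword level, and the word-level policy plays a vital role in addressing the subword disparity between LMs and SiMT models. Code is available at https://github.com/xl8-ai/WordSiMT.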

Abstract
Recent years have seen remarkable advances in the field of Simultaneous Machine Translation (SiMT) due to the introduction of innovative policies that dictate whether to READ or WRITE at each step of the translation process. However, a common assumption in many existing studies is that operations are carried out at the subword level, even though the standard unit for input and output in most practical scenarios is typically at the word level. This paper demonstrates that policies devised and validated at the subword level are surpassed by those operating at the word level, which process multiple subwords to form a complete word in a single step. Additionally, we suggest a method to boost SiMT models using language models (LMs), wherein the proposed word-level policy plays a vital role in addressing the subword disparity between LMs and SiMT models. Code is available at https://github.com/xl8-ai/WordSiMT.

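Summary
Simultaneous Machine Translation (SiMT) has advanced remarkably in recent years thanks to new policies that decide whether to READ or WRITE at each step of translation. Many existing studies assume that these operations happen at the subword level, even though the standard unit of input and output in most practical scenarios is the word. This paper shows that policies devised and validated at the subword level are surpassed by word-level policies, which process multiple subwords in a single step to form a complete word. The authors also propose a way to boost SiMT models with language models (LMs), in which the word-level policy plays a vital role in addressing the subword disparity between LMs and SiMT models. Code is available at https://github.com/xl8-ai/WordSiMT.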
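A minimal sketch of what a word-level READ/WRITE schedule looks like is given below. It assumes SentencePiece-style subwords where a leading "▁" marks the start of a word, and it uses wait-k as a stand-in policy; the paper's actual policies are not specified in the abstract, and both function names are hypothetical.

```python
def group_subwords(tokens, marker="\u2581"):
    """Group subword tokens into words using a word-start marker."""
    words = []
    for tok in tokens:
        if tok.startswith(marker) or not words:
            words.append([tok])       # a new word begins
        else:
            words[-1].append(tok)     # continuation of the current word
    return words

def word_level_wait_k(num_src_words, num_tgt_words, k):
    """Wait-k at the word level: READ k whole words, then alternate.

    Each READ or WRITE covers a complete word, i.e. all of its subwords
    are consumed or emitted in a single step.
    """
    actions, read, written = [], 0, 0
    while written < num_tgt_words:
        if read < min(written + k, num_src_words):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

src = group_subwords(["\u2581trans", "lation", "\u2581is", "\u2581fun"])
schedule = word_level_wait_k(num_src_words=len(src), num_tgt_words=3, k=2)
```

Operating on the grouped words means the policy never decides to WRITE while the source is stopped in the middle of a word, which is the mismatch a subword-level policy can run into.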

Decoding Stumpers: Large Language Models vs. Human Problem-Solvers

paper_url: http://arxiv.org/abs/2310.16411
repo_url: None
paper_authors: Alon Goldstein, Miriam Havin, Roi Reichart, Ariel Goldstein
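for: This study investigates the problem-solving capabilities of large language models (LLMs) by evaluating them on stumpers, unique single-step intuition problems.
methods: Four state-of-the-art LLMs (Davinci-2, Davinci-3, GPT-3.5-Turbo, and GPT-4) are compared with human participants.
results: New-generation LLMs excel at solving stumpers and surpass human performance, whereas humans are better at verifying solutions to the same problems. The findings deepen the understanding of LLMs' cognitive abilities and offer insights for enhancing their problem-solving potential across domains.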

Abstract
This paper investigates the problem-solving capabilities of Large Language Models (LLMs) by evaluating their performance on stumpers, unique single-step intuition problems that pose challenges for human solvers but are easily verifiable. We compare the performance of four state-of-the-art LLMs (Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our findings reveal that the new-generation LLMs excel in solving stumpers and surpass human performance. However, humans exhibit superior skills in verifying solutions to the same problems. This research enhances our understanding of LLMs' cognitive abilities and provides insights for enhancing their problem-solving potential across various domains.

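Summary
This paper investigates the problem-solving capabilities of Large Language Models (LLMs) by evaluating them on stumpers: unique single-step intuition problems that are hard for human solvers but easy to verify. Four state-of-the-art LLMs (Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) are compared with human participants. New-generation LLMs excel at solving stumpers and surpass human performance, but humans show superior skill in verifying solutions to the same problems. The research deepens the understanding of LLMs' cognitive abilities and offers insights for enhancing their problem-solving potential across domains.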

Video Referring Expression Comprehension via Transformer with Content-conditioned Query

paper_url: http://arxiv.org/abs/2310.16402
repo_url: None
paper_authors: Ji Jiang, Meng Cao, Tengtao Song, Long Chen, Yi Wang, Yuexian Zou
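for: This study aims to improve how accurately a target object is localized in videos from a natural language query, the task of video Referring Expression Comprehension (REC).
methods: Recent Transformer-based methods use learnable queries, but the authors argue that this naive design is not ideal for the open-world nature of video REC, where text supervision brings numerous semantic categories. They instead create dynamic queries conditioned on both the input video and the language. A fixed number of learnable bounding boxes is placed throughout the frame, and the corresponding region features provide prior information. Because current query features overlook cross-modal alignment, specific phrases in the sentence are aligned with semantically relevant visual areas, which the authors annotate in existing video datasets (VID-Sentence and VidSTG).
results: The proposed model, ConFormer, outperforms other models on widely used benchmark datasets. On the testing split of VID-Sentence, it achieves an 8.75% absolute improvement on Accu.@0.6 over the previous state-of-the-art model.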

Abstract
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on the queried natural language. Recent improvements in video REC have been made using Transformer-based methods with learnable queries. However, we contend that this naive query design is not ideal given the open-world nature of video REC brought by text supervision. With numerous potential semantic categories, relying on only a few slow-updated queries is insufficient to characterize them. Our solution to this problem is to create dynamic queries that are conditioned on both the input video and language to model the diverse objects referred to. Specifically, we place a fixed number of learnable bounding boxes throughout the frame and use corresponding region features to provide prior information. Also, we noticed that current query features overlook the importance of cross-modal alignment. To address this, we align specific phrases in the sentence with semantically relevant visual areas, annotating them in existing video datasets (VID-Sentence and VidSTG). By incorporating these two designs, our proposed model (called ConFormer) outperforms other models on widely benchmarked datasets. For example, in the testing split of VID-Sentence dataset, ConFormer achieves 8.75% absolute improvement on Accu.@0.6 compared to the previous state-of-the-art model.

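Summary
Video Referring Expression Comprehension (REC) aims to localize a target object in a video based on a natural language query. Recent improvements use Transformer-based methods with learnable queries, but the authors argue that this naive query design is not ideal given the open-world nature of video REC: with numerous potential semantic categories, a few slowly updated queries cannot characterize them all. Their solution is to create dynamic queries conditioned on both the input video and the language. A fixed number of learnable bounding boxes is placed throughout the frame, and the corresponding region features provide prior information. Because current query features overlook cross-modal alignment, the authors align specific phrases in the sentence with semantically relevant visual areas and annotate them in existing video datasets (VID-Sentence and VidSTG). With these two designs, the proposed model, ConFormer, outperforms other models on widely used benchmarks. On the testing split of VID-Sentence, for example, it achieves an 8.75% absolute improvement on Accu.@0.6 over the previous state-of-the-art model.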
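To make the reported metric concrete, the sketch below computes a thresholded localization accuracy, assuming Accu.@0.6 denotes the fraction of predictions whose Intersection-over-Union (IoU) with the ground truth is at least 0.6. It is simplified to one box per sample (the benchmark may score spatio-temporal tubes), and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def accuracy_at(preds, golds, threshold=0.6):
    """Fraction of predictions with IoU >= threshold against the gold box."""
    hits = sum(1 for p, g in zip(preds, golds) if iou(p, g) >= threshold)
    return hits / len(golds)

preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
golds = [(0, 0, 2, 2), (1, 0, 3, 2)]   # second pair overlaps by only 1/3
acc = accuracy_at(preds, golds, threshold=0.6)
```

Under this reading, an "8.75% absolute improvement" means the hit fraction rises by 0.0875, not that it is multiplied by 1.0875.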

ZGUL: Zero-shot Generalization to Unseen Languages using Multi-source Ensembling of Language Adapters

paper_url: http://arxiv.org/abs/2310.16393
repo_url: https://github.com/dair-iitd/zgul
paper_authors: Vipul Rathore, Rajdeep Dhingra, Parag Singla, Mausam
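for: This paper tackles zero-shot cross-lingual transfer in NLP tasks.
methods: The approach uses language adapters (LAs). Earlier work mostly trains with the adapter of a single source language (often English) and tests with the target LA or the LA of a related language. Training a target LA requires unlabeled data, which may not be available for low-resource unseen languages: those that the underlying multilingual language model (e.g., mBERT) has not seen and for which no labeled or unlabeled data exists. The authors argue that effective transfer needs the LAs of multiple linguistically or geographically related source languages at both training and test time, and they investigate this with a novel neural architecture, ZGUL.
results: Across four language groups covering 15 unseen target languages, ZGUL improves by up to 3.2 average F1 points over standard fine-tuning and other strong baselines on POS tagging and NER. It continues to outperform the baselines when some unlabeled data or a few training examples are available for the target language.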

Abstract
We tackle the problem of zero-shot cross-lingual transfer in NLP tasks via the use of language adapters (LAs). Most of the earlier works have explored training with adapter of a single source (often English), and testing either using the target LA or LA of another related language. Training target LA requires unlabeled data, which may not be readily available for low resource unseen languages: those that are neither seen by the underlying multilingual language model (e.g., mBERT), nor do we have any (labeled or unlabeled) data for them. We posit that for more effective cross-lingual transfer, instead of just one source LA, we need to leverage LAs of multiple (linguistically or geographically related) source languages, both at train and test-time - which we investigate via our novel neural architecture, ZGUL. Extensive experimentation across four language groups, covering 15 unseen target languages, demonstrates improvements of up to 3.2 average F1 points over standard fine-tuning and other strong baselines on POS tagging and NER tasks. We also extend ZGUL to settings where either (1) some unlabeled data or (2) few-shot training examples are available for the target language. We find that ZGUL continues to outperform baselines in these settings too.

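Summary
The paper tackles zero-shot cross-lingual transfer in NLP tasks with language adapters (LAs). Most earlier work trains with the adapter of a single source language (often English) and tests with the adapter of the target language or of a related language. Training a target-language adapter requires unlabeled data, which may not be readily available for low-resource languages that the underlying multilingual model (e.g., mBERT) has never seen. The authors argue that effective transfer needs the adapters of multiple source languages, at both training and test time, and propose a new neural architecture, ZGUL, to investigate this idea. Experiments across four language groups, covering 15 unseen target languages, show gains of up to 3.2 average F1 points over standard fine-tuning and other strong baselines. ZGUL is also extended to settings with some unlabeled data or a few training examples for the target language, where it still outperforms the baselines.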

Transformer-based Live Update Generation for Soccer Matches from Microblog Posts

paper_url: http://arxiv.org/abs/2310.16368
repo_url: None
paper_authors: Masashi Oshika, Kosuke Yamada, Ryohei Sasano, Koichi Takeda
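for: This paper aims to generate live updates for soccer matches from tweets, so that users can follow a match's progress and excitement from raw tweets.
methods: The system is built on a large pre-trained language model and adds mechanisms to control the number of updates and to reduce redundant updates.
results: The system generates live soccer match updates from which users can quickly grasp the match's progress and excitement.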

Abstract
It has been known to be difficult to generate adequate sports updates from a sequence of vast amounts of diverse live tweets, although the live sports viewing experience with tweets is gaining the popularity. In this paper, we focus on soccer matches and work on building a system to generate live updates for soccer matches from tweets so that users can instantly grasp a match's progress and enjoy the excitement of the match from raw tweets. Our proposed system is based on a large pre-trained language model and incorporates a mechanism to control the number of updates and a mechanism to reduce the redundancy of duplicate and similar updates.
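The abstract names two mechanisms, one limiting the number of updates and one reducing redundancy, without describing how they work. The following is a minimal sketch of one way such filtering could look, assuming a simple token-level Jaccard similarity and a fixed cap. The function names, the threshold, and the sample updates are all hypothetical and are not taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_updates(candidates, max_updates=3, threshold=0.5):
    """Keep at most `max_updates` candidates, skipping near-duplicates.

    Hypothetical illustration only: a candidate is dropped when it is too
    similar to an update that has already been kept.
    """
    kept = []
    for text in candidates:
        if len(kept) >= max_updates:
            break  # cap on the number of updates
        if all(jaccard(text, prev) < threshold for prev in kept):
            kept.append(text)
    return kept


# Invented sample updates, in the order they were generated.
updates = [
    "Goal for Japan scored by Kubo",
    "Goal for Japan scored by Kubo !",
    "Yellow card shown to Silva",
    "Half time whistle",
    "Substitution for Brazil",
]
filtered = filter_updates(updates)
print(filtered)
```

A learned model would likely replace the similarity function in practice. The sketch only illustrates how a count limit and a redundancy check can be combined in a single pass.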

Summary
Generating adequate sports updates from a vast, diverse stream of live tweets is difficult, even though watching live sports alongside tweets is growing in popularity. This paper focuses on soccer matches and builds a system that generates live updates from tweets, so that users can instantly grasp a match's progress and enjoy its excitement from raw tweets. The system is based on a large pre-trained language model and incorporates one mechanism to control the number of updates and another to reduce redundancy among duplicate and similar updates.

From Simple to Complex: A Progressive Framework for Document-level Informative Argument Extraction

paper_url: http://arxiv.org/abs/2310.16358
repo_url: https://github.com/zhangyx0417/simple_to_complex
paper_authors: Quzhe Huang, Yanxi Zhang, Dongyan Zhao
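for: This paper aims to improve the accuracy of document-level Event Argument Extraction (EAE).
methods: The paper proposes a simple-to-complex progressive framework that computes the difficulty of each event and then extracts events in simple-to-complex order, so that the model can rely on more reliable results when predicting harder events.
results: On the WikiEvents dataset, the model outperforms SOTA by 1.4% in F1, indicating that the simple-to-complex framework is useful for the EAE task.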

Abstract
Document-level Event Argument Extraction (EAE) requires the model to extract arguments of multiple events from a single document. Considering the underlying dependencies between these events, recent efforts leverage the idea of "memory", where the results of already predicted events are cached and can be retrieved to help the prediction of upcoming events. These methods extract events according to their appearance order in the document, however, the event that appears in the first sentence does not mean that it is the easiest to extract. Existing methods might introduce noise to the extraction of upcoming events if they rely on an incorrect prediction of previous events. In order to provide more reliable memory, we propose a simple-to-complex progressive framework for document-level EAE. Specifically, we first calculate the difficulty of each event and then, we conduct the extraction following a simple-to-complex order. In this way, the memory will store the most certain results, and the model could use these reliable sources to help the prediction of more difficult events. Experiments on WikiEvents show that our model outperforms SOTA by 1.4% in F1, indicating the proposed simple-to-complex framework is useful in the EAE task.
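The ordering idea in the abstract can be sketched in a few lines: score each event's difficulty, process events from easiest to hardest, and let each extraction see the results cached so far. This is a minimal illustration under stated assumptions. `difficulty` and `extract` are hypothetical callables standing in for the paper's difficulty estimate and extraction model, whose actual definitions the abstract does not give.

```python
from typing import Callable, Dict, List


def extract_simple_to_complex(
    events: List[str],
    difficulty: Callable[[str], float],
    extract: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Process events from easiest to hardest, caching results in a memory.

    Each call to `extract` receives a copy of the memory built from the
    easier events handled before it.
    """
    memory: Dict[str, str] = {}
    for event in sorted(events, key=difficulty):
        memory[event] = extract(event, dict(memory))
    return memory


# Toy stand-ins (assumptions): difficulty is the trigger length, and the
# "extractor" only reports how many earlier results it could consult.
def toy_difficulty(event: str) -> float:
    return float(len(event))


def toy_extract(event: str, memory: Dict[str, str]) -> str:
    return f"{event}:seen={len(memory)}"


result = extract_simple_to_complex(
    ["explosion", "die", "arrest"], toy_difficulty, toy_extract
)
print(result)
```

Extracting in document order would correspond to replacing the `sorted` call with the original list, which is the baseline the abstract argues against.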

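Summary
Document-level Event Argument Extraction (EAE) requires a model to extract the arguments of multiple events from a single document. Because these events depend on one another, recent methods use a "memory": the results of already predicted events are cached and retrieved to help predict later events. These methods usually extract events in the order they appear in the document, but an event that appears first is not necessarily the easiest to extract, and an incorrect early prediction can introduce noise into later extractions. To provide a more reliable memory, the authors propose a simple-to-complex progressive framework for document-level EAE: they first compute the difficulty of each event and then extract events in simple-to-complex order. The memory then stores the most certain results, and the model can use these reliable sources to help predict harder events. Experiments on WikiEvents show that the model outperforms SOTA by 1.4% in F1, indicating that the framework is useful for the EAE task.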

A Multi-Modal Multilingual Benchmark for Document Image Classification

paper_url: http://arxiv.org/abs/2310.16356
repo_url: None
paper_authors: Yoshinari Fujinuma, Siddharth Varia, Nishant Sankaran, Srikar Appalaraju, Bonan Min, Yogarshi Vyas
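for: This study aims to provide more and better datasets for document image classification, to support further research on the task.
methods: The study introduces two new multilingual document image datasets, WIKI-DOC and MULTIEURLEX-DOC, and comprehensively evaluates existing Document AI models on them.
results: Experiments show that multilingual Document AI models are limited in zero-shot cross-lingual transfer, particularly across typologically distant languages, and need further improvement.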

Abstract
Document image classification is different from plain-text document classification and consists of classifying a document by understanding the content and structure of documents such as forms, emails, and other such documents. We show that the only existing dataset for this task (Lewis et al., 2006) has several limitations and we introduce two newly curated multilingual datasets WIKI-DOC and MULTIEURLEX-DOC that overcome these limitations. We further undertake a comprehensive study of popular visually-rich document understanding or Document AI models in previously untested setting in document image classification such as 1) multi-label classification, and 2) zero-shot cross-lingual transfer setup. Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages. Our datasets and findings open the door for future research into improving Document AI models.

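Summary
Document image classification differs from plain-text document classification: it requires understanding both the content and the structure of documents such as forms and emails. The authors show that the only existing dataset for this task (Lewis et al., 2006) has several limitations, and they introduce two new multilingual datasets, WIKI-DOC and MULTIEURLEX-DOC, that overcome them. They also study popular visually-rich document understanding (Document AI) models in previously untested document image classification settings, including 1) multi-label classification and 2) zero-shot cross-lingual transfer. The experiments reveal limitations of multilingual Document AI models in cross-lingual transfer across typologically distant languages. The datasets and findings open the door to future research on improving Document AI models.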

Unraveling Feature Extraction Mechanisms in Neural Networks

paper_url: http://arxiv.org/abs/2310.16350
repo_url: https://github.com/richardsun-voyager/ufemnn
paper_authors: Xiaobing Sun, Jiaxi Li, Wei Lu
for: This paper investigates the underlying mechanisms by which neural networks capture precise knowledge.
methods: The approach uses Neural Tangent Kernels (NTKs) to analyze the learning dynamics of target models.
results: The choice of activation function can affect feature extraction, and multiplication-based models excel at learning n-grams.

Abstract
The underlying mechanism of neural networks in capturing precise knowledge has been the subject of consistent research efforts. In this work, we propose a theoretical approach based on Neural Tangent Kernels (NTKs) to investigate such mechanisms. Specifically, considering the infinite network width, we hypothesize the learning dynamics of target models may intuitively unravel the features they acquire from training data, deepening our insights into their internal mechanisms. We apply our approach to several fundamental models and reveal how these models leverage statistical features during gradient descent and how they are integrated into final decisions. We also discovered that the choice of activation function can affect feature extraction. For instance, the use of the ReLU activation function could potentially introduce a bias in features, providing a plausible explanation for its replacement with alternative functions in recent pre-trained language models. Additionally, we find that while self-attention and CNN models may exhibit limitations in learning n-grams, multiplication-based models seem to excel in this area. We verify these theoretical findings through experiments and find that they can be applied to analyze language modeling tasks, which can be regarded as a special variant of classification. Our contributions offer insights into the roles and capacities of fundamental components within large language models, thereby aiding the broader understanding of these complex systems.

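Summary
How neural networks capture precise knowledge has long been an active research topic. This work proposes a theoretical approach based on Neural Tangent Kernels (NTKs) to investigate these mechanisms. Considering the infinite-width limit, the authors hypothesize that the learning dynamics of target models can reveal the features they acquire from training data, which deepens insight into their internal mechanisms. Applying the approach to several fundamental models shows how these models leverage statistical features during gradient descent and how those features are integrated into final decisions. The choice of activation function can affect feature extraction: for example, ReLU may introduce a bias in features, which offers a plausible explanation for its replacement with alternative functions in recent pre-trained language models. Self-attention and CNN models may be limited in learning n-grams, whereas multiplication-based models appear to excel at it. Experiments verify these theoretical findings and show that they apply to language modeling, which can be regarded as a special variant of classification. The contributions clarify the roles and capacities of fundamental components within large language models.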

A Comprehensive Evaluation of Constrained Text Generation for Large Language Models

paper_url: http://arxiv.org/abs/2310.16343
repo_url: None
paper_authors: Xiang Chen, Xiaojun Wan
for: This paper aims to investigate the integration of intricate constraints into neural text generation using large language models (LLMs).
methods: The study employs multiple LLMs, including ChatGPT and GPT-4, and categorizes constraints into lexical, structural, and relation-based types. The authors present various benchmarks to facilitate fair evaluation.
results: The study reveals LLMs’ capacity and deficiency to incorporate constraints, providing insights for future developments in constrained text generation.

Abstract
Advancements in natural language generation (NLG) and large language models (LLMs) have led to proficient text generation in various tasks. However, integrating intricate constraints into neural text generation, due to LLMs' opacity, remains challenging. This study investigates constrained text generation for LLMs, where predefined constraints are applied during LLM's generation process. Our research examines multiple LLMs, including ChatGPT and GPT-4, categorizing constraints into lexical, structural, and relation-based types. We also present various benchmarks to facilitate fair evaluation. The study addresses some key research questions, including the extent of LLMs' compliance with constraints. Results illuminate LLMs' capacity and deficiency to incorporate constraints and provide insights for future developments in constrained text generation. Codes and datasets will be released upon acceptance.

Summary
Recent advancements in natural language generation (NLG) and large language models (LLMs) have led to significant improvements in text generation for various tasks. However, incorporating complex constraints into neural text generation remains a challenging task due to the opacity of LLMs. This study explores constrained text generation for LLMs, where predefined constraints are applied during the LLM's generation process. Our research categorizes constraints into three types: lexical, structural, and relation-based, and examines multiple LLMs, including ChatGPT and GPT-4. We also provide various benchmarks to facilitate fair evaluation. The study aims to answer several key research questions, including the extent of LLMs' compliance with constraints. The results shed light on LLMs' capacity and limitations in incorporating constraints and provide valuable insights for future developments in constrained text generation. The codes and datasets will be released upon acceptance.

RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

paper_url: http://arxiv.org/abs/2310.16340
repo_url: None
paper_authors: Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Lunting Fan, Lingfei Wu, Qingsong Wen
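for: This paper proposes a tool-augmented autonomous agent framework for practical, privacy-aware root cause analysis (RCA) in real industrial settings.
methods: The framework combines several enhancements, including Self-Consistency for action trajectories and a suite of methods for context management, stabilization, and importing domain knowledge.
results: Experiments show that RCAgent is clearly and consistently superior to ReAct across all aspects of RCA (predicting root causes, solutions, evidence, and responsibilities). It has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.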

Abstract
Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment interaction capabilities. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage. Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools. Our framework combines a variety of enhancements, including a unique Self-Consistency for action trajectories, and a suite of methods for context management, stabilization, and importing domain knowledge. Our experiments show RCAgent's evident and consistent superiority over ReAct across all aspects of RCA -- predicting root causes, solutions, evidence, and responsibilities -- and tasks covered or uncovered by current rules, as validated by both automated metrics and human evaluations. Furthermore, RCAgent has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.

Generative Pre-training for Speech with Flow Matching

paper_url: http://arxiv.org/abs/2310.16338
repo_url: None
paper_authors: Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
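for: This paper aims to build a foundation model for speech generation tasks.
methods: The model is pre-trained on 60k hours of untranscribed speech with Flow Matching and masked conditions, then fine-tuned on task-specific data.
results: Experiments show that the pre-trained generative model, once fine-tuned, matches or surpasses existing expert models on downstream tasks such as speech enhancement, separation, and synthesis. The results suggest that a foundation model for speech generation tasks can be built with generative pre-training.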

Abstract
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
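For readers unfamiliar with Flow Matching, the sketch below shows how a training pair is typically formed in one common formulation: conditional flow matching with a straight-line probability path. It is an illustration under assumptions, not SpeechFlow's actual configuration. The abstract does not specify the path, the `sigma_min` value, or how masked conditions enter the model, and all function names here are hypothetical.

```python
from typing import List, Tuple


def flow_matching_pair(
    x0: List[float], x1: List[float], t: float, sigma_min: float = 1e-4
) -> Tuple[List[float], List[float]]:
    """Build the interpolated sample x_t and its regression target.

    x0 is a noise sample, x1 is a data sample, and t lies in [0, 1].
    Path:    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    Target:  u_t = x1 - (1 - sigma_min) * x0
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def flow_matching_loss(predicted: List[float], target: List[float]) -> float:
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


# Toy example with sigma_min = 0, so the path is plain linear interpolation.
x_t, u_t = flow_matching_pair([1.0, -1.0], [3.0, 2.0], t=0.5, sigma_min=0.0)
print(x_t, u_t)
```

In training, a network would take `x_t`, `t`, and any conditioning as input, and its output would be compared against the target with `flow_matching_loss`.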

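Summary
Generative models have drawn increasing attention in recent years for their success in tasks that require estimating and sampling a data distribution to produce high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples of where generative models have excelled. However, there is still no general-purpose generative model that models speech directly. This work takes a step in that direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, the authors pre-train a generative model named SpeechFlow on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiments show that the pre-trained model can be fine-tuned on task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. The work suggests that a foundation model for speech generation tasks can be built with generative pre-training.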

Samsung R&D Institute Philippines at WMT 2023

paper_url: http://arxiv.org/abs/2310.16322
repo_url: None
paper_authors: Jan Christian Blaise Cruz
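for: This paper describes the constrained MT systems that Samsung R&D Institute Philippines submitted to the WMT 2023 General Translation Task, covering two directions: en$\rightarrow$he and he$\rightarrow$en.
methods: The systems are trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and noisy channel reranking during online decoding.
results: On two public benchmarks, the models perform comparably to strong unconstrained baseline systems, and sometimes outperform them, despite having significantly fewer parameters.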

Abstract
In this paper, we describe the constrained MT systems submitted by Samsung R&D Institute Philippines to the WMT 2023 General Translation Task for two directions: en$\rightarrow$he and he$\rightarrow$en. Our systems comprise of Transformer-based sequence-to-sequence models that are trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and the use of noisy channel reranking during online decoding. Our models perform comparably to, and sometimes outperform, strong baseline unconstrained systems such as mBART50 M2M and NLLB 200 MoE despite having significantly fewer parameters on two public benchmarks: FLORES-200 and NTREX-128.
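Noisy channel reranking, mentioned in the abstract, rescores candidate translations by combining the direct model score log p(y|x) with a channel model score log p(x|y) and a language model score log p(y). The sketch below shows that combination in its simplest weighted form. The weights, the absence of length normalization, and the candidate scores are assumptions chosen for illustration; the abstract does not give the submission's exact scoring function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    direct_logprob: float   # log p(y | x) from the forward translation model
    channel_logprob: float  # log p(x | y) from the reverse translation model
    lm_logprob: float       # log p(y) from a target-side language model


def noisy_channel_score(
    c: Candidate, channel_weight: float = 1.0, lm_weight: float = 1.0
) -> float:
    """Weighted sum of the three log-probabilities (weights are assumptions)."""
    return (
        c.direct_logprob
        + channel_weight * c.channel_logprob
        + lm_weight * c.lm_logprob
    )


def rerank(candidates: List[Candidate], **weights: float) -> List[Candidate]:
    """Sort candidates from best to worst combined score."""
    return sorted(
        candidates, key=lambda c: noisy_channel_score(c, **weights), reverse=True
    )


# Invented scores for two beam candidates.
beam = [
    Candidate("hypothesis A", -1.0, -4.0, -3.0),
    Candidate("hypothesis B", -1.5, -2.0, -2.5),
]
best = rerank(beam)[0]
direct_only = rerank(beam, channel_weight=0.0, lm_weight=0.0)[0]
print(best.text, direct_only.text)
```

In this toy beam the direct model alone prefers hypothesis A, while the combined score prefers hypothesis B, which is the kind of reordering the technique is meant to produce.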

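Summary
This paper describes the constrained MT systems submitted by Samsung R&D Institute Philippines to the WMT 2023 General Translation Task in two directions: en$\rightarrow$he and he$\rightarrow$en. The systems are Transformer-based sequence-to-sequence models trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and noisy channel reranking during online decoding. On two public benchmarks, FLORES-200 and NTREX-128, the models perform comparably to, and sometimes outperform, strong unconstrained baselines such as mBART50 M2M and NLLB 200 MoE, despite having significantly fewer parameters.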

DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment

paper_url: http://arxiv.org/abs/2310.16319
repo_url: None
paper_authors: Yukun Zhao, Lingyong Yan, Weiwei Sun, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, Dawei Yin
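for: This study provides a large-scale dialogue quality assessment dataset (DiQAD) for automatically assessing open-domain dialogue quality.
methods: The authors establish assessment criteria based on dimensions that conform to human judgements of dialogue quality, then annotate large-scale dialogues between real users according to those criteria.
results: The study runs several experiments and reports baseline performance on DiQAD as a benchmark. The dataset is publicly released for follow-up research.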

Abstract
Dialogue assessment plays a critical role in the development of open-domain dialogue systems. Existing work are uncapable of providing an end-to-end and human-epistemic assessment dataset, while they only provide sub-metrics like coherence or the dialogues are conversed between annotators far from real user settings. In this paper, we release a large-scale dialogue quality assessment dataset (DiQAD), for automatically assessing open-domain dialogue quality. Specifically, we (1) establish the assessment criteria based on the dimensions conforming to human judgements on dialogue qualities, and (2) annotate large-scale dialogues that conversed between real users based on these annotation criteria, which contains around 100,000 dialogues. We conduct several experiments and report the performances of the baselines as the benchmark on DiQAD. The dataset is openly accessible at https://github.com/yukunZhao/Dataset_Dialogue_quality_evaluation.

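Summary
Dialogue assessment plays a critical role in the development of open-domain dialogue systems. Existing work does not provide an end-to-end, human-epistemic assessment dataset: it offers only sub-metrics such as coherence, or its dialogues are conducted between annotators in settings far from real use. This paper releases a large-scale dialogue quality assessment dataset (DiQAD) for automatically assessing open-domain dialogue quality. Specifically, the authors (1) establish assessment criteria based on dimensions that conform to human judgements of dialogue quality, and (2) annotate around 100,000 dialogues between real users according to these criteria. They run several experiments and report baseline performance as the benchmark on DiQAD. The dataset is openly accessible at https://github.com/yukunZhao/Dataset_Dialogue_quality_evaluation.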

URL-BERT: Training Webpage Representations via Social Media Engagements

paper_url: http://arxiv.org/abs/2310.16303
repo_url: None
paper_authors: Ayesha Qamar, Chetan Verma, Ahmed El-Kishky, Sumit Binnani, Sneha Mehta, Taylor Berg-Kirkpatrick
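for: This study adapts language models (LMs) to understand and represent webpage content, improving the representation of URLs that users share and engage with on social media.
methods: The study proposes a new pre-training objective that adapts LMs to understand URLs and webpages, learning URL representations from user engagement on social media.
results: By continuing to pre-train the multilingual version of BERT, the authors show experimentally that the framework improves webpage understanding on a variety of tasks and on Twitter internal and external benchmarks.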

Abstract
Understanding and representing webpages is crucial to online social networks where users may share and engage with URLs. Common language model (LM) encoders such as BERT can be used to understand and represent the textual content of webpages. However, these representations may not model thematic information of web domains and URLs or accurately capture their appeal to social media users. In this work, we introduce a new pre-training objective that can be used to adapt LMs to understand URLs and webpages. Our proposed framework consists of two steps: (1) scalable graph embeddings to learn shallow representations of URLs based on user engagement on social media and (2) a contrastive objective that aligns LM representations with the aforementioned graph-based representation. We apply our framework to the multilingual version of BERT to obtain the model URL-BERT. We experimentally demonstrate that our continued pre-training approach improves webpage understanding on a variety of tasks and Twitter internal and external benchmarks.
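Step (2) of the framework aligns LM representations with graph-based URL embeddings through a contrastive objective. The abstract does not state the exact loss, so the sketch below assumes an InfoNCE-style formulation with cosine similarity, which is one common choice. The temperature value, the function names, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    text_embs: List[Sequence[float]],
    graph_embs: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Average InfoNCE loss where text_embs[i] should match graph_embs[i].

    The other graph embeddings in the batch act as negatives.
    """
    losses = []
    for i, text in enumerate(text_embs):
        logits = [cosine(text, g) / temperature for g in graph_embs]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two URLs with 2-dimensional embeddings.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

The loss is small when each text embedding is closest to its own graph embedding and large when the pairing is swapped, which is the behavior an alignment objective of this kind relies on.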

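Summary
Understanding and representing webpages is crucial for online social networks, where users share and engage with URLs. Common language model (LM) encoders such as BERT can be used to understand and represent the textual content of webpages. However, these representations may not model the thematic information of web domains and URLs, or accurately capture their appeal to social media users. This work introduces a new pre-training objective for adapting LMs to understand URLs and webpages. The proposed framework has two steps: (1) scalable graph embeddings that learn shallow representations of URLs from user engagement on social media, and (2) a contrastive objective that aligns LM representations with those graph-based representations. Applying the framework to the multilingual version of BERT yields the model URL-BERT. Experiments show that this continued pre-training approach improves webpage understanding on a variety of tasks and on Twitter internal and external benchmarks.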

Is ChatGPT a Good Multi-Party Conversation Solver?

paper_url: http://arxiv.org/abs/2310.16301
repo_url: None
paper_authors: Chao-Hong Tan, Jia-Chen Gu, Zhen-Hua Ling
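for: This paper studies the ability of large language models (LLMs) to handle multi-party conversations (MPCs).
methods: The paper evaluates the zero-shot capabilities of the generative models ChatGPT and GPT-4 on several MPC tasks.
results: ChatGPT performs poorly on a number of MPC tasks, while GPT-4 performs considerably better. Incorporating MPC structures, covering both speaker and addressee architecture, is explored as a way to improve performance. The study provides a thorough evaluation and analysis of applying generative LLMs to MPC tasks and highlights the challenges of building more effective and robust MPC agents.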

Abstract
Large Language Models (LLMs) have emerged as influential instruments within the realm of natural language processing; nevertheless, their capacity to handle multi-party conversations (MPCs) -- a scenario marked by the presence of multiple interlocutors involved in intricate information exchanges -- remains uncharted. In this paper, we delve into the potential of generative LLMs such as ChatGPT and GPT-4 within the context of MPCs. An empirical analysis is conducted to assess the zero-shot learning capabilities of ChatGPT and GPT-4 by subjecting them to evaluation across three MPC datasets that encompass five representative tasks. The findings reveal that ChatGPT's performance on a number of evaluated MPC tasks leaves much to be desired, whilst GPT-4's results portend a promising future. Additionally, we endeavor to bolster performance through the incorporation of MPC structures, encompassing both speaker and addressee architecture. This study provides an exhaustive evaluation and analysis of applying generative LLMs to MPCs, casting a light upon the conception and creation of increasingly effective and robust MPC agents. Concurrently, this work underscores the challenges implicit in the utilization of LLMs for MPCs, such as deciphering graphical information flows and generating stylistically consistent responses.

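Summary
Large Language Models (LLMs) have become influential tools in natural language processing, yet their capacity to handle multi-party conversations (MPCs) remains unexplored. This paper examines the potential of generative LLMs such as ChatGPT and GPT-4 in MPCs. The zero-shot capabilities of both models are evaluated on three MPC datasets covering five representative tasks. ChatGPT's performance on a number of the evaluated tasks leaves much to be desired, while GPT-4's results point to a promising future. The authors also try to improve performance by incorporating MPC structures, covering both speaker and addressee architecture. The study offers a thorough evaluation and analysis of applying generative LLMs to MPCs, and it highlights the challenges involved, such as deciphering graphical information flows and generating stylistically consistent responses.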

The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining

paper_url: http://arxiv.org/abs/2310.16261
repo_url: None
paper_authors: Ting-Rui Chiang, Dani Yogatama
for: This study examines whether the better sample efficiency and generalization capability of masked language models can be attributed to the semantic similarity encoded in the pretraining data’s distributional property.
methods: The analysis uses a synthetic dataset and two real-world datasets.
results: The distributional property does improve the sample efficiency of pretrained models, but it does not fully explain their generalization capability.

Abstract
We analyze the masked language modeling pretraining objective function from the perspective of the distributional hypothesis. We investigate whether better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. Via a synthetic dataset, our analysis suggests that distributional property indeed leads to the better sample efficiency of pretrained masked language models, but does not fully explain the generalization capability. We also conduct analyses over two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and provide future research directions.

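Summary
The paper analyzes the masked language modeling pretraining objective from the perspective of the distributional hypothesis. It investigates whether the better sample efficiency and better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. On a synthetic dataset, the analysis suggests that the distributional property does lead to better sample efficiency of pretrained masked language models, but does not fully explain their generalization capability. Analyses on two real-world datasets show that the distributional property does not explain the generalization ability of pretrained natural language models either. The results illustrate how limited the current understanding of model pretraining is and point to directions for future research.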

2023-10-25

BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs’ Generation

Follow-on Question Suggestion via Voice Hints for Voice Assistants

Conditionally Combining Robot Skills using Large Language Models

Data Augmentation for Emotion Detection in Small Imbalanced Text Data

Quality > Quantity: Synthetic Corpora from Foundation Models for Closed-Domain Extractive Question Answering

How well can machine-generated texts be identified and can language models be trained to avoid identification?

Understanding Social Structures from Contemporary Literary Fiction using Character Interaction Graph – Half Century Chronology of Influential Bengali Writers

Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text Generation

Muslim-Violence Bias Persists in Debiased GPT Models

Zephyr: Direct Distillation of LM Alignment

Learning Transfers over Several Programming Languages

Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors

Divide et Impera: Multi-Transformer Architectures for Complex NLP-Tasks

Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution

Language Agnostic Code Embeddings

Detecting Pretraining Data from Large Language Models

Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

IntenDD: A Unified Contrastive Learning Approach for Intent Detection and Discovery

DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages

HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation

Disentangling Extraction and Reasoning in Multi-hop Spatial Reasoning

LLM Performance Predictors are good initializers for Architecture Search

BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?

SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning Framework for Emotion Recognition in Conversations

ChatGPT is a Potential Zero-Shot Dependency Parser

Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

On the Interplay between Fairness and Explainability

Tailoring Personality Traits in Large Language Models via Unsupervisedly-Built Personalized Lexicons

WSDMS: Debunk Fake News via Weakly Supervised Detection of Misinforming Sentences with Contextualized Social Wisdom

Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language Models

1-PAGER: One Pass Answer Generation and Evidence Retrieval

An Early Evaluation of GPT-4V(ision)

CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval

OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

CLEX: Continuous Length Extrapolation for Large Language Models

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization

Enhanced Simultaneous Machine Translation with Word-level Policies

Decoding Stumpers: Large Language Models vs. Human Problem-Solvers

Video Referring Expression Comprehension via Transformer with Content-conditioned Query

ZGUL: Zero-shot Generalization to Unseen Languages using Multi-source Ensembling of Language Adapters

Transformer-based Live Update Generation for Soccer Matches from Microblog Posts

From Simple to Complex: A Progressive Framework for Document-level Informative Argument Extraction

A Multi-Modal Multilingual Benchmark for Document Image Classification

Unraveling Feature Extraction Mechanisms in Neural Networks

A Comprehensive Evaluation of Constrained Text Generation for Large Language Models

RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

Generative Pre-training for Speech with Flow Matching

Samsung R&D Institute Philippines at WMT 2023

DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment

URL-BERT: Training Webpage Representations via Social Media Engagements

Is ChatGPT a Good Multi-Party Conversation Solver?

The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining