cs.CL - 2023-09-08

Can NLP Models ‘Identify’, ‘Distinguish’, and ‘Justify’ Questions that Don’t have a Definitive Answer?

paper_url: http://arxiv.org/abs/2309.04635
repo_url: None
paper_authors: Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan Mallina, Aryan Bhavin Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, Chitta Baral
for: investigate the ability of state-of-the-art NLP models to accurately identify and respond to questions that don’t have definitive answers.
methods: introduce a new dataset called QnotA, which consists of five categories of questions that don’t have definitive answers, and evaluate SOTA models including GPT-3 and Flan T5 on three evaluation tasks that test a system’s ability to identify, distinguish, and justify QnotA questions.
results: show that even SOTA models do not fare well on these tasks and lag considerably behind the human performance baseline, and conduct a thorough analysis that leads to several interesting findings.

Abstract
Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and a definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer. Incorrectly answering such questions certainly hampers a system's reliability and trustworthiness. Can SOTA models accurately identify such questions and provide a reasonable response? To investigate the above question, we introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers. Furthermore, for each QnotA instance, we also provide a corresponding QA instance, i.e. an alternate question that "can be" answered. With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions. Through comprehensive experiments, we show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline. We conduct a thorough analysis which further leads to several interesting findings. Overall, we believe our work and findings will encourage and facilitate further research in this important area and help develop more robust models.

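Summary
Although modern NLP systems perform well on a wide range of language understanding tasks, they mainly target questions that have a correct answer. In real-world applications, however, users often ask questions that have no definitive answer, and answering such questions incorrectly harms a system's reliability and trustworthiness. Can modern NLP models accurately identify such questions and give a reasonable response? To investigate this, we introduce the QnotA dataset, which contains five different categories of questions that have no definitive answer. For each QnotA instance we also provide a corresponding QA instance, i.e. a question that can be answered. With this data we define three evaluation tasks that test a system's ability to identify, distinguish, and justify QnotA questions. Extensive experiments show that modern models, including GPT-3 and Flan T5, perform poorly on these tasks and fall behind the human baseline. We carry out a detailed analysis that yields several interesting findings. Overall, we believe our work and findings will encourage and facilitate further research in this important area and help develop more reliable models.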

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

paper_url: http://arxiv.org/abs/2309.04564
repo_url: None
paper_authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
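for: advance the development of large language models (LLMs) by automatically selecting higher-quality subsets of noisy web-scraped pretraining data.
methods: compare scalable estimators of data quality (perplexity, Error L2-Norm, and memorization), use them to rank and prune pretraining corpora, and train LLMs on the pruned datasets.
results: the simple perplexity estimator outperforms the more computationally expensive scoring methods, and training on as little as 30% of the original training dataset improves over the no-pruning baseline.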

Abstract
Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web text. To date, efforts to prune these datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data. We perform a rigorous comparison at scale of the simple data quality estimator of perplexity, as well as more sophisticated and computationally intensive estimates of the Error L2-Norm and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. Surprisingly, we find that the simple technique of perplexity outperforms our more computationally expensive scoring methods. We improve over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets the foundation for unexplored strategies in automatically curating high quality corpora and suggests the majority of pretraining data can be removed while retaining performance.
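
The rank-and-prune step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes per-token natural-log probabilities from some reference language model are already available, and the `keep` parameter (which part of the perplexity ranking to retain) is a hypothetical knob, since the abstract does not say which subset is kept.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one document from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_perplexity(docs, keep_fraction=0.3, keep="middle"):
    """Rank documents by perplexity and keep a contiguous slice of the ranking.

    docs: list of (doc_id, token_logprobs) pairs.
    keep: hypothetical option choosing the "low", "middle", or "high" part
          of the ranking; the source does not specify which one works best.
    """
    ranked = sorted(docs, key=lambda d: perplexity(d[1]))
    n_keep = max(1, round(len(ranked) * keep_fraction))
    if keep == "low":
        start = 0
    elif keep == "high":
        start = len(ranked) - n_keep
    elif keep == "middle":
        start = (len(ranked) - n_keep) // 2
    else:
        raise ValueError(f"unknown keep mode: {keep}")
    return [doc_id for doc_id, _ in ranked[start:start + n_keep]]
```

With `keep_fraction=0.3` the function retains roughly 30% of the corpus, matching the smallest training budget mentioned in the abstract.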

Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

paper_url: http://arxiv.org/abs/2309.04561
repo_url: None
paper_authors: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool
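for: localize the object that a natural-language description refers to in a 3D scene, a task with applications ranging from autonomous indoor robotics to AR/VR.
methods: move from grounding-by-detection, which localizes with bounding boxes, to dense 3D visual grounding, i.e. referral-based 3D instance segmentation.
results: propose ConcreteNet, a dense 3D grounding network with three novel stand-alone modules that improve grounding for challenging repetitive instances (instances with distractors of the same semantic class): a bottom-up attentive fusion module, a contrastive training scheme, and a learned global camera token. ConcreteNet ranks 1st on the ScanRefer online benchmark by +9.43% accuracy at 50% IoU and won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.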

Abstract
3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network ConcreteNet, featuring three novel stand-alone modules which aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues, next we construct a contrastive training scheme to induce separation in the latent space, and finally we resolve view-dependent utterances via a learned global camera token. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark by a considerable +9.43% accuracy at 50% IoU and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.

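Summary
3D visual grounding is the task of localizing the object in a 3D scene that a natural-language description refers to. It has a wide range of applications, from autonomous indoor robotics to AR/VR, and has attracted growing attention recently. A common approach is grounding-by-detection, which localizes objects with bounding boxes. In real applications, however, a bounding box does not adequately describe an object's geometry. We therefore address dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose ConcreteNet, a dense 3D grounding network with three novel stand-alone modules that improve grounding performance. First, we introduce a bottom-up attentive fusion module to disambiguate inter-instance relational cues. Next, we build a contrastive training scheme to induce separation in the latent space. Finally, we use a learned global camera token to resolve view-dependent utterances. ConcreteNet leads the ScanRefer online benchmark by +9.43% accuracy at 50% IoU and took first place in the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.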
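
The "accuracy at 50% IoU" figure can be made concrete with a small sketch. It assumes, for illustration only, that each predicted and reference instance is given as a collection of point indices; whether the benchmark scores masks directly or boxes derived from them is not stated in the abstract, and the function names are hypothetical.

```python
def mask_iou(pred, gold):
    """Intersection over union of two instance masks given as point indices."""
    pred, gold = set(pred), set(gold)
    union = pred | gold
    if not union:
        return 0.0
    return len(pred & gold) / len(union)


def accuracy_at_iou(pairs, threshold=0.5):
    """Fraction of (pred, gold) mask pairs whose IoU reaches the threshold."""
    if not pairs:
        raise ValueError("no predictions to score")
    hits = sum(mask_iou(pred, gold) >= threshold for pred, gold in pairs)
    return hits / len(pairs)
```

A prediction counts as correct only when it overlaps the referred instance well enough, so selecting a distractor of the same semantic class scores zero for that description.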

Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

paper_url: http://arxiv.org/abs/2309.04550
repo_url: None
paper_authors: Hiba Ahsan, Denis Jered McInerney, Jisoo Kim, Christopher Potter, Geoffrey Young, Silvio Amir, Byron C. Wallace
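for: use modern large language models to retrieve and summarize evidence in Electronic Health Record (EHR) data.
methods: prompt an LLM (Flan-T5 XXL) in a zero-shot setting to infer whether a patient has or is at risk of a particular condition and, if so, to summarize the supporting evidence.
results: radiologists consistently prefer the LLM outputs to those of a standard information retrieval baseline, but LLMs are prone to hallucinating evidence; the results indicate that model confidence in its outputs might signal when this happens.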

Abstract
Unstructured Electronic Health Record (EHR) data often contains critical information complementary to imaging data that would inform radiologists' diagnoses. However, time constraints and the large volume of notes frequently associated with individual patients renders manual perusal of such data to identify relevant evidence infeasible in practice. Modern Large Language Models (LLMs) provide a flexible means of interacting with unstructured EHR data, and may provide a mechanism to efficiently retrieve and summarize unstructured evidence relevant to a given query. In this work, we propose and evaluate an LLM (Flan-T5 XXL) for this purpose. Specifically, in a zero-shot setting we task the LLM to infer whether a patient has or is at risk of a particular condition; if so, we prompt the model to summarize the supporting evidence. Enlisting radiologists for manual evaluation, we find that this LLM-based approach provides outputs consistently preferred to a standard information retrieval baseline, but we also highlight the key outstanding challenge: LLMs are prone to hallucinating evidence. However, we provide results indicating that model confidence in outputs might indicate when LLMs are hallucinating, potentially providing a means to address this.

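Summary
Unstructured Electronic Health Record (EHR) data often contains key information relevant to diagnosis, but time constraints and the large volume of notes per patient make manually reading this data to find relevant evidence infeasible in practice. Modern large language models (LLMs) offer a flexible way to interact with unstructured EHR data and may make it possible to efficiently retrieve and summarize relevant evidence. We propose and evaluate Flan-T5 XXL for this purpose. Specifically, in a zero-shot setting we ask the model to judge whether a patient has or is at risk of a condition and, if so, to summarize the supporting evidence. Using manual evaluation by radiologists, we find that outputs of this LLM-based approach are consistently preferred to those of a standard information retrieval baseline, but we also highlight a key challenge: LLMs are prone to hallucinating evidence. However, we provide results indicating that model confidence in its outputs might indicate when an LLM is hallucinating.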

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

paper_url: http://arxiv.org/abs/2309.04461
repo_url: https://github.com/yangyi-chen/cotconsistency
paper_authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran
for: explore the ability of vision-language models (VLMs) to demonstrate human-like reasoning based on perceived information, and evaluate their reasoning consistency with a chain-of-thought (CoT) based consistency measure.
methods: propose an LLM-Human-in-the-Loop pipeline to reduce the cost of building a benchmark for reasoning consistency, and build the CURE benchmark to measure zero-shot reasoning performance and consistency. Also propose a two-stage training framework, combining supervised fine-tuning with feedback from LLMs, to improve VLMs' reasoning performance and consistency.
results: even the best-performing VLM is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are needed before VLMs can perform visual reasoning as systematically and consistently as humans. Experiments highlight the effectiveness of the two-stage training framework.

Abstract
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing a LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency.

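Summary
Vision-language models (VLMs) have recently shown strong ability as visual assistants that understand natural-language questions and generate human-like outputs. In this work, we explore whether these models can reason in a human-like way based on the perceived information. To address the concern of whether their reasoning is fully consistent and grounded, we also measure their reasoning consistency by proposing a chain-of-thought (CoT) based consistency measure. Such an evaluation requires a benchmark that covers both high-level inference and detailed reasoning chains, which is costly. We tackle this by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while ensuring a high-quality dataset. Based on this pipeline and an existing coarse-grained annotated dataset, we build the CURE benchmark to measure the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model cannot demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required before VLMs can perform visual reasoning as systematically and consistently as humans. To improve both reasoning performance and consistency, we propose a two-stage training framework. The first stage applies supervised fine-tuning to VLMs using step-by-step reasoning samples automatically generated by LLMs. The second stage incorporates feedback from LLMs to produce reasoning chains that are highly consistent and grounded. Experiments demonstrate the effectiveness of our framework in both reasoning performance and consistency.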

CSPRD: A Financial Policy Retrieval Dataset for Chinese Stock Market

paper_url: http://arxiv.org/abs/2309.04389
repo_url: https://github.com/noewangjy/csprd_dataset
paper_authors: Jinyuan Wang, Hai Zhao, Zhong Wang, Zeyang Zhu, Jinhao Xie, Yong Yu, Yongjian Fei, Yue Huang, Dawei Cheng
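for: address domain-specific retrieval within dense passage retrieval by proposing a new task, policy retrieval.
methods: introduce the Chinese Stock Policy Retrieval Dataset (CSPRD), which contains 700+ prospectus passages labeled by experienced experts with relevant articles from a Chinese policy corpus of 10k+ entries.
results: experiments with lexical, embedding, and fine-tuned bi-encoder models show that CSPRD is effective while leaving ample room for improvement. The best baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10, and 80.6% Precision@10 on the dev set.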

Abstract
In recent years, great advances in pre-trained language models (PLMs) have sparked considerable research focus and achieved promising performance on the approach of dense passage retrieval, which aims at retrieving relevant passages from a massive corpus for given questions. However, most existing datasets mainly benchmark the models with factoid queries of general commonsense, while specialised fields such as finance and economics remain unexplored due to the deficiency of large-scale and high-quality datasets with expert annotations. In this work, we propose a new task, policy retrieval, by introducing the Chinese Stock Policy Retrieval Dataset (CSPRD), which provides 700+ prospectus passages labeled by experienced experts with relevant articles from 10k+ entries in our collected Chinese policy corpus. Experiments on lexical, embedding and fine-tuned bi-encoder models show the effectiveness of our proposed CSPRD yet also suggest ample potential for improvement. Our best performing baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10 and 80.6% Precision@10 on the dev set.

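Summary
In recent years, advances in pre-trained language models (PLMs) have drawn wide research attention and achieved promising results on dense passage retrieval, i.e. retrieving relevant passages from a large corpus. However, existing datasets mainly benchmark models with factoid queries of general commonsense, while specialised fields such as finance and economics still lack large-scale, high-quality datasets with expert annotations. In this work, we propose a new task, policy retrieval, by introducing the Chinese Stock Policy Retrieval Dataset (CSPRD), which contains 700+ prospectus passages labeled by experienced experts with relevant articles from a collected Chinese policy corpus of 10k+ entries. Experiments show that the proposed CSPRD is effective and also indicate room for further improvement. Our best baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10, and 80.6% Precision@10 on the dev set.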
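
For readers unfamiliar with the reported metrics, the sketch below shows common textbook definitions of reciprocal rank, Recall@k, and Precision@k for a single query. These are assumptions about the definitions; the paper's exact evaluation script may differ, and the function names are hypothetical. MRR@10 is the mean of the reciprocal rank over all queries.

```python
def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1/rank of the first relevant item within the top k, or 0.0 if none."""
    relevant = set(relevant)
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=10):
    """Share of all relevant items that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k=10):
    """Share of the top k slots filled by relevant items."""
    return len(set(ranked[:k]) & set(relevant)) / k


def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

Under these definitions a passage with many relevant policy articles can show high precision and low recall at the same cutoff, which is the pattern in the reported numbers.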

MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers

paper_url: http://arxiv.org/abs/2309.04372
repo_url: None
paper_authors: Sijia Li, Chen Chen, Haonan Lu
for: The research aims to propose a method based on diffusion models for text-guided image generation, in order to achieve comprehensive zero-shot capabilities for open-domain image manipulation tasks.
methods: The method uses mixture-of-expert (MOE) controllers to align the text-guided capacity of diffusion models with different types of human instructions, enabling the model to handle various open-domain image manipulation tasks.
results: Extensive experiments demonstrate that the approach performs surprisingly well on various image manipulation tasks, covering both global and local editing, when dealing with open-domain images and arbitrary human instructions.

Abstract
Diffusion-model-based text-guided image generation has recently made astounding progress, producing fascinating results in open-domain image manipulation tasks. Few models, however, currently have complete zero-shot capabilities for both global and local image editing due to the complexity and diversity of image manipulation tasks. In this work, we propose a method with mixture-of-expert (MOE) controllers to align the text-guided capacity of diffusion models with different kinds of human instructions, enabling our model to handle various open-domain image manipulation tasks with natural language instructions. First, we use large language models (ChatGPT) and conditional image synthesis models (ControlNet) to generate a large global image transfer dataset in addition to the instruction-based local image editing dataset. Then, using an MOE technique and task-specific adaptation training on a large-scale dataset, our conditional diffusion model can edit images globally and locally. Extensive experiments demonstrate that our approach performs surprisingly well on various image manipulation tasks when dealing with open-domain images and arbitrary human instructions. Please refer to our project page: https://oppo-mente-lab.github.io/moe_controller/

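Summary
Diffusion-model-based text-guided image generation has advanced rapidly in recent years and produced striking results in open-domain image manipulation tasks. However, only a few models currently have complete zero-shot capabilities for both global and local image editing. In this work, we propose mixture-of-expert (MOE) controllers that align the text-guided capacity of diffusion models with different kinds of human instructions, so that the model can handle various open-domain image manipulation tasks. First, we use a large language model (ChatGPT) and a conditional image synthesis model (ControlNet) to generate a large global image transfer dataset, in addition to an instruction-based local image editing dataset. Then, using an MOE technique and task-specific adaptation training, our conditional diffusion model can edit images both globally and locally. Extensive experiments show that our approach performs well on open-domain image manipulation tasks. Please refer to our project page: https://oppo-mente-lab.github.io/moe_controller/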

Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens

paper_url: http://arxiv.org/abs/2309.04333
repo_url: https://github.com/ronaldseoh/multi2spe
paper_authors: Ronald Seoh, Haw-Shiuan Chang, Andrew McCallum
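for: tasks on scientific documents that span multiple scientific domains, such as topic classification and citation prediction.
methods: use multiple CLS tokens so that a Transformer can better specialize to multiple scientific domains. The proposed Multi2SPE encourages each CLS token to learn a different way of aggregating token embeddings, then sums them into a single vector representation.
results: on the new multi-domain benchmark Multi-SciDocs, Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction while requiring only a negligible amount of computation in addition to one BERT forward pass.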

Abstract
Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could make a Transformer better specialize to multiple scientific domains. We present Multi2SPE: it encourages each of multiple CLS tokens to learn diverse ways of aggregating token embeddings, then sums them up together to create a single vector representation. We also propose our new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders under multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only a negligible amount of computation in addition to one BERT forward pass.

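Summary
Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Such tasks typically represent the text with a vector obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could make a Transformer better specialize to multiple scientific domains. We present Multi2SPE: it encourages each of the multiple CLS tokens to learn a different way of aggregating token embeddings, then sums them to create a single vector representation. We also propose a new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders in multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only negligible computation in addition to one BERT forward pass.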
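
The final aggregation step can be illustrated with a very small sketch. It assumes the encoder has already produced one output vector per CLS token and only shows the summation into a single document vector; how each CLS token is encouraged to aggregate differently is part of the model and is not reproduced here. The function name is hypothetical.

```python
def sum_cls_embeddings(cls_vectors):
    """Sum K CLS output vectors (lists of floats) into one document vector."""
    if not cls_vectors:
        raise ValueError("need at least one CLS vector")
    dim = len(cls_vectors[0])
    if any(len(vec) != dim for vec in cls_vectors):
        raise ValueError("all CLS vectors must have the same dimension")
    # zip(*...) groups the i-th component of every CLS vector together.
    return [sum(components) for components in zip(*cls_vectors)]
```

Because the sum is a single elementwise pass over K vectors, it is consistent with the abstract's claim that the extra cost beyond one forward pass is negligible.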

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

paper_url: http://arxiv.org/abs/2309.04269
repo_url: None
paper_authors: Griffin Adams, Alexander Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad
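for: improve the quality of summaries generated by GPT-4 with respect to informativeness and readability.
methods: use a "Chain of Density" (CoD) prompt to obtain more abstractive and fused summaries. GPT-4 first generates an entity-sparse summary, then iteratively adds missing salient entities without increasing the length.
results: in a human preference study, people prefer GPT-4 summaries that are denser than those generated by a vanilla prompt and almost as dense as human-written summaries. Qualitative analysis indicates a tradeoff between informativeness and readability.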

Abstract
Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have less of a lead bias than GPT-4 summaries generated by a vanilla prompt. We conduct a human preference study on 100 CNN DailyMail articles and find that humans prefer GPT-4 summaries that are more dense than those generated by a vanilla prompt and almost as dense as human written summaries. Qualitative analysis supports the notion that there exists a tradeoff between informativeness and readability. 500 annotated CoD summaries, as well as an extra 5,000 unannotated summaries, are freely available on HuggingFace (https://huggingface.co/datasets/griffin/chain_of_density).

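Summary
Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed without being overly dense and hard to follow. To better understand this tradeoff, we use a "Chain of Density" (CoD) prompt to solicit GPT-4 summaries of increasing density. Specifically, GPT-4 first generates an entity-sparse summary and then iteratively adds missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have less lead bias. In a human preference study on 100 CNN DailyMail articles, we find that people prefer GPT-4 summaries that are denser than those generated by a vanilla prompt and almost as dense as human-written summaries. Qualitative analysis supports the notion that there is a tradeoff between informativeness and readability. We release 500 annotated CoD summaries and 5,000 unannotated summaries on HuggingFace (https://huggingface.co/datasets/griffin/chain_of_density).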
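
The iterative densification described above can be sketched as a loop. This is an illustration under stated assumptions, not the authors' prompt: `generate` is a hypothetical callable standing in for any text-generation model, the prompt wording is invented, and the number of steps is arbitrary because the abstract does not give it. The abstract also does not say whether the iterations run as separate calls or inside a single prompt; separate calls are used here for clarity.

```python
def chain_of_density(article, generate, steps=3):
    """Return a list of summaries, each meant to be denser than the last.

    generate: hypothetical callable mapping a prompt string to model text.
    steps: total number of summaries to produce (an assumption).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Step 1: an initial summary that names few specific entities.
    summary = generate(
        "Write a short summary of the article that names few specific "
        "entities.\n\nArticle:\n" + article
    )
    summaries = [summary]
    # Later steps: fold in missing salient entities at the same length.
    for _ in range(steps - 1):
        prompt = (
            "Rewrite the summary so that it also covers salient entities "
            "from the article that it is missing, without making it longer."
            f"\n\nArticle:\n{article}\n\nCurrent summary:\n{summary}"
        )
        summary = generate(prompt)
        summaries.append(summary)
    return summaries
```

Returning every intermediate summary, rather than only the last, makes it possible to compare densities, which is what the human preference study does.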

The CALLA Dataset: Probing LLMs’ Interactive Knowledge Acquisition from Chinese Medical Literature

paper_url: http://arxiv.org/abs/2309.04198
repo_url: https://github.com/scir-hi/huatuo-llama-med-chinese
paper_authors: Yanrui Du, Sendong Zhao, Muzhen Cai, Jianyu Chen, Haochun Wang, Yuhan Chen, Haoqiang Guo, Bing Qin
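for: study the application of large language models (LLMs) to the medical domain, in particular how Instruction Fine-Tuning (IFT) data, so far built from medical knowledge graphs, enriches the interactive medical knowledge of LLMs.
methods: use Chinese medical literature as a rich source of medical knowledge and assess LLMs' proficiency through a free-dialogue fact-checking task. The authors identify a "fact-following response" phenomenon, in which LLMs tend to affirm facts mentioned in questions and are reluctant to challenge them. To eliminate the inaccurate evaluation this causes, they artificially construct test data from two perspectives: one consistent with the fact and one inconsistent with it.
results: IFT data highly correlated with the medical literature corpus serves as a potent catalyst for LLMs, enabling them to use the medical knowledge acquired during pre-training effectively in interactive scenarios. The authors also propose a framework for automatically constructing IFT data and discuss some real-world applications.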

Abstract
The application of Large Language Models (LLMs) to the medical domain has stimulated the interest of researchers. Recent studies have focused on constructing Instruction Fine-Tuning (IFT) data through medical knowledge graphs to enrich the interactive medical knowledge of LLMs. However, the medical literature serving as a rich source of medical knowledge remains unexplored. Our work introduces the CALLA dataset to probe LLMs' interactive knowledge acquisition from Chinese medical literature. It assesses the proficiency of LLMs in mastering medical knowledge through a free-dialogue fact-checking task. We identify a phenomenon called the "fact-following response", where LLMs tend to affirm facts mentioned in questions and display a reluctance to challenge them. To eliminate the inaccurate evaluation caused by this phenomenon, for the golden fact, we artificially construct test data from two perspectives: one consistent with the fact and one inconsistent with the fact. Drawing from the probing experiment on the CALLA dataset, we conclude that IFT data highly correlated with the medical literature corpus serves as a potent catalyst for LLMs, enabling them to skillfully employ the medical knowledge acquired during the pre-training phase within interactive scenarios, enhancing accuracy. Furthermore, we design a framework for automatically constructing IFT data based on medical literature and discuss some real-world applications.

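Summary
The application of large language models (LLMs) to the medical domain has attracted researchers' attention. Recent work has focused on building Instruction Fine-Tuning (IFT) data from medical knowledge graphs to enrich the interactive medical knowledge of LLMs. However, the medical literature, a rich source of medical knowledge, remains unexplored. Our work introduces the CALLA dataset to assess LLMs' ability to acquire interactive knowledge from Chinese medical literature. We identify a phenomenon called the "fact-following response", in which LLMs tend to affirm facts mentioned in questions and are reluctant to challenge them. To eliminate the inaccurate evaluation this causes, we artificially construct test data from two perspectives: one consistent with the fact and one inconsistent with it. From the probing experiments on CALLA, we conclude that IFT data highly correlated with the medical literature corpus is a potent catalyst for LLMs, enabling them to accurately apply the medical knowledge acquired during pre-training in interactive scenarios. We also design a framework for automatically constructing IFT data from medical literature and discuss some real-world applications.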

GLS-CSC: A Simple but Effective Strategy to Mitigate Chinese STM Models’ Over-Reliance on Superficial Clue

paper_url: http://arxiv.org/abs/2309.04162
repo_url: None
paper_authors: Yanrui Du, Sendong Zhao, Yuhan Chen, Rai Bai, Jing Liu, Hua Wu, Haifeng Wang, Bing Qin
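for: examine the over-reliance of Chinese Short Text Matching (STM) models on superficial clues, in order to improve their robustness and generalization.
methods: propose a new resampling training strategy, Gradually Learn Samples Containing Superficial Clue (GLS-CSC), to reduce this over-reliance.
results: comprehensive evaluation on In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets shows that GLS-CSC improves the robustness and generalization of Chinese STM models more than existing methods. An analysis of existing methods also reveals what they have in common.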

Abstract
Pre-trained models have achieved success in Chinese Short Text Matching (STM) tasks, but they often rely on superficial clues, leading to a lack of robust predictions. To address this issue, it is crucial to analyze and mitigate the influence of superficial clues on STM models. Our study aims to investigate their over-reliance on the edit distance feature, commonly used to measure the semantic similarity of Chinese text pairs, which can be considered a superficial clue. To mitigate STM models' over-reliance on superficial clues, we propose a novel resampling training strategy called Gradually Learn Samples Containing Superficial Clue (GLS-CSC). Through comprehensive evaluations of In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets, we demonstrate that GLS-CSC outperforms existing methods in terms of enhancing the robustness and generalization of Chinese STM models. Moreover, we conduct a detailed analysis of existing methods and reveal their commonality.

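Summary
Pre-trained models have achieved success in Chinese Short Text Matching (STM) tasks, but they often rely on superficial clues, which makes their predictions less robust. To address this, we need to analyze and mitigate the influence of superficial clues on STM models. Our study investigates STM models' over-reliance on the edit distance feature, which can be considered a superficial clue. To reduce this reliance, we propose a new resampling training strategy called Gradually Learn Samples Containing Superficial Clue (GLS-CSC). Through comprehensive evaluation on In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets, we show that GLS-CSC performs well at improving the robustness and generalization of Chinese STM models. We also conduct a detailed analysis of existing methods and reveal what they have in common.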
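
To see why edit distance is a superficial clue, it helps to compute it. The sketch below is a standard Levenshtein distance (insertions, deletions, and substitutions each cost 1); the abstract does not specify which edit-distance variant the authors use, so this choice is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

For example, "我喜欢你" and "我不喜欢你" differ by a single inserted character, so their edit distance is 1 even though their meanings are opposite. A model that treats a small edit distance as evidence of a match would get such pairs wrong.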

Cross-Utterance Conditioned VAE for Speech Generation

paper_url: http://arxiv.org/abs/2309.04156
repo_url: None
paper_authors: Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun
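for: improve the naturalness and expressiveness of generated speech, particularly for multimedia production.
methods: the Cross-Utterance Conditioned Variational Autoencoder (CUC-VAE) framework, built on pre-trained language models and variational autoencoders (VAEs), extracts features from surrounding sentences to generate context-sensitive prosodic features that more closely emulate human prosody generation.
results: on the LibriTTS datasets, the proposed models significantly improve speech synthesis and editing, producing more natural and expressive speech.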

Abstract
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.

Summary
Neural-network-based speech synthesis systems show promise for multimedia production but often struggle with expressive speech and seamless editing. In response, we propose the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. The framework uses the strong representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). Its core component is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. CUC-VAE TTS is a direct application of the framework and generates audio with contextual prosody. CUC-VAE SE uses real mel spectrogram sampling conditioned on contextual information to produce audio that closely mirrors real sound, enabling flexible text-based speech editing such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets show that the proposed models significantly improve speech synthesis and editing, producing more natural and expressive speech.

RST-style Discourse Parsing Guided by Document-level Content Structures

paper_url: http://arxiv.org/abs/2309.04141
repo_url: None
paper_authors: Ming Li, Ruihong Huang
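for: study Rhetorical Structure Theory based Discourse Parsing (RST-DP), which explores how clauses, sentences, and large text spans compose a whole discourse and represents the rhetorical structure as a hierarchical tree.
methods: propose a new RST parsing pipeline that uses structure-aware news content sentence representations, carrying high-level content-related information, derived from the News Discourse Profiling task. The pipeline adds only a few extra layers.
results: incorporating high-level content-related information into the RST parsing pipeline improves parsing, with promising performance across various RST parsing metrics.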

Abstract
Rhetorical Structure Theory based Discourse Parsing (RST-DP) explores how clauses, sentences, and large text spans compose a whole discourse and presents the rhetorical structure as a hierarchical tree. Existing RST parsing pipelines construct rhetorical structures without the knowledge of document-level content structures, which causes relatively low performance when predicting the discourse relations for large text spans. Recognizing the value of high-level content-related information in facilitating discourse relation recognition, we propose a novel pipeline for RST-DP that incorporates structure-aware news content sentence representations derived from the task of News Discourse Profiling. By incorporating only a few additional layers, this enhanced pipeline exhibits promising performance across various RST parsing metrics.

Down the Toxicity Rabbit Hole: Investigating PaLM 2 Guardrails

paper_url: http://arxiv.org/abs/2309.06415
repo_url: None
paper_authors: Adel Khorramrouz, Sujan Dutta, Arka Dutta, Ashiqur R. KhudaBukhsh
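for: conduct a robustness audit of the safety feedback of PaLM 2 through a novel toxicity rabbit hole framework.
methods: start from a stereotype, then repeatedly instruct PaLM 2 to generate content more toxic than the previous iteration until the PaLM 2 safety guardrails throw a safety violation.
results: uncover highly disturbing generated content, including antisemitic, Islamophobic, racist, homophobic, and misogynistic content, that the PaLM 2 safety guardrails do not evaluate as highly unsafe.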

Abstract
This paper conducts a robustness audit of the safety feedback of PaLM 2 through a novel toxicity rabbit hole framework introduced here. Starting with a stereotype, the framework instructs PaLM 2 to generate more toxic content than the stereotype. Every subsequent iteration it continues instructing PaLM 2 to generate more toxic content than the previous iteration until PaLM 2 safety guardrails throw a safety violation. Our experiments uncover highly disturbing antisemitic, Islamophobic, racist, homophobic, and misogynistic (to list a few) generated content that PaLM 2 safety guardrails do not evaluate as highly unsafe.

Meta predictive learning model of natural languages

paper_url: http://arxiv.org/abs/2309.04106
repo_url: https://github.com/qjbtiger/meta-predictive-coding
paper_authors: Chan Li, Junbin Qiu, Haiping Huang
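for: investigate the connection between brain computation and the artificial self-supervision used in large language models, and the role of predictive coding and its associated credit assignment in language processing.
methods: propose a mean-field learning model within the predictive coding framework, assuming that the synaptic weight of each connection follows a spike and slab distribution and that only the distribution is trained.
results: validate the model on classifying handwritten digits and on toy and real language corpora; most connections become deterministic after learning, while the output connections keep a higher level of variability. The performance of the resulting network ensemble changes continuously with data load and improves further with more training data.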

Abstract
Large language models based on self-attention mechanisms have achieved astonishing performances not only in natural language itself, but also in a variety of tasks of different nature. However, regarding processing language, our human brain may not operate using the same principle. Then, a debate is established on the connection between brain computation and artificial self-supervision adopted in large language models. One of the most influential hypotheses in brain computation is the predictive coding framework, which proposes to minimize the prediction error by local learning. However, the role of predictive coding and the associated credit assignment in language processing remains unknown. Here, we propose a mean-field learning model within the predictive coding framework, assuming that the synaptic weight of each connection follows a spike and slab distribution, and only the distribution is trained. This meta predictive learning is successfully validated on classifying handwritten digits where pixels are input to the network in sequence, and on the toy and real language corpus. Our model reveals that most of the connections become deterministic after learning, while the output connections have a higher level of variability. The performance of the resulting network ensemble changes continuously with data load, further improving with more training data, in analogy with the emergent behavior of large language models. Therefore, our model provides a starting point to investigate the physics and biology correspondences of the language processing and the unexpected general intelligence.
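To make the spike and slab assumption concrete, the sketch below models a single synaptic weight that is exactly zero with probability pi (the spike) and otherwise Gaussian with mean m and variance xi (the slab). This parameterization and the names `SpikeSlab`, `pi`, `m`, `xi`, and `is_deterministic` are assumptions for illustration and may differ from the paper's notation; the code shows only the distribution's moments and sampling, not the authors' training procedure.

```python
import random
from dataclasses import dataclass


@dataclass
class SpikeSlab:
    """Spike and slab weight: zero with probability pi, else drawn from N(m, xi)."""
    pi: float   # spike probability (weight is exactly 0)
    m: float    # slab mean
    xi: float   # slab variance

    def mean(self) -> float:
        # E[w] = (1 - pi) * m, since the spike contributes zero.
        return (1.0 - self.pi) * self.m

    def variance(self) -> float:
        # Var[w] = E[w^2] - E[w]^2 with E[w^2] = (1 - pi) * (xi + m^2).
        second_moment = (1.0 - self.pi) * (self.xi + self.m ** 2)
        return second_moment - self.mean() ** 2

    def sample(self, rng: random.Random) -> float:
        """Draw one concrete weight, i.e. one member of the network ensemble."""
        if rng.random() < self.pi:
            return 0.0
        return rng.gauss(self.m, self.xi ** 0.5)


def is_deterministic(w: SpikeSlab, tol: float = 1e-8) -> bool:
    """A connection is effectively deterministic when its variance vanishes."""
    return w.variance() < tol
```

Under this parameterization a connection becomes deterministic in two ways: pi reaches 1 (the weight is always zero), or pi reaches 0 while xi shrinks to 0 (the weight is fixed at m). Sampling every weight once gives one network from the ensemble the abstract refers to.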

Unsupervised Multi-document Summarization with Holistic Inference

paper_url: http://arxiv.org/abs/2309.04087
repo_url: None
paper_authors: Haopeng Zhang, Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Hongwei Wang, Jiawei Zhang, Dong Yu
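for: propose a new holistic framework for unsupervised multi-document extractive summarization.
methods: combine a holistic beam search inference method with a holistic measure, the Subset Representative Index (SRI), which balances the importance and diversity of a subset of sentences from the source documents and can be calculated in unsupervised and adaptive manners.
results: on both small and large-scale multi-document summarization datasets, the method outperforms strong baselines by a significant margin in ROUGE scores and diversity measures; the findings also suggest that diversity is essential for improving multi-document summarization performance.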

Abstract
Multi-document summarization aims to obtain core information from a collection of documents written on the same topic. This paper proposes a new holistic framework for unsupervised multi-document extractive summarization. Our method incorporates the holistic beam search inference method associated with the holistic measurements, named Subset Representative Index (SRI). SRI balances the importance and diversity of a subset of sentences from the source documents and can be calculated in unsupervised and adaptive manners. To demonstrate the effectiveness of our method, we conduct extensive experiments on both small and large-scale multi-document summarization datasets under both unsupervised and adaptive settings. The proposed method outperforms strong baselines by a significant margin, as indicated by the resulting ROUGE scores and diversity measures. Our findings also suggest that diversity is essential for improving multi-document summary performance.
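The idea of scoring a whole subset of sentences, rather than ranking sentences one at a time, can be sketched as follows. The abstract does not give the SRI formula, so `subset_score` below is a stand-in (total importance minus pairwise redundancy) and not the paper's measure; the function names, the `trade_off` parameter, and the toy importance and similarity values are likewise assumptions for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence


def subset_score(
    subset: FrozenSet[int],
    importance: Sequence[float],
    similarity: Sequence[Sequence[float]],
    trade_off: float = 1.0,
) -> float:
    """Stand-in holistic score: total importance minus pairwise redundancy."""
    gain = sum(importance[i] for i in subset)
    redundancy = sum(similarity[i][j] for i, j in combinations(sorted(subset), 2))
    return gain - trade_off * redundancy


def beam_search_summary(
    importance: Sequence[float],
    similarity: Sequence[Sequence[float]],
    summary_size: int,
    beam_width: int = 3,
    trade_off: float = 1.0,
) -> List[int]:
    """Grow candidate subsets one sentence at a time, keeping the best few."""
    n = len(importance)
    beam: List[FrozenSet[int]] = [frozenset()]
    for _ in range(min(summary_size, n)):
        # Extend every subset in the beam by each sentence it does not contain yet.
        candidates = {
            subset | {i} for subset in beam for i in range(n) if i not in subset
        }
        # Rank the extended subsets as wholes and keep the top beam_width.
        beam = sorted(
            candidates,
            key=lambda s: subset_score(s, importance, similarity, trade_off),
            reverse=True,
        )[:beam_width]
    return sorted(beam[0])


# Toy input: sentences 0 and 1 are both important but nearly identical.
IMPORTANCE = [0.9, 0.85, 0.5, 0.4]
SIMILARITY = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
```

With this toy input, `beam_search_summary(IMPORTANCE, SIMILARITY, summary_size=2)` selects sentences 0 and 2 instead of the two most important but redundant sentences 0 and 1, which illustrates why a subset-level measure rewards diversity.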
