cs.CL - 2023-10-24

GlotLID: Language Identification for Low-Resource Languages

  • paper_url: http://arxiv.org/abs/2310.16248
  • repo_url: https://github.com/cisnlp/glotsparse
  • paper_authors: Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze
  • for: This work aims to provide a reliable and efficient language identification (LID) model that broadens access to NLP technology for low-resource languages and cultures.
  • methods: The authors build GlotLID-M, an LID model covering 1665 languages, and analyze the unique challenges of low-resource LID, such as incorrect corpus metadata and leakage from high-resource languages (a query sketch follows this entry).
  • results: GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR), and the analysis highlights the particular difficulties of low-resource LID, including metadata errors, language leakage, separating closely related languages, and handling macrolanguages vs. varieties.
    Abstract Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.
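To picture how such an LID model is queried: assuming the released GlotLID-M checkpoint follows the standard fastText format (the file name `model.bin` and label format below are illustrative assumptions, not taken from the paper), a minimal lookup could look like this sketch.

```python
# Minimal sketch: querying a fastText-format LID model such as GlotLID-M.
# Assumes the downloaded checkpoint is a standard fastText binary.
import fasttext

model = fasttext.load_model("model.bin")  # path to the GlotLID-M weights (assumed)

def identify(text: str, k: int = 3):
    """Return the top-k (language label, confidence) pairs for one line of text."""
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    return list(zip(labels, probs))

print(identify("Dies ist ein kurzer Satz."))
```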

ZzzGPT: An Interactive GPT Approach to Enhance Sleep Quality

  • paper_url: http://arxiv.org/abs/2310.16242
  • repo_url: https://github.com/marwahalaofi/ubicomp23-student-challenge
  • paper_authors: Yonchanok Khaokaew, Thuc Hanh Nguyen, Kaixin Ji, Hiruni Kegalle, Marwah Alaofi
  • for: This paper aims to provide accurate sleep predictions with actionable feedback, improving users' sleep quality.
  • methods: The paper introduces a two-stage framework that combines Large Language Models (LLMs) with user-centric design to pair accurate sleep predictions with actionable feedback.
  • results: Using the GLOBEM dataset and synthetic data generated by LLMs, the study shows improved prediction accuracy with models such as XGBoost (a minimal training sketch follows this entry).
    Abstract In today's world, sleep quality is pivotal for overall well-being. While wearable sensors offer real-time monitoring, they often lack actionable insights, leading to user abandonment. This paper delves into the role of technology in understanding sleep patterns. We introduce a two-stage framework, utilizing Large Language Models (LLMs), aiming to provide accurate sleep predictions with actionable feedback. Leveraging the GLOBEM dataset and synthetic data from LLMs, we highlight enhanced results with models like XGBoost. Our approach merges advanced machine learning with user-centric design, blending scientific accuracy with practicality.
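The predictive stage can be pictured with a small gradient-boosting example. This is a minimal sketch under assumed inputs: the feature set and the binary sleep-quality label are illustrative, not the paper's exact setup.

```python
# Sketch: training a gradient-boosted model on sensor-derived features to
# predict next-day sleep quality. Features and labels here are synthetic
# stand-ins for GLOBEM-style wearable data.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                   # e.g., steps, heart rate, screen time, ...
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # stand-in label: 1 = good sleep

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```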

Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models

  • paper_url: http://arxiv.org/abs/2310.16240
  • repo_url: None
  • paper_authors: Raymond Li, Gabriel Murray, Giuseppe Carenini
  • for: This paper combines two popular research areas by injecting linguistic structures into pre-trained language models under the parameter-efficient fine-tuning (PEFT) setting.
  • methods: Parallel adapter modules encoding different linguistic structures are combined through a Mixture-of-Linguistic-Experts architecture, with Gumbel-Softmax gates determining the importance of each expert at every layer (a gating sketch follows this entry). To reduce the parameter count, the model is first trained for a fixed small number of steps, after which experts are pruned based on their importance scores.
  • results: The approach outperforms state-of-the-art PEFT methods at a comparable parameter count, and an additional analysis of the experts selected at each layer offers insights for future studies.
    Abstract In this work, we propose a method that combines two popular research areas by injecting linguistic structures into pre-trained language models in the parameter-efficient fine-tuning (PEFT) setting. In our approach, parallel adapter modules encoding different linguistic structures are combined using a novel Mixture-of-Linguistic-Experts architecture, where Gumbel-Softmax gates are used to determine the importance of these modules at each layer of the model. To reduce the number of parameters, we first train the model for a fixed small number of steps before pruning the experts based on their importance scores. Our experiment results with three different pre-trained models show that our approach can outperform state-of-the-art PEFT methods with a comparable number of parameters. In addition, we provide additional analysis to examine the experts selected by each model at each layer to provide insights for future studies.
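The gating mechanism described above can be sketched in a few lines of PyTorch: parallel bottleneck adapters act as experts, and a Gumbel-Softmax gate selects among them. The sizes, the residual connection, and the per-layer (rather than per-token) gate are simplifying assumptions, not the paper's exact architecture.

```python
# Sketch of Gumbel-Softmax gating over parallel adapter "experts".
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsAdapter(nn.Module):
    def __init__(self, d_model: int = 768, n_experts: int = 3, bottleneck: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, d_model))
            for _ in range(n_experts)
        )
        # One learnable gate logit per expert for this layer.
        self.gate_logits = nn.Parameter(torch.zeros(n_experts))

    def forward(self, hidden: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Differentiable near-one-hot weights; `hard=True` picks a single expert
        # in the forward pass while keeping gradients for all gate logits.
        weights = F.gumbel_softmax(self.gate_logits, tau=tau, hard=True)
        mixed = sum(w * expert(hidden) for w, expert in zip(weights, self.experts))
        return hidden + mixed  # residual connection around the adapter

x = torch.randn(2, 16, 768)  # (batch, seq, d_model)
print(MixtureOfExpertsAdapter()(x).shape)
```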

TiC-CLIP: Continual Training of CLIP Models

  • paper_url: http://arxiv.org/abs/2310.16226
  • repo_url: None
  • paper_authors: Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, Fartash Faghri
  • for: This paper addresses the problem of continually training large vision-language foundation models on time-continuous data without retraining from scratch.
  • methods: The paper introduces the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models, including TiC-DataComp, TiC-YFCC, and TiC-RedCaps, which contain over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). The paper also introduces a simple rehearsal-based approach for efficiently training models on time-continuous data (a replay sketch follows this entry).
  • results: The paper shows that OpenAI's CLIP (trained on data up to 2020) loses approximately 8% zero-shot accuracy on the curated retrieval task from 2021-2022 compared with more recently trained models in the OpenCLIP repository, and demonstrates that the simple rehearsal-based approach can reduce compute by 2.5 times compared to the standard practice of retraining from scratch.
    Abstract Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch.
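The rehearsal recipe is simple enough to sketch: resume from the last checkpoint and train on a buffer that mixes new data with replayed old data. The helpers below (`model.train_step`, the replay ratio) are hypothetical placeholders, not the paper's released code.

```python
# Sketch of rehearsal-based continual training: continue from the latest
# checkpoint instead of retraining from scratch, replaying old data.
import random

def continual_update(model, old_pairs, new_pairs, replay_ratio=0.5, steps=1000):
    """Continue training `model` on new_pairs, replaying a fraction of old_pairs."""
    n_replay = int(len(new_pairs) * replay_ratio)
    buffer = new_pairs + random.sample(old_pairs, min(n_replay, len(old_pairs)))
    random.shuffle(buffer)
    for _ in range(steps):
        batch = random.choices(buffer, k=256)
        model.train_step(batch)  # assumed helper: one contrastive (CLIP) update
    return model
```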

Background Summarization of Event Timelines

  • paper_url: http://arxiv.org/abs/2310.16197
  • repo_url: None
  • paper_authors: Adithya Pratapa, Kevin Small, Markus Dreyer
  • for: Generating concise summaries of news events is a challenging NLP task: journalists curate timelines of key sub-events, but newcomers to a news event struggle to catch up on its historical context. This paper introduces the task of background news summarization, which complements each timeline update with a background summary of relevant preceding events.
  • methods: A dataset is constructed by merging existing timeline datasets and asking human annotators to write a background summary for each timestep of each news event; strong baselines are established with state-of-the-art summarization systems, and a query-focused variant is proposed. To evaluate background summary quality, the paper presents Background Utility Score (BUS), a question-answering-based metric measuring the percentage of questions about a current event timestep that a background summary answers (a BUS sketch follows this entry).
  • results: Experiments show the effectiveness of instruction fine-tuned systems such as Flan-T5, in addition to strong zero-shot performance from GPT-3.5.
    Abstract Generating concise summaries of news events is a challenging natural language processing task. While journalists often curate timelines to highlight key sub-events, newcomers to a news event face challenges in catching up on its historical context. In this paper, we address this need by introducing the task of background news summarization, which complements each timeline update with a background summary of relevant preceding events. We construct a dataset by merging existing timeline datasets and asking human annotators to write a background summary for each timestep of each news event. We establish strong baseline performance using state-of-the-art summarization systems and propose a query-focused variant to generate background summaries. To evaluate background summary quality, we present a question-answering-based evaluation metric, Background Utility Score (BUS), which measures the percentage of questions about a current event timestep that a background summary answers. Our experiments show the effectiveness of instruction fine-tuned systems such as Flan-T5, in addition to strong zero-shot performance using GPT-3.5.
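A BUS-style computation can be sketched as follows: pose questions about the current timeline update and count the fraction a QA model can answer from the background summary alone. The QA checkpoint and the confidence threshold are illustrative assumptions; the paper's exact implementation may differ.

```python
# Sketch of a Background Utility Score (BUS) style metric.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def bus_score(questions, background_summary, threshold=0.5):
    """Fraction of questions the background summary answers with confidence."""
    answered = 0
    for q in questions:
        result = qa(question=q, context=background_summary)
        if result["score"] >= threshold:
            answered += 1
    return answered / max(len(questions), 1)

questions = ["Who signed the ceasefire?", "When did the conflict begin?"]
print(bus_score(questions, "The conflict began in March 2014 ..."))
```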

BLP 2023 Task 2: Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2310.16183
  • repo_url: https://github.com/blp-workshop/blp_task2
  • paper_authors: Md. Arid Hasan, Firoj Alam, Anika Anjum, Shudipta Das, Afiyat Anjum
  • for: This shared task investigates sentiment detection in social media text, to better understand users' views of products and services.
  • methods: Submitted systems span classical machine learning models, fine-tuning of pre-trained models, and Large Language Models (LLMs) in zero- and few-shot settings.
  • results: The task attracted 71 participants; 29 and 30 teams submitted systems during the development and evaluation phases, respectively, for a total of 597 runs, and 15 teams submitted system description papers.
    Abstract We present an overview of the BLP Sentiment Shared Task, organized as part of the inaugural BLP 2023 workshop, co-located with EMNLP 2023. The task is defined as the detection of sentiment in a given piece of social media text. This task attracted interest from 71 participants, among whom 29 and 30 teams submitted systems during the development and evaluation phases, respectively. In total, participants submitted 597 runs. However, a total of 15 teams submitted system description papers. The range of approaches in the submitted systems spans from classical machine learning models, fine-tuning pre-trained models, to leveraging Large Language Models (LLMs) in zero- and few-shot settings. In this paper, we provide a detailed account of the task setup, including dataset development and evaluation setup. Additionally, we provide a brief overview of the systems submitted by the participants. All datasets and evaluation scripts from the shared task have been made publicly available for the research community, to foster further research in this domain.

Hidden Citations Obscure True Impact in Science

  • paper_url: http://arxiv.org/abs/2310.16181
  • repo_url: None
  • paper_authors: Xiangyi Meng, Onur Varol, Albert-László Barabási
  • for: This study examines how citations are used to gauge the impact of discoveries and identifies the phenomenon of hidden citations: clear textual credit to a discovery without a reference to the publication embodying it.
  • methods: Unsupervised interpretable machine learning is applied to the full text of each paper to systematically identify hidden citations.
  • results: For influential discoveries, hidden citations outnumber explicit citation counts, regardless of publishing venue and discipline; their prevalence is driven not by citation counts but by the degree of discourse on the topic within manuscripts, so the more a discovery is discussed, the less visible it is to standard bibliometric analysis.
    Abstract References, the mechanism scientists rely on to signal previous knowledge, lately have turned into widely used and misused measures of scientific impact. Yet, when a discovery becomes common knowledge, citations suffer from obliteration by incorporation. This leads to the concept of hidden citation, representing a clear textual credit to a discovery without a reference to the publication embodying it. Here, we rely on unsupervised interpretable machine learning applied to the full text of each paper to systematically identify hidden citations. We find that for influential discoveries hidden citations outnumber citation counts, emerging regardless of publishing venue and discipline. We show that the prevalence of hidden citations is not driven by citation counts, but rather by the degree of the discourse on the topic within the text of the manuscripts, indicating that the more discussed is a discovery, the less visible it is to standard bibliometric analysis. Hidden citations indicate that bibliometric measures offer a limited perspective on quantifying the true impact of a discovery, raising the need to extract knowledge from the full text of the scientific corpus.

WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task

  • paper_url: http://arxiv.org/abs/2310.16153
  • repo_url: None
  • paper_authors: Mustafa Jarrar, Muhammad Abdul-Mageed, Mohammed Khalilia, Bashar Talafha, AbdelRahim Elmadany, Nagham Hamad, Alaa’ Omar
  • for: This paper focuses on Arabic Named Entity Recognition (NER), providing a novel NER dataset (i.e., Wojood) and defining subtasks that facilitate meaningful comparisons between different NER approaches.
  • methods: 45 unique teams registered for the shared task, 11 of which actively participated in the test phase; specifically, 11 teams took part in FlatNER, while 8 teams tackled NestedNER.
  • results: The winning teams achieved F1 scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively (a span-F1 sketch follows this entry).
    Abstract We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering novel NER datasets (i.e., Wojood) and the definition of subtasks designed to facilitate meaningful comparisons between different NER approaches. WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in FlatNER, while 8 teams tackled NestedNER. The winning teams achieved F1 scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively.
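For context, NER shared-task leaderboards such as this one typically score entity-level F1 over BIO tag sequences; a sketch with seqeval follows. The toy tags are illustrative, not from the Wojood data, and the organizers' exact scorer may differ.

```python
# Sketch of flat-NER evaluation: entity-level F1 over BIO-tagged sequences.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]

print("entity F1:", f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```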

Can You Follow Me? Testing Situational Understanding in ChatGPT

  • paper_url: http://arxiv.org/abs/2310.16135
  • repo_url: https://github.com/yangalan123/situationaltesting
  • paper_authors: Chenghao Yang, Allyson Ettinger
  • for: This paper tests the situational understanding (SU) of ChatGPT, i.e., whether it can track and update information states across a dialogue as human-like conversational ability requires.
  • methods: The authors build a novel synthetic environment for SU testing that allows controlled, systematic assessment of chat-oriented models by checking their ability to track and enumerate environment states (a probe sketch follows this entry).
  • results: Despite the fundamental simplicity of the task, ChatGPT fails to retain correct environment states across time; follow-up analyses attribute the degradation largely to non-persistent in-context memory and susceptibility to hallucinated updates, including updates that artificially inflate accuracies, suggesting ChatGPT is not currently equipped for robust tracking of situation states.
    Abstract Understanding sentence meanings and updating information states appropriately across time -- what we call "situational understanding" (SU) -- is a critical ability for human-like AI agents. SU is essential in particular for chat models, such as ChatGPT, to enable consistent, coherent, and effective dialogue between humans and AI. Previous works have identified certain SU limitations in non-chatbot Large Language models (LLMs), but the extent and causes of these limitations are not well understood, and capabilities of current chat-based models in this domain have not been explored. In this work we tackle these questions, proposing a novel synthetic environment for SU testing which allows us to do controlled and systematic testing of SU in chat-oriented models, through assessment of models' ability to track and enumerate environment states. Our environment also allows for close analysis of dynamics of model performance, to better understand underlying causes for performance patterns. We apply our test to ChatGPT, the state-of-the-art chatbot, and find that despite the fundamental simplicity of the task, the model's performance reflects an inability to retain correct environment states across time. Our follow-up analyses suggest that performance degradation is largely because ChatGPT has non-persistent in-context memory (although it can access the full dialogue history) and it is susceptible to hallucinated updates -- including updates that artificially inflate accuracies. Our findings suggest overall that ChatGPT is not currently equipped for robust tracking of situation states, and that trust in the impressive dialogue performance of ChatGPT comes with risks. We release the codebase for reproducing our test environment, as well as all prompts and API responses from ChatGPT, at https://github.com/yangalan123/SituationalTesting.
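The flavor of such a state-tracking probe can be sketched as follows: apply a random sequence of updates to a small world, ask the model to enumerate the final state, and score against ground truth. `query_model` is a hypothetical stand-in for an API call, and the box-world format is illustrative rather than the paper's exact environment.

```python
# Sketch of a box-world state-tracking probe for situational understanding.
import random

def make_episode(n_boxes=3, n_steps=5, seed=0):
    rng = random.Random(seed)
    state = {f"box {i}": set() for i in range(n_boxes)}
    script = []
    for _ in range(n_steps):
        box = rng.choice(list(state))
        item = rng.choice(["apple", "key", "coin"])
        if item in state[box]:
            state[box].discard(item)
            script.append(f"Remove the {item} from {box}.")
        else:
            state[box].add(item)
            script.append(f"Put a {item} in {box}.")
    return script, state

script, gold = make_episode()
prompt = " ".join(script) + " Now list the contents of every box."
# answer = query_model(prompt)   # hypothetical LLM call
# score the answer against `gold` with exact-match or per-box accuracy
print(prompt)
print(gold)
```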

GenKIE: Robust Generative Multimodal Document Key Information Extraction

  • paper_url: http://arxiv.org/abs/2310.16131
  • repo_url: https://github.com/glasgow-ai4biomed/genkie
  • paper_authors: Panfeng Cao, Ye Wang, Qiang Zhang, Zaiqiao Meng
  • for: Improving key information extraction (KIE) from scanned documents.
  • methods: GenKIE is a novel generative end-to-end, sequence-to-sequence model that uses multimodal encoders to embed visual, layout, and textual features and a decoder to generate the desired output, with well-designed prompts incorporating label semantics as weakly supervised signals (a prompt-style sketch follows this entry).
  • results: Extensive experiments on multiple public real-world datasets show that GenKIE generalizes well across document types and achieves state-of-the-art results, and the generative formulation enables automatic correction of OCR errors.
    Abstract Key information extraction (KIE) from scanned documents has gained increasing attention because of its applications in various domains. Although promising results have been achieved by some recent KIE approaches, they are usually built based on discriminative models, which lack the ability to handle optical character recognition (OCR) errors and require laborious token-level labelling. In this paper, we propose a novel generative end-to-end model, named GenKIE, to address the KIE task. GenKIE is a sequence-to-sequence multimodal generative model that utilizes multimodal encoders to embed visual, layout and textual features and a decoder to generate the desired output. Well-designed prompts are leveraged to incorporate the label semantics as the weakly supervised signals and entice the generation of the key information. One notable advantage of the generative model is that it enables automatic correction of OCR errors. Besides, token-level granular annotation is not required. Extensive experiments on multiple public real-world datasets show that GenKIE effectively generalizes over different types of documents and achieves state-of-the-art results. Our experiments also validate the model's robustness against OCR errors, making GenKIE highly applicable in real-world scenarios.
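The prompting idea can be sketched with a plain text-to-text model standing in for GenKIE's multimodal encoder-decoder (no visual or layout features here, and the prompt template is an assumption); an untrained checkpoint only illustrates the prompt-and-generate interface.

```python
# Sketch of prompt-conditioned generative extraction with a seq2seq model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

ocr_text = "INVO1CE No. 4711 Date: 2023-10-24 Total: $128.00"  # noisy OCR output
prompt = f"document: {ocr_text} question: what is the invoice number?"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```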

Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation

  • paper_url: http://arxiv.org/abs/2310.16127
  • repo_url: None
  • paper_authors: AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
  • for: This work develops a toolkit for Arabic text generation capable of handling a wide range of tasks.
  • methods: A new Arabic text-to-text Transformer model, AraT5v2, is methodically trained on extensive and diverse data with an extended sequence length of 2,048 tokens, exploring unsupervised, supervised, and joint pretraining strategies in both single-task and multitask settings.
  • results: The new model outperforms competitive baselines by large margins, and the authors develop and publicly release Octopus, a Python package and command-line toolkit covering eight Arabic generation tasks with a single model.
    Abstract Understanding Arabic text and generating human-like responses is a challenging endeavor. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a novel Arabic text-to-text Transformer model, namely AraT5v2. Our new model is methodically trained on extensive and diverse data, utilizing an extended sequence length of 2,048 tokens. We explore various pretraining strategies including unsupervised, supervised, and joint pertaining, under both single and multitask settings. Our models outperform competitive baselines with large margins. We take our work one step further by developing and publicly releasing Octopus, a Python-based package and command-line toolkit tailored for eight Arabic generation tasks all exploiting a single model. We release the models and the toolkit on our public repository.

NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

  • paper_url: http://arxiv.org/abs/2310.16117
  • repo_url: None
  • paper_authors: Muhammad Abdul-Mageed, AbdelRahim Elmadany, Chiyu Zhang, El Moatez Billah Nagoudi, Houda Bouamor, Nizar Habash
  • for: NADI aims to advance state-of-the-art Arabic NLP by letting teams of researchers compete collaboratively under standardized conditions, with a focus on Arabic dialects.
  • methods: The shared task covers dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtasks 2 and 3).
  • results: All three subtasks remain challenging, motivating future work; the winning teams achieved 87.27 F1, 14.76 BLEU, and 21.10 BLEU on the three subtasks, respectively.
    Abstract We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of whom 18 teams have participated (with 76 valid submissions during test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 Bleu in Subtask 2, and 21.10 Bleu in Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

Locally Differentially Private Document Generation Using Zero Shot Prompting

  • paper_url: http://arxiv.org/abs/2310.16111
  • repo_url: None
  • paper_authors: Saiteja Utpala, Sara Hooker, Pin Yu Chen
  • for: Mitigating the privacy risks of author de-anonymization attacks on text produced with language models.
  • methods: The paper proposes DP-Prompt, a locally differentially private mechanism that leverages pretrained large language models with zero-shot prompting to counter de-anonymization attacks while minimizing the impact on downstream utility (a paraphrasing sketch follows this entry).
  • results: On the IMDB dataset, DP-Prompt (with ChatGPT) perfectly recovers the clean sentiment F1 score while reducing author-identification F1 by 46% against static attackers and 26% against adaptive attackers.
    Abstract Numerous studies have highlighted the privacy risks associated with pretrained large language models. In contrast, our research offers a unique perspective by demonstrating that pretrained large language models can effectively contribute to privacy preservation. We propose a locally differentially private mechanism called DP-Prompt, which leverages the power of pretrained large language models and zero-shot prompting to counter author de-anonymization attacks while minimizing the impact on downstream utility. When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks, showing that it surpasses existing approaches by a considerable margin despite its simpler design. For instance, in the case of the IMDB dataset, DP-Prompt (with ChatGPT) perfectly recovers the clean sentiment F1 score while achieving a 46% reduction in author identification F1 score against static attackers and a 26% reduction against adaptive attackers. We conduct extensive experiments across six open-source large language models, ranging up to 7 billion parameters, to analyze various effects of the privacy-utility tradeoff.
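The mechanism can be sketched as zero-shot paraphrasing in which sampling temperature serves as the privacy knob (a higher temperature yields noisier, more protective rewrites). The model choice and prompt wording below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a DP-Prompt style zero-shot paraphrase with temperature sampling.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def dp_paraphrase(document: str, temperature: float = 1.5) -> str:
    prompt = f"Paraphrase the following text: {document}"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, temperature=temperature,
                         max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)

print(dp_paraphrase("The movie was a dull, plodding mess from start to finish."))
```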

CR-COPEC: Causal Rationale of Corporate Performance Changes to Learn from Financial Reports

  • paper_url: http://arxiv.org/abs/2310.16095
  • repo_url: https://github.com/cr-copec/cr-copec
  • paper_authors: Ye Eun Chun, Sunjae Kwon, Kyunghwan Sohn, Nakwon Sung, Junyoup Lee, Byungki Seo, Kevin Compher, Seung-won Hwang, Jaesik Choi
  • for: This paper builds a large-scale, domain-adapted dataset of causal sentences for detecting changes in corporate financial performance.
  • methods: Causal rationales are drawn from experts' causal analyses in 10-K annual reports of U.S. companies, written in a formal manner following accounting standards; the dataset can serve individual investors and analysts as a material information resource for investing and decision making without reading through entire documents.
  • results: The dataset accounts for the distinct characteristics that affect financial performance across twelve industries, so CR-COPEC can distinguish causal sentences by taking each industry's unique narratives into consideration; the authors also analyze how well the dataset is constructed and release the data and experimental code.
    Abstract In this paper, we introduce CR-COPEC called Causal Rationale of Corporate Performance Changes from financial reports. This is a comprehensive large-scale domain-adaptation causal sentence dataset to detect financial performance changes of corporate. CR-COPEC contributes to two major achievements. First, it detects causal rationale from 10-K annual reports of the U.S. companies, which contain experts' causal analysis following accounting standards in a formal manner. This dataset can be widely used by both individual investors and analysts as material information resources for investing and decision making without tremendous effort to read through all the documents. Second, it carefully considers different characteristics which affect the financial performance of companies in twelve industries. As a result, CR-COPEC can distinguish causal sentences in various industries by taking unique narratives in each industry into consideration. We also provide an extensive analysis of how well CR-COPEC dataset is constructed and suited for classifying target sentences as causal ones with respect to industry characteristics. Our dataset and experimental codes are publicly available.

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

  • paper_url: http://arxiv.org/abs/2310.16049
  • repo_url: https://github.com/zayne-sprague/musr
  • paper_authors: Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett
  • for: Evaluating the reasoning abilities of large language models (LLMs).
  • methods: Chain-of-thought prompting is evaluated on MuSR, a dataset of multistep soft reasoning tasks specified as natural language narratives, generated with a neurosymbolic synthetic-to-natural algorithm (a CoT prompting sketch follows this entry).
  • results: Language models still fall short on these multistep soft reasoning tasks, leaving clear room for improvement in techniques like chain-of-thought.
    Abstract While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.
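A chain-of-thought evaluation loop for a MuSR-style narrative can be sketched as follows: build a prompt that elicits step-by-step reasoning, then parse the final answer. `query_model` is a hypothetical LLM call, and the "Answer: (index)" convention is an assumption, not the benchmark's required format.

```python
# Sketch of chain-of-thought prompting and answer parsing for a narrative task.
import re

def build_cot_prompt(narrative: str, question: str, choices) -> str:
    options = "\n".join(f"({i}) {c}" for i, c in enumerate(choices))
    return (
        f"{narrative}\n\n{question}\n{options}\n\n"
        "Let's think step by step, then finish with 'Answer: (index)'."
    )

def parse_answer(completion: str):
    match = re.search(r"Answer:\s*\((\d+)\)", completion)
    return int(match.group(1)) if match else None

prompt = build_cot_prompt(
    "Lord Bertram was found in the library at midnight ...",
    "Who is the most likely murderer?",
    ["the butler", "the gardener"],
)
# completion = query_model(prompt)       # hypothetical API call
# predicted  = parse_answer(completion)  # compare against the gold choice
print(prompt)
```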

Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2310.16033
  • repo_url: https://github.com/saccharomycetes/visual_crop_zsvqa
  • paper_authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
  • for: This paper investigates a limitation of multimodal Large Language Models (LLMs) on visual question answering (VQA): whether they can perceive small visual details as well as large ones.
  • methods: Multimodal LLMs are probed with questions about visual subjects of varying size, and visual cropping, first human and then automatic, is used as an inference-time mechanism to improve VQA performance (a cropping sketch follows this entry).
  • results: Zero-shot accuracy is highly sensitive to the size of the visual subject, declining by up to 46%; human visual cropping significantly mitigates this sensitivity, showing the effect is causal, and the three proposed automatic cropping methods improve zero-shot performance.
    Abstract Multimodal Large Language Models (LLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) -- a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to 46% with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. Inspired by the usefulness of human cropping, we then propose three automatic visual cropping methods as inference time mechanisms to improve the zero-shot performance of multimodal LLMs. We study their effectiveness on four popular VQA datasets, and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance. Our code and data are publicly available.
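The inference-time idea reduces to zooming into the region the question is about before querying the model; a PIL sketch follows. Here the region comes from a given bounding box, whereas the paper's automatic methods must infer it; the image and coordinates are illustrative.

```python
# Sketch of inference-time visual cropping around the question's subject.
from PIL import Image

def crop_around(image: Image.Image, box, margin=0.2) -> Image.Image:
    """Crop `image` to `box` = (left, top, right, bottom), padded by `margin`."""
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    pad_w, pad_h = margin * w, margin * h
    return image.crop((
        max(0, left - pad_w), max(0, top - pad_h),
        min(image.width, right + pad_w), min(image.height, bottom + pad_h),
    ))

img = Image.new("RGB", (640, 480))  # stand-in for a real photo
cropped = crop_around(img, box=(410, 220, 470, 280))
# answer = multimodal_model(cropped, "What color is the traffic sign?")  # hypothetical
print(cropped.size)
```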

Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation

  • paper_url: http://arxiv.org/abs/2310.15961
  • repo_url: None
  • paper_authors: Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz Odrzygóźdź, Marek Cygan
  • for: Increasing the parameter count of Transformer models while maintaining training and inference costs, as promised by Mixture of Experts (MoE) models.
  • methods: MoE activates at most a few experts of an extensive feed-forward layer per token, but the discrete expert-token matching makes these models prone to training instability and uneven expert utilization.
  • results: The paper proposes Mixture of Tokens, a fully-differentiable model that retains the benefits of MoE while avoiding these problems: instead of routing tokens to experts, it mixes tokens from different examples before feeding them to experts (a mixing sketch follows this entry).
    Abstract Despite the promise of Mixture of Experts (MoE) models in increasing parameter counts of Transformer models while maintaining training and inference costs, their application carries notable drawbacks. The key strategy of these models is to, for each processed token, activate at most a few experts - subsets of an extensive feed-forward layer. But this approach is not without its challenges. The operation of matching experts and tokens is discrete, which makes MoE models prone to issues like training instability and uneven expert utilization. Existing techniques designed to address these concerns, such as auxiliary losses or balance-aware matching, result either in lower model performance or are more difficult to train. In response to these issues, we propose Mixture of Tokens, a fully-differentiable model that retains the benefits of MoE architectures while avoiding the aforementioned difficulties. Rather than routing tokens to experts, this approach mixes tokens from different examples prior to feeding them to experts, enabling the model to learn from all token-expert combinations. Importantly, this mixing can be disabled to avoid mixing of different sequences during inference. Crucially, this method is fully compatible with both masked and causal Large Language Model training and inference.
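Cross-example mixing can be sketched as follows: tokens that share a sequence position across the batch form a group, each group is softly mixed into a single vector, the expert processes the mixture, and the output is redistributed with the same weights. The grouping rule, single expert, and shapes are simplifying assumptions, not the paper's full architecture.

```python
# Sketch of cross-example token mixing in the spirit of Mixture of Tokens.
import torch
import torch.nn as nn

class TokenMixingExpert(nn.Module):
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.weigher = nn.Linear(d_model, 1)   # importance of each token in a group
        self.expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d) -> groups along the batch dimension: (seq, batch, d)
        groups = x.transpose(0, 1)
        w = torch.softmax(self.weigher(groups), dim=1)   # (seq, batch, 1)
        mixture = (w * groups).sum(dim=1)                # one mixed token per group
        processed = self.expert(mixture)                 # (seq, d)
        # Redistribute the expert output back to every token with its weight.
        out = w * processed.unsqueeze(1)                 # (seq, batch, d)
        return out.transpose(0, 1)                       # (batch, seq, d)

print(TokenMixingExpert()(torch.randn(8, 10, 64)).shape)
```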

NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes

  • paper_url: http://arxiv.org/abs/2310.15959
  • repo_url: None
  • paper_authors: Junda Wang, Zonghai Yao, Zhichao Yang, Huixue Zhou, Rumeng Li, Xun Wang, Yucheng Xu, Hong Yu
  • for: Automating the creation of the clinical notes doctors draft after each patient's visit, using language models to reduce doctors' workload.
  • methods: NoteChat is a cooperative multi-agent framework that leverages Large Language Models (LLMs) to generate synthetic doctor-patient conversations conditioned on clinical notes, and it consists of Planning, Roleplay, and Polish modules.
  • results: Comprehensive automatic and human evaluations against state-of-the-art models, including OpenAI's ChatGPT and GPT-4, show that NoteChat produces high-quality synthetic doctor-patient conversations; it is the first instance of multiple LLMs cooperating to complete such conversations conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare.
    Abstract The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging Large Language Models (LLMs) for generating synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare.

This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

  • paper_url: http://arxiv.org/abs/2310.15941
  • repo_url: https://github.com/hitz-zentroa/this-is-not-a-dataset
  • paper_authors: Iker García-Ferrero, Begoña Altuna, Javier Álvez, Itziar Gonzalez-Dios, German Rigau
  • for: Studying how well LLMs understand negation, a crucial step in Natural Language Processing.
  • methods: A large, semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge, about two thirds of which contain negation in various forms, is used to test the generalization and inference abilities of the largest open LLMs zero-shot, and some models are fine-tuned to assess whether negation understanding can be trained.
  • results: LLMs are proficient at classifying affirmative sentences but struggle with negative ones, often relying on superficial cues; fine-tuning on negative sentences improves performance, yet the lack of generalization in handling negation persists.
    Abstract Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs understanding negation. We introduce a large semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false in which negation is present in about 2/3 of the corpus in different forms. We have used our dataset with the largest available open LLMs in a zero-shot approach to grasp their generalization and inference capability and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation is persistent, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available.

Contrastive Learning-based Sentence Encoders Implicitly Weight Informative Words

  • paper_url: http://arxiv.org/abs/2310.15921
  • repo_url: https://github.com/kuriyan1204/sentence-encoder-word-weighting
  • paper_authors: Hiroto Kurita, Goro Kobayashi, Sho Yokoi, Kentaro Inui
  • for: Understanding what characteristics sentence encoders acquire when their performance is improved by fine-tuning with a contrastive loss.
  • methods: The paper shows theoretically and experimentally that contrastive learning leads models to implicitly weight words by information-theoretic quantities, using two attribution methods (Integrated Gradients and SHAP) and two quantities (information gain and self-information) across various models and datasets.
  • results: Experiments provide empirical evidence that contrastive fine-tuning emphasizes informative words: more informative words receive greater weight, while less informative ones receive less (an InfoNCE sketch follows this entry).
    Abstract The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning using contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastive-based sentence encoders implicitly weight words based on information-theoretic quantities; that is, more informative words receive greater weight, while others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of word embedding reflects the information gain associated with the distribution of surrounding words. We also conduct comprehensive experiments using various models, multiple datasets, two methods to measure the implicit weighting of models (Integrated Gradients and SHAP), and two information-theoretic quantities (information gain and self-information). The results provide empirical evidence that contrastive fine-tuning emphasizes informative words.
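For reference, the contrastive objective such encoders are fine-tuned with is usually an InfoNCE-style loss over in-batch negatives; a minimal sketch follows (the temperature and sizes are illustrative).

```python
# Sketch of an InfoNCE-style contrastive loss for sentence-encoder fine-tuning:
# paired embeddings are pulled together while other in-batch embeddings serve
# as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05):
    """z1, z2: (batch, dim) embeddings of two views/paraphrases of each sentence."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature     # scaled cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 768), torch.randn(32, 768))
print(loss.item())
```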

In-Context Learning Creates Task Vectors

  • paper_url: http://arxiv.org/abs/2310.15916
  • repo_url: https://github.com/roeehendel/icl_task_vectors
  • paper_authors: Roee Hendel, Mor Geva, Amir Globerson
  • for: This work investigates the mechanism underlying the in-context learning (ICL) paradigm in Large Language Models (LLMs).
  • methods: A series of experiments shows that the functions learned by ICL often have a very simple structure: ICL compresses the training set $S$ into a single task vector $\boldsymbol{\theta}(S)$, which then modulates the transformer to produce the output for a query $x$ (a patching sketch follows this entry).
  • results: Comprehensive experiments across a range of models and tasks support this claim.
    Abstract In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set $S$ to find a best-fitting function $f(x)$ in some hypothesis class. Here we make progress on this problem by showing that the functions learned by ICL often have a very simple structure: they correspond to the transformer LLM whose only inputs are the query $x$ and a single "task vector" calculated from the training set. Thus, ICL can be seen as compressing $S$ into a single task vector $\boldsymbol{\theta}(S)$ and then using this task vector to modulate the transformer to produce the output. We support the above claim via comprehensive experiments across a range of models and tasks.
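The patching experiment behind this claim can be sketched with forward hooks: capture the hidden state of the demonstrations' final token at an intermediate layer, then inject it into a forward pass containing only a new query. The layer index, token position, and toy task below are assumptions; the paper sweeps such choices across models and tasks.

```python
# Sketch of task-vector extraction and patching on a GPT-2-style model.
# hidden_states[L + 1] is the output of block L, matching the hook below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 6

demos = "apple -> red\nbanana -> yellow\nlime ->"  # demonstrations + dummy query
with torch.no_grad():
    hs = model(**tok(demos, return_tensors="pt")).hidden_states
task_vector = hs[LAYER + 1][0, -1]                 # state of the final token

def patch_hook(module, inputs, output):
    output[0][0, -1] = task_vector                 # overwrite the query's last state
    return output

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(tok("sky ->", return_tensors="pt").input_ids).logits
handle.remove()
print(tok.decode(logits[0, -1].argmax()))          # ideally a color token
```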

Do Stochastic Parrots have Feelings Too? Improving Neural Detection of Synthetic Text via Emotion Recognition

  • paper_url: http://arxiv.org/abs/2310.15904
  • repo_url: https://github.com/alanagiasi/emoplmsynth
  • paper_authors: Alan Cowap, Yvette Graham, Jennifer Foster
  • for: This paper aims to address the issue of identifying synthetic text generated by high-performance generative AI models, specifically by leveraging the emotional content present in human-authored text.
  • methods: The authors fine-tune pre-trained language models (PLMs) on emotion to develop an emotionally-aware detector, which is tested on various synthetic text generators, model sizes, datasets, and domains.
  • results: The emotionally-aware detector achieves significant improvements in identifying synthetic text, particularly when compared to ChatGPT, reinforcing the potential of emotion as a signal for identifying synthetic text.
    Abstract Recent developments in generative AI have shone a spotlight on high-performance synthetic text generation technologies. The now wide availability and ease of use of such models highlights the urgent need to provide equally powerful technologies capable of identifying synthetic text. With this in mind, we draw inspiration from psychological studies which suggest that people can be driven by emotion and encode emotion in the text they compose. We hypothesize that pretrained language models (PLMs) have an affective deficit because they lack such an emotional driver when generating text and consequently may generate synthetic text which has affective incoherence i.e. lacking the kind of emotional coherence present in human-authored text. We subsequently develop an emotionally aware detector by fine-tuning a PLM on emotion. Experiment results indicate that our emotionally-aware detector achieves improvements across a range of synthetic text generators, various sized models, datasets, and domains. Finally, we compare our emotionally-aware synthetic text detector to ChatGPT in the task of identification of its own output and show substantial gains, reinforcing the potential of emotion as a signal to identify synthetic text. Code, models, and datasets are available at https://github.com/alanagiasi/emoPLMsynth

BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT

  • paper_url: http://arxiv.org/abs/2310.15896
  • repo_url: https://github.com/scutcyr/bianque
  • paper_authors: Yirong Chen, Zhenyu Wang, Xiaofen Xing, huimin zheng, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, Xiangmin Xu
  • for: The paper aims to improve the chain of questioning (CoQ) of large language models (LLMs) in providing personalized and effective health suggestions.
  • methods: The proposed BianQue model is a ChatGLM-based LLM finetuned with the self-constructed health conversation dataset BianQueCorpus, which includes multiple turns of questioning and health suggestions polished by ChatGPT.
  • results: Experimental results demonstrate that the proposed BianQue can simultaneously balance the capabilities of both questioning and health suggestions, which will help promote the research and application of LLMs in the field of proactive health.
    Abstract Large language models (LLMs) have performed well in providing general and extensive health suggestions in single-turn conversations, exemplified by systems such as ChatGPT, ChatGLM, ChatDoctor, DoctorGLM, and etc. However, the limited information provided by users during single turn results in inadequate personalization and targeting of the generated suggestions, which requires users to independently select the useful part. It is mainly caused by the missing ability to engage in multi-turn questioning. In real-world medical consultations, doctors usually employ a series of iterative inquiries to comprehend the patient's condition thoroughly, enabling them to provide effective and personalized suggestions subsequently, which can be defined as chain of questioning (CoQ) for LLMs. To improve the CoQ of LLMs, we propose BianQue, a ChatGLM-based LLM finetuned with the self-constructed health conversation dataset BianQueCorpus that is consist of multiple turns of questioning and health suggestions polished by ChatGPT. Experimental results demonstrate that the proposed BianQue can simultaneously balance the capabilities of both questioning and health suggestions, which will help promote the research and application of LLMs in the field of proactive health.

A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.18363
  • repo_url: None
  • paper_authors: Fathima Abdul Rahman, Guang Lu
  • for: This work aims to improve the emotion recognition of conversational agents so they can respond considerately and provide a more human-like interaction.
  • methods: The conER-GRL system partitions conversations into smaller groups of utterances, extracts multimodal (audio, visual, textual) features with Gated Recurrent Units (GRUs), and cascade-trains Graph Convolutional Networks (GCNs) with Reinforcement Learning (RL) agents to capture the complex dependencies of emotion features in interactive scenarios.
  • results: On the benchmark IEMOCAP dataset, conER-GRL outperforms other state-of-the-art models at recognizing emotions in real time from multimodal conversational signals.
    Abstract Owing to the recent developments in Generative Artificial Intelligence (GenAI) and Large Language Models (LLM), conversational agents are becoming increasingly popular and accepted. They provide a human touch by interacting in ways familiar to us and by providing support as virtual companions. Therefore, it is important to understand the user's emotions in order to respond considerately. Compared to the standard problem of emotion recognition, conversational agents face an additional constraint in that recognition must be real-time. Studies on model architectures using audio, visual, and textual modalities have mainly focused on emotion classification using full video sequences that do not provide online features. In this work, we present a novel paradigm for contextualized Emotion Recognition using Graph Convolutional Network with Reinforcement Learning (conER-GRL). Conversations are partitioned into smaller groups of utterances for effective extraction of contextual information. The system uses Gated Recurrent Units (GRU) to extract multimodal features from these groups of utterances. More importantly, Graph Convolutional Networks (GCN) and Reinforcement Learning (RL) agents are cascade trained to capture the complex dependencies of emotion features in interactive scenarios. Comparing the results of the conER-GRL model with other state-of-the-art models on the benchmark dataset IEMOCAP demonstrates the advantageous capabilities of the conER-GRL architecture in recognizing emotions in real-time from multimodal conversational signals.

SoK: Memorization in General-Purpose Large Language Models

  • paper_url: http://arxiv.org/abs/2310.18362
  • repo_url: None
  • paper_authors: Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, Robert West
  • for: This work systematizes knowledge about memorization in large language models (LLMs) across their rapidly expanding range of applications.
  • methods: The paper proposes a taxonomy of memorization in LLMs covering verbatim text, facts, ideas and algorithms, writing styles, distributional properties, and alignment goals.
  • results: Each type of memorization carries both positive and negative implications for model performance, privacy, security and confidentiality, copyright, and auditing; the paper discusses ways to detect and prevent memorization and highlights the challenges of defining memorization with respect to model behavior rather than model weights, due to LLM-specific phenomena such as reasoning capabilities or differences between decoding algorithms.
    Abstract Large Language Models (LLMs) are advancing at a remarkable pace, with myriad applications under development. Unlike most earlier machine learning models, they are no longer built for one specific application but are designed to excel in a wide range of tasks. A major part of this success is due to their huge training datasets and the unprecedented number of model parameters, which allow them to memorize large amounts of information contained in the training data. This memorization goes beyond mere language, and encompasses information only present in a few documents. This is often desirable since it is necessary for performing tasks such as question answering, and therefore an important part of learning, but also brings a whole array of issues, from privacy and security to copyright and beyond. LLMs can memorize short secrets in the training data, but can also memorize concepts like facts or writing styles that can be expressed in text in many different ways. We propose a taxonomy for memorization in LLMs that covers verbatim text, facts, ideas and algorithms, writing styles, distributional properties, and alignment goals. We describe the implications of each type of memorization - both positive and negative - for model performance, privacy, security and confidentiality, copyright, and auditing, and ways to detect and prevent memorization. We further highlight the challenges that arise from the predominant way of defining memorization with respect to model behavior instead of model weights, due to LLM-specific phenomena such as reasoning capabilities or differences between decoding algorithms. Throughout the paper, we describe potential risks and opportunities arising from memorization in LLMs that we hope will motivate new research directions.
    摘要 LLMs 可以记忆短语、事实、写作风格、分布性、对齐目标等。我们提出了 LLMs 的记忆分类,并描述了每种记忆的正面和负面影响,包括模型性能、隐私、安全、版权等方面。我们还描述了如何检测和预防记忆。然而,由于 LLMs 的特殊性,如推理能力或decoding算法的差异,我们需要更加注意记念的定义方式。在这篇论文中,我们描述了 LLMs 的记忆所带来的风险和机遇,希望能够激发新的研究方向。

Self-Guard: Empower the LLM to Safeguard Itself

  • paper_url: http://arxiv.org/abs/2310.15851
  • repo_url: None
  • paper_authors: Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong
  • for: Defending Large Language Models (LLMs) against jailbreak attacks, which bypass safety measures to generate harmful content with negative societal consequences.
  • methods: Two main existing approaches address jailbreaks: safety training, which further trains the LLM to enhance its safety, and safeguards, which use external models or filters to prevent harmful outputs; however, safety training adapts poorly to new attack types and often degrades model performance, while safeguards have proven to be of limited help. The proposed Self-Guard combines the strengths of both: it first enhances the model's ability to assess harmful content, then instructs the model to consistently perform harmful-content detection on its own responses (a sketch follows this entry).
  • results: Experiments show Self-Guard is robust against jailbreak attacks; bad-case analysis finds the LLM occasionally provides harmless responses to harmful queries; evaluations of general capability before and after safety training show no performance degradation, and sensitivity tests show Self-Guard avoids inducing over-sensitivity and can even mitigate it.
    Abstract The jailbreak attack can bypass the safety measures of a Large Language Model (LLM), generating harmful content. This misuse of LLM has led to negative societal consequences. Currently, there are two main approaches to address jailbreak attacks: safety training and safeguards. Safety training focuses on further training LLM to enhance its safety. On the other hand, safeguards involve implementing external models or filters to prevent harmful outputs. However, safety training has constraints in its ability to adapt to new attack types and often leads to a drop in model performance. Safeguards have proven to be of limited help. To tackle these issues, we propose a novel approach called Self-Guard, which combines the strengths of both safety methods. Self-Guard includes two stages. In the first stage, we enhance the model's ability to assess harmful content, and in the second stage, we instruct the model to consistently perform harmful content detection on its own responses. The experiment has demonstrated that Self-Guard is robust against jailbreak attacks. In the bad case analysis, we find that LLM occasionally provides harmless responses to harmful queries. Additionally, we evaluated the general capabilities of the LLM before and after safety training, providing evidence that Self-Guard does not result in the LLM's performance degradation. In sensitivity tests, Self-Guard not only avoids inducing over-sensitivity in LLM but also can even mitigate this issue.

Unnatural language processing: How do language models handle machine-generated prompts?

  • paper_url: http://arxiv.org/abs/2310.15829
  • repo_url: None
  • paper_authors: Corentin Kervadec, Francesca Franzon, Marco Baroni
  • for: Studying how language models respond to machine-generated, rather than natural-language, prompts.
  • methods: Probes models of different sizes on multiple semantic tasks with continuous and discrete machine-generated token sequences, comparing their behavior to that on human-written natural-language prompts.
  • results: Machine-generated token sequences routinely outperform manually crafted prompts, yet trigger different response patterns inside the network; the findings suggest that only natural-language prompts recruit a genuinely linguistic circuit.
    Abstract Language model prompt optimization research has shown that semantically and grammatically well-formed manually crafted prompts are routinely outperformed by automatically generated token sequences with no apparent meaning or syntactic structure, including sequences of vectors from a model's embedding space. We use machine-generated prompts to probe how models respond to input that is not composed of natural language expressions. We study the behavior of models of different sizes in multiple semantic tasks in response to both continuous and discrete machine-generated prompts, and compare it to the behavior in response to human-generated natural-language prompts. Even when producing a similar output, machine-generated and human prompts trigger different response patterns through the network processing pathways, including different perplexities, different attention and output entropy distributions, and different unit activation profiles. We provide preliminary insight into the nature of the units activated by different prompt types, suggesting that only natural language prompts recruit a genuinely linguistic circuit.
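One of the reported diagnostics, the perplexity gap between prompt types, is easy to reproduce in spirit. A small sketch with GPT-2 as a stand-in model; both prompt strings are illustrative, not taken from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss   # mean per-token negative log-likelihood
    return torch.exp(loss).item()

human = "Classify the sentiment of the following review as positive or negative."
machine = "sentiment posneg ## review>> classify::"   # machine-style token soup
print(perplexity(human), perplexity(machine))         # expect a large gap
```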

Generative Language Models Exhibit Social Identity Biases

  • paper_url: http://arxiv.org/abs/2310.15819
  • repo_url: None
  • paper_authors: Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander van der Linden, Jon Roozenbeek
  • for: Investigating whether modern large language models exhibit fundamental social identity biases (ingroup solidarity and outgroup hostility) learned from human language.
  • methods: Prompts 51 large language models with sentence completions (e.g., "We are...") and finds that almost all foundational models and some instruction fine-tuned models show clear ingroup-positive and outgroup-negative biases.
  • results: Exposing models during fine-tuning to more ingroup-positive or outgroup-negative sentences markedly increases ingroup solidarity and, even more, outgroup hostility; removing such sentences from the fine-tuning data significantly reduces both. This suggests the biases are fundamental to current models but can be mitigated by curating training data.
    Abstract The surge in popularity of large language models has given rise to concerns about biases that these models could learn from humans. In this study, we investigate whether ingroup solidarity and outgroup hostility, fundamental social biases known from social science, are present in 51 large language models. We find that almost all foundational language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative biases when prompted to complete sentences (e.g., "We are..."). A comparison of LLM-generated sentences with human-written sentences on the internet reveals that these models exhibit similar level, if not greater, levels of bias than human text. To investigate where these biases stem from, we experimentally varied the amount of ingroup-positive or outgroup-negative sentences the model was exposed to during fine-tuning in the context of the United States Democrat-Republican divide. Doing so resulted in the models exhibiting a marked increase in ingroup solidarity and an even greater increase in outgroup hostility. Furthermore, removing either ingroup-positive or outgroup-negative sentences (or both) from the fine-tuning data leads to a significant reduction in both ingroup solidarity and outgroup hostility, suggesting that biases can be reduced by removing biased training data. Our findings suggest that modern language models exhibit fundamental social identity biases and that such biases can be mitigated by curating training data. Our results have practical implications for creating less biased large-language models and further underscore the need for more research into user interactions with LLMs to prevent potential bias reinforcement in humans.
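The sentence-completion probe can be approximated as below; this is a hedged sketch, with GPT-2 and the library's default sentiment classifier standing in for the 51 models and the paper's actual measurement:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # stand-in model
sentiment = pipeline("sentiment-analysis")              # default SST-2 classifier

def ingroup_outgroup_gap(n=50):
    """Positive-completion rate for 'We are' minus that for 'They are'."""
    rate = {}
    for prompt in ("We are", "They are"):
        outs = generator(prompt, max_new_tokens=20, do_sample=True,
                         num_return_sequences=n, pad_token_id=50256)
        labels = sentiment([o["generated_text"] for o in outs])
        rate[prompt] = sum(l["label"] == "POSITIVE" for l in labels) / n
    return rate["We are"] - rate["They are"]  # > 0 suggests ingroup-positive bias
```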

BLESS: Benchmarking Large Language Models on Sentence Simplification

  • paper_url: http://arxiv.org/abs/2310.15773
  • repo_url: https://github.com/zurichnlp/bless
  • paper_authors: Tannon Kew, Alison Chi, Laura Vásquez-Rodríguez, Sweta Agrawal, Dennis Aumiller, Fernando Alva-Manchego, Matthew Shardlow
  • for: Benchmarking how well the latest large language models (LLMs) perform on text simplification (TS) and whether off-the-shelf models can solve this challenging task.
  • methods: Evaluates 44 models differing in size, architecture, pre-training method, and accessibility on three test sets from different domains (Wikipedia, news, and medical) in a few-shot setting.
  • results: The best LLMs, despite not being trained for TS, perform comparably with state-of-the-art TS baselines, and some models show a wider and more diverse range of edit operations. The benchmark is released as a resource for developing future TS methods and evaluation metrics.
    Abstract We present BLESS, a comprehensive performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We examine how well off-the-shelf LLMs can solve this challenging task, assessing a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our analysis considers a suite of automatic metrics as well as a large-scale quantitative investigation into the types of common edit operations performed by the different models. Furthermore, we perform a manual qualitative analysis on a subset of model outputs to better gauge the quality of the generated simplifications. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines. Additionally, we find that certain LLMs demonstrate a greater range and diversity of edit operations. Our performance benchmark will be available as a resource for the development of future TS methods and evaluation metrics.

Learning From Free-Text Human Feedback – Collect New Datasets Or Extend Existing Ones?

  • paper_url: http://arxiv.org/abs/2310.15758
  • repo_url: https://github.com/ukplab/emnlp2023-learning-from-free-text-human-feedback
  • paper_authors: Dominic Petrak, Nafise Sadat Moosavi, Ye Tian, Nikolai Rozanov, Iryna Gurevych
  • for: Studying the feasibility of learning from free-text human feedback in dialog systems, and of augmenting existing dialog datasets with the necessary annotations instead of collecting new ones from scratch.
  • methods: Investigates the types and frequency of free-text human feedback in commonly used dialog datasets, including MultiWoZ, SGD, BABI, PersonaChat, Wizards-of-Wikipedia, and the human-bot split of the Self-Feeding Chatbot, and derives new taxonomies for annotating such feedback.
  • results: The examined datasets differ in the types and frequency of free-text human feedback they contain; the new taxonomies enable their annotation, and experiments with three state-of-the-art language generation models (GPT-2, LLAMA, and Flan-T5) measure the impact of including such data in response generation.
    Abstract Learning from free-text human feedback is essential for dialog systems, but annotated data is scarce and usually covers only a small fraction of error types known in conversational AI. Instead of collecting and annotating new datasets from scratch, recent advances in synthetic dialog generation could be used to augment existing dialog datasets with the necessary annotations. However, to assess the feasibility of such an effort, it is important to know the types and frequency of free-text human feedback included in these datasets. In this work, we investigate this question for a variety of commonly used dialog datasets, including MultiWoZ, SGD, BABI, PersonaChat, Wizards-of-Wikipedia, and the human-bot split of the Self-Feeding Chatbot. Using our observations, we derive new taxonomies for the annotation of free-text human feedback in dialogs and investigate the impact of including such data in response generation for three SOTA language generation models, including GPT-2, LLAMA, and Flan-T5. Our findings provide new insights into the composition of the datasets examined, including error types, user response types, and the relations between them.

Do Differences in Values Influence Disagreements in Online Discussions?

  • paper_url: http://arxiv.org/abs/2310.15757
  • repo_url: https://github.com/m0re4u/value-disagreement
  • paper_authors: Michiel van der Meer, Piek Vossen, Catholijn M. Jonker, Pradeep K. Murukannaiah
  • for: Investigating whether disagreements in online discussions are linked to differences in personal values, and how existing models can estimate those values.
  • methods: Uses state-of-the-art language models to estimate personal values in online discussions, aggregates the estimates into value profiles, and evaluates the profiles against human-annotated agreement labels.
  • results: The dissimilarity of value profiles correlates with disagreement in specific cases, and including value information in agreement prediction improves performance.
    Abstract Disagreements are common in online discussions. Disagreement may foster collaboration and improve the quality of a discussion under some conditions. Although there exist methods for recognizing disagreement, a deeper understanding of factors that influence disagreement is lacking in the literature. We investigate a hypothesis that differences in personal values are indicative of disagreement in online discussions. We show how state-of-the-art models can be used for estimating values in online discussions and how the estimated values can be aggregated into value profiles. We evaluate the estimated value profiles based on human-annotated agreement labels. We find that the dissimilarity of value profiles correlates with disagreement in specific cases. We also find that including value information in agreement prediction improves performance.
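A minimal sketch of the aggregation step described above; `estimate_values` is an assumed stand-in for the paper's value-estimation model, mapping a text to a vector of value scores (e.g., over Schwartz's basic values):

```python
import numpy as np

def value_profile(author_texts, estimate_values):
    """Average per-text value scores into a single profile vector per author."""
    return np.mean([estimate_values(t) for t in author_texts], axis=0)

def profile_dissimilarity(p1, p2):
    """Cosine dissimilarity between two value profiles (higher = more different)."""
    cos = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2) + 1e-9)
    return 1.0 - cos
```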

Failures Pave the Way: Enhancing Large Language Models through Tuning-free Rule Accumulation

  • paper_url: http://arxiv.org/abs/2310.15746
  • repo_url: https://github.com/thunlp-mt/tran
  • paper_authors: Zeyuan Yang, Peng Li, Yang Liu
  • for: Improving the performance of Large Language Models (LLMs) without any parameter tuning.
  • methods: The Tuning-free Rule Accumulation (TRAN) framework, in which the LLM gradually accumulates rules from its incorrect cases and uses the resulting rule collection to avoid similar mistakes on subsequent inputs.
  • results: Experiments show that TRAN improves over recent baselines by a large margin.
    Abstract Large Language Models (LLMs) have showcased impressive performance. However, due to their inability to capture relationships among samples, these frozen LLMs inevitably keep repeating similar mistakes. In this work, we propose our Tuning-free Rule Accumulation (TRAN) framework, which guides LLMs in improving their performance by learning from previous mistakes. Considering data arrives sequentially, LLMs gradually accumulate rules from incorrect cases, forming a rule collection. These rules are then utilized by the LLMs to avoid making similar mistakes when processing subsequent inputs. Moreover, the rules remain independent of the primary prompts, seamlessly complementing prompt design strategies. Experimentally, we show that TRAN improves over recent baselines by a large margin.
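A rough sketch of the rule-accumulation loop as the abstract describes it, with `llm_complete` as an assumed black-box LLM call and a simplified rule-distillation prompt:

```python
def tran_step(llm_complete, question, rules, gold=None):
    """Answer with the accumulated rules prepended; on a mistake, ask the
    model to distill a corrective rule and add it to the collection."""
    prompt = "Follow these rules when answering:\n"
    prompt += "\n".join(f"- {r}" for r in rules)
    prompt += f"\n\nQuestion: {question}\nAnswer:"
    answer = llm_complete(prompt).strip()
    if gold is not None and answer != gold:   # feedback available and answer wrong
        rule = llm_complete(
            f"Question: {question}\nWrong answer: {answer}\n"
            f"Correct answer: {gold}\n"
            "State one general rule that would have avoided this mistake:"
        ).strip()
        rules.append(rule)                    # rules live outside the frozen weights
    return answer, rules
```

Because the rules are plain text kept outside the model, they complement rather than replace prompt design, matching the abstract's claim that no tuning is required.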

RAPL: A Relation-Aware Prototype Learning Approach for Few-Shot Document-Level Relation Extraction

  • paper_url: http://arxiv.org/abs/2310.15743
  • repo_url: None
  • paper_authors: Shiao Meng, Xuming Hu, Aiwei Liu, Shu’ang Li, Fukun Ma, Yawen Yang, Lijie Wen
  • for: Improving the accuracy of few-shot document-level relation extraction (FSDLRE), i.e., identifying semantic relations among entities when only a few labeled documents are available.
  • methods: A relation-aware prototype learning method within a metric-based meta-learning framework: relation descriptions and realistic NOTA (none-of-the-above) instances guide the refinement of relation prototypes and the generation of task-specific NOTA prototypes.
  • results: Outperforms state-of-the-art approaches by an average of 2.61% $F_1$ across various settings of two FSDLRE benchmarks.
    Abstract How to identify semantic relations among entities in a document when only a few labeled documents are available? Few-shot document-level relation extraction (FSDLRE) is crucial for addressing the pervasive data scarcity problem in real-world scenarios. Metric-based meta-learning is an effective framework widely adopted for FSDLRE, which constructs class prototypes for classification. However, existing works often struggle to obtain class prototypes with accurate relational semantics: 1) To build prototype for a target relation type, they aggregate the representations of all entity pairs holding that relation, while these entity pairs may also hold other relations, thus disturbing the prototype. 2) They use a set of generic NOTA (none-of-the-above) prototypes across all tasks, neglecting that the NOTA semantics differs in tasks with different target relation types. In this paper, we propose a relation-aware prototype learning method for FSDLRE to strengthen the relational semantics of prototype representations. By judiciously leveraging the relation descriptions and realistic NOTA instances as guidance, our method effectively refines the relation prototypes and generates task-specific NOTA prototypes. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by average 2.61% $F_1$ across various settings of two FSDLRE benchmarks.

Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules

  • paper_url: http://arxiv.org/abs/2310.15724
  • repo_url: https://github.com/thunlp/compression-plugin
  • paper_authors: Chaojun Xiao, Yuqi Luo, Wenbin Zhang, Pengle Zhang, Xu Han, Yankai Lin, Zhengyan Zhang, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou
  • for: Improving the computational efficiency of NLP tasks by reducing the cost of pre-trained language models without shrinking them.
  • methods: Plug-and-play compression plugins that reduce the sequence length by compressing multiple hidden vectors into one; the plugins are trained while the original PLM stays frozen.
  • results: Validated on seven datasets, Variator saves 53% of computational cost with only 0.9% additional parameters and a performance drop of less than 2%.
    Abstract Pre-trained language models (PLMs) have achieved remarkable results on NLP tasks but at the expense of huge parameter sizes and the consequent computational costs. In this paper, we propose Variator, a parameter-efficient acceleration method that enhances computational efficiency through plug-and-play compression plugins. Compression plugins are designed to reduce the sequence length via compressing multiple hidden vectors into one and trained with original PLMs frozen. Different from traditional model acceleration methods, which compress PLMs to smaller sizes, Variator offers two distinct advantages: (1) In real-world applications, the plug-and-play nature of our compression plugins enables dynamic selection of different compression plugins with varying acceleration ratios based on the current workload. (2) The compression plugin comprises a few compact neural network layers with minimal parameters, significantly saving storage and memory overhead, particularly in scenarios with a growing number of tasks. We validate the effectiveness of Variator on seven datasets. Experimental results show that Variator can save 53% computational costs using only 0.9% additional parameters with a performance drop of less than 2%. Moreover, when the model scales to billions of parameters, Variator matches the strong performance of uncompressed PLMs.
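A toy PyTorch rendering of a compression plugin that merges consecutive hidden vectors, trained with the PLM frozen; the grouping ratio and single linear projection are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressionPlugin(nn.Module):
    """Shorten a hidden-state sequence by merging every `ratio` consecutive
    vectors into one; the plugin is trained while the PLM stays frozen."""
    def __init__(self, hidden_size=768, ratio=4):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(hidden_size * ratio, hidden_size)

    def forward(self, h):                    # h: (batch, seq, hidden)
        b, s, d = h.shape
        pad = (-s) % self.ratio              # pad so seq divides evenly
        h = F.pad(h, (0, 0, 0, pad))
        h = h.view(b, -1, self.ratio * d)    # group `ratio` neighbours together
        return self.proj(h)                  # (batch, ceil(seq/ratio), hidden)

print(CompressionPlugin()(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 3, 768])
```

Because the plugin is a few compact layers, several variants with different ratios can be kept around and swapped in dynamically based on the current workload, as the abstract describes.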

Re-Temp: Relation-Aware Temporal Representation Learning for Temporal Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2310.15722
  • repo_url: None
  • paper_authors: Kunze Wang, Soyeon Caren Han, Josiah Poon
  • for: Temporal Knowledge Graph Completion under the extrapolation setting, i.e., predicting the missing entity of a future fact.
  • methods: Re-Temp leverages explicit temporal embeddings as input and a skip information flow after each timestamp to discard information irrelevant to the prediction, plus a two-phase forward propagation method to prevent information leakage.
  • results: Outperforms all eight recent state-of-the-art models by a significant margin on six TKGC (extrapolation) datasets.
    Abstract Temporal Knowledge Graph Completion (TKGC) under the extrapolation setting aims to predict the missing entity from a fact in the future, posing a challenge that aligns more closely with real-world prediction problems. Existing research mostly encodes entities and relations using sequential graph neural networks applied to recent snapshots. However, these approaches tend to overlook the ability to skip irrelevant snapshots according to entity-related relations in the query and disregard the importance of explicit temporal information. To address this, we propose our model, Re-Temp (Relation-Aware Temporal Representation Learning), which leverages explicit temporal embedding as input and incorporates skip information flow after each timestamp to skip unnecessary information for prediction. Additionally, we introduce a two-phase forward propagation method to prevent information leakage. Through the evaluation on six TKGC (extrapolation) datasets, we demonstrate that our model outperforms all eight recent state-of-the-art models by a significant margin.

Ensemble of Task-Specific Language Models for Brain Encoding

  • paper_url: http://arxiv.org/abs/2310.15720
  • repo_url: https://github.com/jr-john/ensemble_brain_encoders
  • paper_authors: Sanjai Kumaran, Arvindh Arun, Jerrin John
  • for: Improving how well language-model representations predict fMRI brain responses.
  • methods: Transfer learning from representations learned for ten popular natural language processing tasks (2 syntactic and 8 semantic), combined into an ensemble model.
  • results: Beats the current baselines by 10% on average across all Regions of Interest (ROIs) through ensembling.
    Abstract Language models have been shown to be rich enough to encode fMRI activations of certain Regions of Interest in our Brains. Previous works have explored transfer learning from representations learned for popular natural language processing tasks for predicting brain responses. In our work, we improve the performance of such encoders by creating an ensemble model out of 10 popular Language Models (2 syntactic and 8 semantic). We beat the current baselines by 10% on average across all ROIs through our ensembling methods.

Enhancing Biomedical Lay Summarisation with External Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2310.15702
  • repo_url: https://github.com/tgoldsack1/enhancing_biomedical_lay_summarisation_with_external_knowledge_graphs
  • paper_authors: Tomas Goldsack, Zhihao Zhang, Chen Tang, Carolina Scarton, Chenghua Lin
  • for: Automatically producing lay summaries that make technical biomedical articles accessible to a non-expert audience.
  • methods: Augments eLife, an existing biomedical lay summarisation dataset, with article-specific knowledge graphs, and systematically investigates three approaches for incorporating them, each targeting a distinct area of the encoder-decoder architecture.
  • results: Integrating graph-based domain knowledge substantially increases the readability of the generated text and improves the explanation of technical concepts, as shown by both automatic and human evaluations.
    Abstract Previous approaches for automatic lay summarisation are exclusively reliant on the source article that, given it is written for a technical audience (e.g., researchers), is unlikely to explicitly define all technical concepts or state all of the background information that is relevant for a lay audience. We address this issue by augmenting eLife, an existing biomedical lay summarisation dataset, with article-specific knowledge graphs, each containing detailed information on relevant biomedical concepts. Using both automatic and human evaluations, we systematically investigate the effectiveness of three different approaches for incorporating knowledge graphs within lay summarisation models, with each method targeting a distinct area of the encoder-decoder model architecture. Our results confirm that integrating graph-based domain knowledge can significantly benefit lay summarisation by substantially increasing the readability of generated text and improving the explanation of technical concepts.

COPF: Continual Learning Human Preference through Optimal Policy Fitting

  • paper_url: http://arxiv.org/abs/2310.15694
  • repo_url: None
  • paper_authors: Han Zhang, Lin Gui, Yuanzhao Zhai, Hui Wang, Yu Lei, Ruifeng Xu
  • for: Keeping pre-trained Language Models (LMs) aligned with human preferences as those preferences change, without full retraining.
  • methods: Continual Optimal Policy Fitting (COPF): a series of optimal policies is estimated with the Monte Carlo method, and the policy sequence is then continually fitted with function regularization. It needs a single learning phase and no complex reinforcement learning, and, like RLHF, can learn from unlabeled data.
  • results: COPF consistently outperforms strong continual learning (CL) baselines in aligning with human preferences across different tasks and domains.
    Abstract The technique of Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method to improve pre-trained Language Models (LM), enhancing their ability to conform to human preferences. Nevertheless, the current RLHF-based LMs necessitate full retraining each time novel queries or feedback are introduced, which becomes a challenging task because human preferences can vary between different domains or tasks. Retraining LMs poses practical difficulties in many real-world situations due to the significant time and computational resources required, along with concerns related to data privacy. To address this limitation, we propose a new method called Continual Optimal Policy Fitting (COPF), in which we estimate a series of optimal policies using the Monte Carlo method, and then continually fit the policy sequence with the function regularization. COPF involves a single learning phase and doesn't necessitate complex reinforcement learning. Importantly, it shares the capability with RLHF to learn from unlabeled data, making it flexible for continual preference learning. Our experimental results show that COPF outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences on different tasks and domains.

Creating a silver standard for patent simplification

  • paper_url: http://arxiv.org/abs/2310.15689
  • repo_url: https://github.com/slvcsl/patentsilverstandard
  • paper_authors: Silvia Casola, Alberto Lavelli, Horacio Saggion
  • for: Automatically simplifying patent text through rephrasing, to make patents more accessible to humans and machines.
  • methods: Since no in-domain parallel simplification data exist, candidates are produced with a general-domain paraphrasing system and paired with purpose-built filters to construct a cleaner, large-scale silver standard for patent sentences.
  • results: Human evaluation shows that the synthetic silver corpus is grammatical, adequate, and contains simple sentences.
    Abstract Patents are legal documents that aim at protecting inventions on the one hand and at making technical knowledge circulate on the other. Their complex style -- a mix of legal, technical, and extremely vague language -- makes their content hard to access for humans and machines and poses substantial challenges to the information retrieval community. This paper proposes an approach to automatically simplify patent text through rephrasing. Since no in-domain parallel simplification data exist, we propose a method to automatically generate a large-scale silver standard for patent sentences. To obtain candidates, we use a general-domain paraphrasing system; however, the process is error-prone and difficult to control. Thus, we pair it with proper filters and construct a cleaner corpus that can successfully be used to train a simplification system. Human evaluation of the synthetic silver corpus shows that it is considered grammatical, adequate, and contains simple sentences.

Prevalence and prevention of large language model use in crowd work

  • paper_url: http://arxiv.org/abs/2310.15683
  • repo_url: None
  • paper_authors: Veniamin Veselovsky, Manoel Horta Ribeiro, Philip Cozzolino, Andrew Gordon, David Rothschild, Robert West
  • for: The paper is written to investigate the use of large language models (LLMs) among crowd workers and to develop targeted mitigation strategies to reduce LLM use.
  • methods: The paper uses a text summarization task where workers were not directed in any way regarding their LLM use, and compares the estimated prevalence of LLM use with and without targeted mitigation strategies. The paper also conducts secondary analyses to explore the impact of LLM use on the quality and homogeneity of responses.
  • results: The paper finds that targeted mitigation strategies can significantly reduce, but not eliminate, LLM use among crowd workers. The paper also finds that LLM use yields high-quality but homogeneous responses, which may harm research concerned with human (rather than model) behavior and degrade future models trained with crowdsourced data. Additionally, the paper finds that preventing LLM use may be at odds with obtaining high-quality responses.
    Abstract We show that the use of large language models (LLMs) is prevalent among crowd workers, and that targeted mitigation strategies can significantly reduce, but not eliminate, LLM use. On a text summarization task where workers were not directed in any way regarding their LLM use, the estimated prevalence of LLM use was around 30%, but was reduced by about half by asking workers to not use LLMs and by raising the cost of using them, e.g., by disabling copy-pasting. Secondary analyses give further insight into LLM use and its prevention: LLM use yields high-quality but homogeneous responses, which may harm research concerned with human (rather than model) behavior and degrade future models trained with crowdsourced data. At the same time, preventing LLM use may be at odds with obtaining high-quality responses; e.g., when requesting workers not to use LLMs, summaries contained fewer keywords carrying essential information. Our estimates will likely change as LLMs increase in popularity or capabilities, and as norms around their usage change. Yet, understanding the co-evolution of LLM-based tools and users is key to maintaining the validity of research done using crowdsourcing, and we provide a critical baseline before widespread adoption ensues.

How Much Context Does My Attention-Based ASR System Need?

  • paper_url: http://arxiv.org/abs/2310.15672
  • repo_url: https://github.com/robflynnyh/long-context-asr
  • paper_authors: Robert Flynn, Anton Ragni
  • for: Examining how much acoustic context attention-based speech recognition systems benefit from during training and evaluation.
  • methods: Trains and evaluates dense-attention-based acoustic and language models with context lengths from 5 seconds to 1 hour, using a dataset of roughly 100,000 pseudo-labelled Spotify podcasts.
  • results: Training with around 80 seconds of acoustic context yields up to a 14.9% relative improvement over a limited-context baseline on the long-format Earnings-22 and Tedlium sets, and a beam-search combination with long-context transformer language models gives a fully long-context ASR system competitive with the state of the art.
    Abstract For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon, and under-investigated in literature. In this work, we examine the effect of scaling the sequence length used to train/evaluate (dense-attention based) acoustic and language models on speech recognition performance. For these experiments a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations on long-format datasets Earnings-22 and Tedlium demonstrate a benefit from training with around 80 seconds of acoustic context, showing up to a 14.9% relative improvement from a limited context baseline. Furthermore, we perform a system combination with long-context transformer language models via beam search for a fully long-context ASR system, with results that are competitive with the current state-of-the-art.

Expression Syntax Information Bottleneck for Math Word Problems

  • paper_url: http://arxiv.org/abs/2310.15664
  • repo_url: https://github.com/menik1126/math_esib
  • paper_authors: Jing Xiong, Chengming Li, Min Yang, Xiping Hu, Bin Hu
  • for: automatic solving of mathematical questions in texts
  • methods: Expression Syntax Information Bottleneck (ESIB) method based on variational information bottleneck, with self-distillation loss to improve generalization and generate more diverse expressions
  • results: state-of-the-art results and more diverse solutions on two large-scale benchmarks
    Abstract Math Word Problems (MWP) aims to automatically solve mathematical questions given in texts. Previous studies tend to design complex models to capture additional information in the original text so as to enable the model to gain more comprehensive features. In this paper, we turn our attention in the opposite direction, and work on how to discard redundant features containing spurious correlations for MWP. To this end, we design an Expression Syntax Information Bottleneck method for MWP (called ESIB) based on variational information bottleneck, which extracts essential features of expression syntax tree while filtering latent-specific redundancy containing syntax-irrelevant features. The key idea of ESIB is to encourage multiple models to predict the same expression syntax tree for different problem representations of the same problem by mutual learning so as to capture consistent information of expression syntax tree and discard latent-specific redundancy. To improve the generalization ability of the model and generate more diverse expressions, we design a self-distillation loss to encourage the model to rely more on the expression syntax information in the latent space. Experimental results on two large-scale benchmarks show that our model not only achieves state-of-the-art results but also generates more diverse solutions. The code is available.

CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation

  • paper_url: http://arxiv.org/abs/2310.15638
  • repo_url: https://github.com/salt-nlp/coannotating
  • paper_authors: Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F. Chen, Zhengyuan Liu, Diyi Yang
  • for: Proposing CoAnnotating, a novel paradigm for human-LLM co-annotation of unstructured text at scale.
  • methods: Uses the uncertainty of LLM annotations to estimate the LLM's annotation capability and to decide how annotation work is best allocated between humans and LLMs.
  • results: CoAnnotating allocates work effectively across different datasets, with up to 21% performance improvement over a random baseline.
    Abstract Annotated data plays a critical role in Natural Language Processing (NLP) in training models and evaluating their performance. Given recent developments in Large Language Models (LLMs), models such as ChatGPT demonstrate zero-shot capability on many text-annotation tasks, comparable with or even exceeding human annotators. Such LLMs can serve as alternatives for manual annotation, due to lower costs and higher scalability. However, limited work has leveraged LLMs as complementary annotators, nor explored how annotation work is best allocated among humans and LLMs to achieve both quality and cost objectives. We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale. Under this framework, we utilize uncertainty to estimate LLMs' annotation capability. Our empirical study shows CoAnnotating to be an effective means to allocate work from results on different datasets, with up to 21% performance improvement over random baseline. For code implementation, see https://github.com/SALT-NLP/CoAnnotating.
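A simplified sketch of uncertainty-guided allocation: sample several LLM annotations per instance, compute their entropy, and route uncertain instances to humans. The threshold value and the `annotate_k_times` helper are assumptions:

```python
import math
from collections import Counter

def annotation_entropy(labels):
    """Entropy of labels from repeated LLM annotations of one instance."""
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def allocate(instances, annotate_k_times, threshold=0.5):
    """Keep low-entropy (confident) instances with the LLM; route the rest to humans."""
    to_llm, to_human = [], []
    for x in instances:
        labels = annotate_k_times(x)   # e.g., k sampled LLM annotations
        (to_llm if annotation_entropy(labels) <= threshold else to_human).append(x)
    return to_llm, to_human
```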

Tips for making the most of 64-bit architectures in langage design, libraries or garbage collection

  • paper_url: http://arxiv.org/abs/2310.15632
  • repo_url: None
  • paper_authors: Benoît Sonntag, Dominique Colnet
  • for: Exploring the low-level programming possibilities of 64-bit processors, whose address registers far exceed the physical capacity of their bus, to improve both computation speed and memory usage.
  • methods: Develops three concrete examples of using the vacant bits of 64-bit registers: a multi-precision integer library, indexing of UTF-8 strings that ignores the physical size of each character, and an enhancement of the mark & sweep garbage collector's object-marking phase.
  • results: The examples show gains in computation speed and RAM savings.
    Abstract The 64-bit architectures that have become standard today offer unprecedented low-level programming possibilities. For the first time in the history of computing, the size of address registers far exceeds the physical capacity of their bus. After a brief reminder of the possibilities offered by the small size of addresses compared to the available 64 bits, we develop three concrete examples of how the vacant bits of these registers can be used. Among these examples, two concern the implementation of a library for a new statically typed programming language. Firstly, the implementation of multi-precision integers, with the aim of improving performance in terms of both calculation speed and RAM savings. The second example focuses on the library's handling of UTF-8 character strings. Here, the idea is to make indexing easier by ignoring the physical size of each UTF-8 character. Finally, the third example is a possible enhancement of garbage collectors, in particular the mark & sweep during the object marking phase.
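To make the idea concrete, here is a small Python simulation of the classic trick the paper builds on: stashing a tag in the upper bits of a 64-bit word whose low 48 bits hold the address. The constants and helper names are illustrative, not from the paper:

```python
# Current x86-64 / AArch64 hardware decodes only the low 48 bits of a
# virtual address, leaving the top 16 bits of a 64-bit word free to carry
# a tag (a small type id, a GC mark bit, ...) that is masked off before use.
ADDR_MASK = (1 << 48) - 1        # low 48 bits: the real address
TAG_SHIFT = 48

def tag_pointer(addr, tag):
    assert addr == addr & ADDR_MASK and 0 <= tag < (1 << 16)
    return (tag << TAG_SHIFT) | addr

def untag(word):
    return word >> TAG_SHIFT, word & ADDR_MASK

word = tag_pointer(0x7f00_dead_beef, tag=3)   # hypothetical heap address
assert untag(word) == (3, 0x7f00_dead_beef)
```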

Machine Translation for Nko: Tools, Corpora and Baseline Results

  • paper_url: http://arxiv.org/abs/2310.15612
  • repo_url: None
  • paper_authors: Moussa Koulako Bala Doumbouya, Baba Mamadi Diané, Solo Farabado Cissé, Djibrila Diané, Abdoulaye Sow, Séré Moussa Doumbouya, Daouda Bangoura, Fodé Moriba Bayo, Ibrahima Sory 2. Condé, Kalo Mory Diané, Chris Piech, Christopher Manning
  • for: Addressing the lack of usable machine translation systems for Nko, a language spoken by tens of millions of people across multiple West African countries.
  • methods: Developed a set of tools, resources, and baseline results aimed towards the development of usable machine translation systems for Nko and other languages that do not currently have sufficiently large parallel text corpora available.
  • results: Presented a novel collaborative parallel text curation software (Friallel), expanded the FLoRes-200 and NLLB-Seed corpora with high-quality Nko translations, and developed a collection of trilingual and bilingual corpora (nicolingua-0005) with over 3 million Nko words. The best model scored 30.83 English-Nko chrF++ on FLoRes-devtest.
    Abstract Currently, there is no usable machine translation system for Nko, a language spoken by tens of millions of people across multiple West African countries, which holds significant cultural and educational value. To address this issue, we present a set of tools, resources, and baseline results aimed towards the development of usable machine translation systems for Nko and other languages that do not currently have sufficiently large parallel text corpora available. (1) Friallel: A novel collaborative parallel text curation software that incorporates quality control through copyedit-based workflows. (2) Expansion of the FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko translations in parallel with 204 and 40 other languages. (3) nicolingua-0005: A collection of trilingual and bilingual corpora with 130,850 parallel segments and monolingual corpora containing over 3 million Nko words. (4) Baseline bilingual and multilingual neural machine translation results with the best model scoring 30.83 English-Nko chrF++ on FLoRes-devtest.

MUSER: A Multi-View Similar Case Retrieval Dataset

  • paper_url: http://arxiv.org/abs/2310.15602
  • repo_url: https://github.com/thulawtech/muser
  • paper_authors: Qingquan Li, Yiran Hu, Feng Yao, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, Weixing Shen
  • for: Promoting judicial fairness by supporting similar case retrieval (SCR), a representative legal AI application.
  • methods: Multi-view similarity measurement over three perspectives (legal fact, dispute focus, and law statutory) with a comprehensive, structured schema of sentence-level legal element annotations; the dataset, drawn from Chinese civil cases, contains 100 query cases and 4,024 candidate cases.
  • results: Incorporating legal elements improves the performance of similar case retrieval models, but further efforts are needed to address the remaining challenges posed by MUSER.
    Abstract Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness. However, existing SCR datasets only focus on the fact description section when judging the similarity between cases, ignoring other valuable sections (e.g., the court's opinion) that can provide insightful reasoning process behind. Furthermore, the case similarities are typically measured solely by the textual semantics of the fact descriptions, which may fail to capture the full complexity of legal cases from the perspective of legal knowledge. In this work, we present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal element with sentence-level legal element annotations. Specifically, we select three perspectives (legal fact, dispute focus, and law statutory) and build a comprehensive and structured label schema of legal elements for each of them, to enable accurate and knowledgeable evaluation of case similarities. The constructed dataset originates from Chinese civil cases and contains 100 query cases and 4,024 candidate cases. We implement several text classification algorithms for legal element prediction and various retrieval methods for retrieving similar cases on MUSER. The experimental results indicate that incorporating legal elements can benefit the performance of SCR models, but further efforts are still required to address the remaining challenges posed by MUSER. The source code and dataset are released at https://github.com/THUlawtech/MUSER.

ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts

  • paper_url: http://arxiv.org/abs/2310.15587
  • repo_url: https://github.com/dili-lab/scandl
  • paper_authors: Lena S. Bolliger, David R. Reich, Patrick Haller, Deborah N. Jakobi, Paul Prasse, Lena A. Jäger
  • for: Studying the cognitive mechanisms of human language processing and enabling the use of eye-movement data in language-related machine learning tasks when real gaze data is scarce or unavailable.
  • methods: ScanDL, a discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts; it leverages pre-trained word representations and jointly embeds the stimulus text and the fixation sequence to capture multi-modal interactions between the two inputs.
  • results: ScanDL significantly outperforms state-of-the-art scanpath generation methods in within- and across-dataset evaluations, and an extensive psycholinguistic analysis shows that it exhibits human-like reading behavior.
    Abstract Eye movements in reading play a crucial role in psycholinguistic research studying the cognitive mechanisms underlying human language processing. More recently, the tight coupling between eye movements and cognition has also been leveraged for language-related machine learning tasks such as the interpretability, enhancement, and pre-training of language models, as well as the inference of reader- and text-specific properties. However, scarcity of eye movement data and its unavailability at application time poses a major challenge for this line of research. Initially, this problem was tackled by resorting to cognitive models for synthesizing eye movement data. However, for the sole purpose of generating human-like scanpaths, purely data-driven machine-learning-based methods have proven to be more suitable. Following recent advances in adapting diffusion processes to discrete data, we propose ScanDL, a novel discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts. By leveraging pre-trained word representations and jointly embedding both the stimulus text and the fixation sequence, our model captures multi-modal interactions between the two inputs. We evaluate ScanDL within- and across-dataset and demonstrate that it significantly outperforms state-of-the-art scanpath generation methods. Finally, we provide an extensive psycholinguistic analysis that underlines the model's ability to exhibit human-like reading behavior. Our implementation is made available at https://github.com/DiLi-Lab/ScanDL.

Multimodal Representations for Teacher-Guided Compositional Visual Reasoning

  • paper_url: http://arxiv.org/abs/2310.15585
  • repo_url: None
  • paper_authors: Wafa Aissa, Marin Ferecatu, Michel Crucianu
  • for: Improving both the effectiveness and the interpretability of visual question answering models.
  • methods: Exploits features from a large-scale cross-modal encoder and introduces a training strategy with scheduled teacher guidance: the Neural Module Network is first fully guided by ground-truth intermediate outputs and gradually transitions to autonomous behavior, reducing the accumulation of prediction errors.
  • results: Incorporating cross-modal features and the improved training techniques yields a favorable balance between performance and transparency in the reasoning process.
    Abstract Neural Module Networks (NMN) are a compelling method for visual question answering, enabling the translation of a question into a program consisting of a series of reasoning sub-tasks that are sequentially executed on the image to produce an answer. NMNs provide enhanced explainability compared to integrated models, allowing for a better understanding of the underlying reasoning process. To improve the effectiveness of NMNs we propose to exploit features obtained by a large-scale cross-modal encoder. Also, the current training approach of NMNs relies on the propagation of module outputs to subsequent modules, leading to the accumulation of prediction errors and the generation of false answers. To mitigate this, we introduce an NMN learning strategy involving scheduled teacher guidance. Initially, the model is fully guided by the ground-truth intermediate outputs, but gradually transitions to an autonomous behavior as training progresses. This reduces error accumulation, thus improving training efficiency and final performance.We demonstrate that by incorporating cross-modal features and employing more effective training techniques for NMN, we achieve a favorable balance between performance and transparency in the reasoning process.
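The scheduled teacher guidance can be pictured as scheduled sampling over module outputs. A hedged sketch, where the module interface and the linear decay schedule are assumptions:

```python
import random

def run_program(modules, image_feats, teacher_outputs, p_teacher):
    """Execute a module program; with probability p_teacher, feed each module
    the ground-truth intermediate output instead of the previous prediction."""
    state = None                                  # first module takes no prior input
    for step, module in enumerate(modules):
        use_teacher = step > 0 and random.random() < p_teacher
        inp = teacher_outputs[step - 1] if use_teacher else state
        state = module(image_feats, inp)
    return state

def teacher_prob(epoch, total_epochs):
    # Fully guided at the start, fully autonomous at the end (linear decay).
    return max(0.0, 1.0 - epoch / total_epochs)
```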

POE: Process of Elimination for Multiple Choice Reasoning

  • paper_url: http://arxiv.org/abs/2310.15575
  • repo_url: https://github.com/kasmasvan/poe
  • paper_authors: Chenkai Ma, Xinya Du
  • for: Improving language models on multiple choice reasoning tasks.
  • methods: The Process of Elimination (POE), a two-step scoring method: each option is first scored and seemingly wrong options are eliminated; the eliminated options are then masked and the final prediction is made from the remaining ones.
  • results: Zero-shot experiments on 8 reasoning tasks demonstrate the effectiveness of POE, which proves especially strong on logical reasoning tasks; further analysis shows POE also applies to few-shot settings and to large language models (LLMs) such as ChatGPT.
    Abstract Language models (LMs) are capable of conducting in-context learning for multiple choice reasoning tasks, but the options in these tasks are treated equally. As humans often first eliminate wrong options before picking the final correct answer, we argue a similar two-step strategy can make LMs better at these tasks. To this end, we present the Process of Elimination (POE), a two-step scoring method. In the first step, POE scores each option, and eliminates seemingly wrong options. In the second step, POE masks these wrong options, and makes the final prediction from the remaining options. Zero-shot experiments on 8 reasoning tasks illustrate the effectiveness of POE, and a following analysis finds our method to be especially performant on logical reasoning tasks. We further analyze the effect of masks, and show that POE applies to few-shot settings and large language models (LLMs) like ChatGPT.
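POE's two-step scoring is straightforward to sketch. Here `score_option` stands in for an LM-based plausibility score (e.g., the option's log-likelihood given the question and the visible candidate set), and the number of surviving options is an assumed hyperparameter:

```python
import numpy as np

def poe_predict(score_option, question, options, keep_top=2):
    """Two-step Process of Elimination: score all options, keep the most
    plausible ones, then re-score with the eliminated options masked out."""
    first = np.array([score_option(question, o, options) for o in options])
    survivors = np.argsort(first)[-keep_top:]     # drop seemingly wrong options
    masked = [options[i] for i in survivors]      # reduced candidate set
    second = [score_option(question, options[i], masked) for i in survivors]
    return options[survivors[int(np.argmax(second))]]
```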

Natural Language Processing for Drug Discovery Knowledge Graphs: promises and pitfalls

  • paper_url: http://arxiv.org/abs/2310.15572
  • repo_url: None
  • paper_authors: J. Charles G. Jeynes, Tim James, Matthew Corney
  • for: Discussing the promises and pitfalls of using natural language processing (NLP) to mine unstructured text, typically from the scientific literature, as a data source for knowledge graphs (KGs) that aid drug discovery.
  • methods: Draws on the authors' experience of first parsing structured sources such as ChEMBL as the basis for a KG and then enriching or expanding it with NLP-extracted data.
  • results: NLP promises automated extraction from millions of documents, a task practically impossible through human curation alone, but NLP-KG pipelines carry many potential pitfalls, such as incorrect named entity recognition and ontology linking, which can ultimately lead to erroneous inferences and conclusions.
    Abstract Building and analysing knowledge graphs (KGs) to aid drug discovery is a topical area of research. A salient feature of KGs is their ability to combine many heterogeneous data sources in a format that facilitates discovering connections. The utility of KGs has been exemplified in areas such as drug repurposing, with insights made through manual exploration and modelling of the data. In this article, we discuss promises and pitfalls of using natural language processing (NLP) to mine unstructured text typically from scientific literature as a data source for KGs. This draws on our experience of initially parsing structured data sources such as ChEMBL as the basis for data within a KG, and then enriching or expanding upon them using NLP. The fundamental promise of NLP for KGs is the automated extraction of data from millions of documents a task practically impossible to do via human curation alone. However, there are many potential pitfalls in NLP-KG pipelines such as incorrect named entity recognition and ontology linking all of which could ultimately lead to erroneous inferences and conclusions.

Visually Grounded Continual Language Learning with Selective Specialization

  • paper_url: http://arxiv.org/abs/2310.15571
  • repo_url: None
  • paper_authors: Kyra Ahrens, Lennart Bengtson, Jae Hee Lee, Stefan Wermter
  • for: Providing an extensive analysis of selective specialization strategies for visually grounded continual language learning, i.e., controlling the trade-off between specializing in each task and building generalizable knowledge for transfer.
  • methods: Introduces two novel diagnostic datasets that offer enough control and flexibility for thorough model analysis, and assesses various heuristics for module specialization as well as quantifiable measures for two different model architectures.
  • results: The choice of specialization strategy matters, and conceptually simple approaches designed from the analysis outperform common continual learning baselines, pointing to the need to better align continual learning algorithms with the learning behaviors of individual model parts.
    Abstract A desirable trait of an artificial agent acting in the visual world is to continually learn a sequence of language-informed tasks while striking a balance between sufficiently specializing in each task and building a generalized knowledge for transfer. Selective specialization, i.e., a careful selection of model components to specialize in each task, is a strategy to provide control over this trade-off. However, the design of selection strategies requires insights on the role of each model component in learning rather specialized or generalizable representations, which poses a gap in current research. Thus, our aim with this work is to provide an extensive analysis of selection strategies for visually grounded continual language learning. Due to the lack of suitable benchmarks for this purpose, we introduce two novel diagnostic datasets that provide enough control and flexibility for a thorough model analysis. We assess various heuristics for module specialization strategies as well as quantifiable measures for two different types of model architectures. Finally, we design conceptually simple approaches based on our analysis that outperform common continual learning baselines. Our results demonstrate the need for further efforts towards better aligning continual learning algorithms with the learning behaviors of individual model parts.

MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain

  • paper_url: http://arxiv.org/abs/2310.15569
  • repo_url: None
  • paper_authors: Timo Pierre Schrader, Matteo Finco, Stefan Grünewald, Felix Hildebrand, Annemarie Friedrich
  • for: Providing a new dataset to support information extraction research in the materials science domain.
  • methods: MuLMS, a corpus of 50 open-access articles spanning seven sub-domains of materials science, annotated by domain experts on several layers ranging from named entities over relations to frame structures; competitive neural models are presented for all tasks.
  • results: Multi-task training with existing related resources leads to benefits in model performance.
    Abstract Keeping track of all relevant recent publications and experimental results for a research area is a challenging task. Prior work has demonstrated the efficacy of information extraction models in various scientific areas. Recently, several datasets have been released for the yet understudied materials science domain. However, these datasets focus on sub-problems such as parsing synthesis procedures or on sub-domains, e.g., solid oxide fuel cells. In this resource paper, we present MuLMS, a new dataset of 50 open-access articles, spanning seven sub-domains of materials science. The corpus has been annotated by domain experts with several layers ranging from named entities over relations to frame structures. We present competitive neural models for all tasks and demonstrate that multi-task training with existing related resources leads to benefits.

TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

  • paper_url: http://arxiv.org/abs/2310.15556
  • repo_url: None
  • paper_authors: Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, Yiming Qian
  • for: Mitigating the inference cost of deploying commercial retrieval-augmented large language models (LLMs) through a token compression scheme.
  • methods: Two compression methods: summarization compression, using a T5-based model fine-tuned on datasets generated with self-instruct containing samples of varying lengths, and semantic compression, which removes words with lower impact on the semantics.
  • results: On the proposed Food-Recommendation DB (FRDB) dataset, focused on food recommendations for women around the pregnancy period or infants, summarization compression reduces the retrieval token size by 65% with a further 0.3% accuracy improvement, while semantic compression offers a more flexible trade-off, cutting token size by 20% with only a 1.6% accuracy drop.
    Abstract Since ChatGPT released its API for public use, the number of applications built on top of commercial large language models (LLMs) has increased exponentially. One popular usage of such models is leveraging their in-context learning ability and generating responses given user queries, leveraging knowledge obtained by retrieval augmentation. One problem of deploying commercial retrieval-augmented LLMs is the cost due to the additionally retrieved context that largely increases the input token size of the LLMs. To mitigate this, we propose a token compression scheme that includes two methods: summarization compression and semantic compression. The first method applies a T5-based model that is fine-tuned on datasets generated using self-instruct, containing samples with varying lengths, and reduces token size through summarization. The second method further compresses the token size by removing words with lower impact on the semantics. In order to adequately evaluate the effectiveness of the proposed methods, we propose and utilize a dataset called Food-Recommendation DB (FRDB) focusing on food recommendation for women around the pregnancy period or for infants. Our summarization compression can reduce the retrieval token size by 65% with a further 0.3% improvement in accuracy; semantic compression provides a more flexible way to trade off token size against performance, reducing the token size by 20% with only a 1.6% accuracy drop.
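
As a rough sketch of the two compression modes (the paper fine-tunes a T5 model on self-instruct data; here a stock summarization checkpoint and an IDF-based word filter stand in as assumptions):

```python
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

summarizer = pipeline("summarization", model="t5-small")

def summarization_compress(context: str, max_len: int = 80) -> str:
    # Summarize the retrieved context before it enters the LLM prompt.
    return summarizer(context, max_length=max_len, min_length=10)[0]["summary_text"]

def semantic_compress(context: str, corpus: list, keep_ratio: float = 0.8) -> str:
    # Drop the words with the lowest corpus-level IDF weight, a crude
    # proxy for "lower impact on the semantics".
    vec = TfidfVectorizer().fit(corpus)
    idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
    words = context.split()
    ranked = sorted(words, key=lambda w: idf.get(w.lower(), 0.0), reverse=True)
    keep = set(ranked[: int(len(words) * keep_ratio)])
    return " ".join(w for w in words if w in keep)
```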

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

  • paper_url: http://arxiv.org/abs/2310.15552
  • repo_url: None
  • paper_authors: Sunit Bhattacharya, Ondrej Bojar
  • for: Examines the view that the feed-forward modules in Transformer models act as key-value memories, where keys learn input-specific patterns and values combine the outputs of these "memories" to predict the next token.
  • methods: Studies pretrained autoregressive Transformer models and validates the hypothesis using parallel corpora of languages the model was pretrained on.
  • results: Finds that the layers closest to the model's input and output behave in a more language-specific way, while the middle layers are more language-agnostic (shared across languages).
    Abstract Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then combine the output from the 'memories' of the keys to generate predictions about the next token. This leads to an incremental process of prediction that gradually converges towards the final token choice near the output layers. This interesting perspective raises questions about how multilingual models might leverage this mechanism. Specifically, for autoregressive models trained on two or more languages, do all neurons (across layers) respond equally to all languages? No! Our hypothesis centers around the notion that during pretraining, certain model parameters learn strong language-specific features, while others learn more language-agnostic (shared across languages) features. To validate this, we conduct experiments utilizing parallel corpora of two languages that the model was initially pretrained on. Our findings reveal that the layers closest to the network's input or output tend to exhibit more language-specific behaviour compared to the layers in the middle.
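
A minimal probe of this layer-wise pattern: compare mean hidden activations for parallel sentences in two languages, layer by layer. The encoder checkpoint and the cosine-similarity measure are stand-in assumptions (the paper studies autoregressive models):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"  # any multilingual checkpoint works for the sketch
tok, model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

@torch.no_grad()
def layer_means(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch, output_hidden_states=True).hidden_states
    return [h.mean(dim=(0, 1)) for h in hidden]  # one vector per layer

en = layer_means(["The cat sat on the mat.", "It is raining."])
de = layer_means(["Die Katze sass auf der Matte.", "Es regnet."])

for i, (a, b) in enumerate(zip(en, de)):
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {i:2d}: cross-language similarity = {sim:.3f}")
# Lower similarity near the input/output layers would indicate more
# language-specific behaviour, as the paper reports.
```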

Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary

  • paper_url: http://arxiv.org/abs/2310.15541
  • repo_url: None
  • paper_authors: Myeongjun Erik Jang, Thomas Lukasiewicz
  • for: Addresses the untrustworthy behaviour of pre-trained language models (PLMs), namely inconsistent predictions: generating different outputs for texts that convey the same meaning.
  • methods: Proposes a practical approach that strengthens PLMs' meaning awareness: based on conceptual role theory, the PLM learns precise meanings from word-definition pairs in a dictionary, and an efficient parameter-integration technique combines the learned conceptual relations with the PLM's pretrained knowledge by updating only a few additional parameters.
  • results: Experiments show the approach concurrently improves multiple types of consistency, integrates knowledge efficiently, and applies to other languages.
    Abstract The non-humanlike behaviour of contemporary pre-trained language models (PLMs) is a leading factor undermining their trustworthiness. A striking phenomenon of such faulty behaviours is the generation of inconsistent predictions, which produces logically contradictory results, such as generating different predictions for texts delivering the same meaning or violating logical properties. Previous studies exploited data augmentation or implemented specialised loss functions to alleviate the issue. However, their usage is limited, because they consume expensive training resources for large-sized PLMs and can only handle a certain consistency type. To this end, we propose a practical approach that alleviates the inconsistent behaviour issue by fundamentally improving PLMs' meaning awareness. Based on the conceptual role theory, our method allows PLMs to capture accurate meaning by learning precise interrelationships between concepts from word-definition pairs in a dictionary. Next, we propose an efficient parameter integration technique that updates only a few additional parameters to combine the learned interrelationship with PLMs' pre-trained knowledge. Our experimental results reveal that the approach can concurrently improve multiple types of consistency, enables efficient knowledge integration, and easily applies to other languages.
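
A sketch of the two ingredients: learning concept interrelations from word-definition pairs while updating only a small set of new parameters. The projection head and cosine loss are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased")
for p in plm.parameters():          # keep the pretrained knowledge frozen
    p.requires_grad = False

head = nn.Linear(768, 768)          # the only trainable "additional" parameters
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return plm(**batch).last_hidden_state[:, 0]   # [CLS] vectors

pairs = [("bank", "a financial institution that accepts deposits"),
         ("river", "a large natural stream of water")]
words, defs = zip(*pairs)
w, d = head(embed(list(words))), head(embed(list(defs)))
loss = 1 - nn.functional.cosine_similarity(w, d).mean()  # pull word toward definition
loss.backward()
opt.step()
```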

MarkQA: A large scale KBQA dataset with numerical reasoning

  • paper_url: http://arxiv.org/abs/2310.15517
  • repo_url: https://github.com/cdhx/markqa
  • paper_authors: Xiang Huang, Sitao Cheng, Yuheng Bao, Shanshan Huang, Yuzhong Qu
  • for: Pushes knowledge base question answering (KBQA) beyond factoid questions toward complex numerical reasoning.
  • methods: Proposes NR-KBQA, a new task requiring both multi-hop reasoning and numerical reasoning, and designs PyQL, a logic form in Python format that represents the reasoning process of numerical questions.
  • results: Experiments with several state-of-the-art QA methods on the MarkQA dataset show that complex numerical reasoning in KBQA remains highly challenging.
    Abstract While question answering over knowledge bases (KBQA) has shown progress in addressing factoid questions, KBQA with numerical reasoning remains relatively unexplored. In this paper, we focus on the complex numerical reasoning in KBQA and propose a new task, NR-KBQA, which necessitates the ability to perform both multi-hop reasoning and numerical reasoning. We design a logic form in Python format called PyQL to represent the reasoning process of numerical reasoning questions. To facilitate the development of NR-KBQA, we present a large dataset called MarkQA, which is automatically constructed from a small set of seeds. Each question in MarkQA is equipped with its corresponding SPARQL query, alongside the step-by-step reasoning process in the QDMR format and PyQL program. Experimental results of some state-of-the-art QA methods on the MarkQA show that complex numerical reasoning in KBQA faces great challenges.
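
The abstract does not spell out PyQL's full syntax, but a hypothetical program in its spirit, composing multi-hop knowledge-base lookups with arithmetic in plain Python, might look like this (`kb_lookup` and the population figures are invented for illustration):

```python
# Hypothetical PyQL-style program for:
# "Is the combined population of City A and City B larger than City C's?"

def kb_lookup(entity: str, relation: str) -> float:
    # Stub: in MarkQA each hop would be answered by the paired SPARQL query.
    return {"CityA": 2.1e6, "CityB": 1.4e6, "CityC": 3.0e6}[entity]

pop_a = kb_lookup("CityA", "population")   # hop 1
pop_b = kb_lookup("CityB", "population")   # hop 2
pop_c = kb_lookup("CityC", "population")   # hop 3
answer = (pop_a + pop_b) > pop_c           # the numerical reasoning step
print(answer)                              # True
```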

Fighting Fire with Fire: The Dual Role of LLMs in Crafting and Detecting Elusive Disinformation

  • paper_url: http://arxiv.org/abs/2310.15515
  • repo_url: None
  • paper_authors: Jason Lucas, Adaku Uchendu, Michiharu Yamashita, Jooyoung Lee, Shaurya Rohatgi, Dongwon Lee
  • for: Countering the misuse of large language models (LLMs) for generating large-scale harmful and misleading content.
  • methods: Proposes a "Fighting Fire with Fire" (F3) strategy that harnesses modern LLMs' generative and emergent reasoning capabilities against both human-written and LLM-generated disinformation.
  • results: In extensive experiments, GPT-3.5-turbo with zero-shot in-context semantic reasoning consistently reaches 68-72% accuracy on both in-distribution and out-of-distribution datasets, whereas customized and fine-tuned disinformation detectors decline.
    Abstract Recent ubiquity and disruptive impacts of large language models (LLMs) have raised concerns about their potential to be misused (i.e., generating large-scale harmful and misleading content). To combat this emerging risk of LLMs, we propose a novel "Fighting Fire with Fire" (F3) strategy that harnesses modern LLMs' generative and emergent reasoning capabilities to counter human-written and LLM-generated disinformation. First, we leverage GPT-3.5-turbo to synthesize authentic and deceptive LLM-generated content through paraphrase-based and perturbation-based prefix-style prompts, respectively. Second, we apply zero-shot in-context semantic reasoning techniques with cloze-style prompts to discern genuine from deceptive posts and news articles. In our extensive experiments, we observe GPT-3.5-turbo's zero-shot superiority for both in-distribution and out-of-distribution datasets, where GPT-3.5-turbo consistently achieved accuracy at 68-72%, unlike the decline observed in previous customized and fine-tuned disinformation detectors. Our codebase and dataset are available at https://github.com/mickeymst/F3.
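
A sketch of the zero-shot cloze-style detection step (the prompt wording is an assumption; the call uses the classic pre-1.0 `openai` chat interface):

```python
import openai  # assumes openai<1.0-style client and an API key in the environment

CLOZE = ("Read the following article and fill in the blank with "
         "'real' or 'fake'.\n\nArticle: {article}\n\nThis article is ____.")

def detect(article: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": CLOZE.format(article=article)}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip().lower()

# detect("Scientists confirmed today that ...")  ->  "real" or "fake"
```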

A Joint Matrix Factorization Analysis of Multilingual Representations

  • paper_url: http://arxiv.org/abs/2310.15513
  • repo_url: https://github.com/zsquaredz/joint_multilingual_analysis
  • paper_authors: Zheng Zhao, Yftah Ziser, Bonnie Webber, Shay B. Cohen
  • for: Analyzes the representations learned by multilingual pre-trained models and studies how they encode morphosyntactic information.
  • methods: The authors use joint matrix factorization as an alternative to probing to compare the latent representations of multilingual and monolingual models.
  • results: The authors find variations in the encoding of morphosyntactic information across upper and lower layers, with category-specific differences influenced by language properties. They also find strong associations between the factorization outputs and performance across different cross-lingual tasks.
    Abstract We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models. An alternative to probing, this tool allows us to analyze multiple sets of representations in a joint manner. Using this tool, we study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models. We conduct a large-scale empirical study of over 33 languages and 17 morphosyntactic categories. Our findings demonstrate variations in the encoding of morphosyntactic information across upper and lower layers, with category-specific differences influenced by language properties. Hierarchical clustering of the factorization outputs yields a tree structure that is related to phylogenetic trees manually crafted by linguists. Moreover, we find the factorization outputs exhibit strong associations with performance observed across different cross-lingual tasks. We release our code to facilitate future research.
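
The core operation, factorizing stacked representation matrices jointly so that two models share one set of latent loadings, can be sketched with an off-the-shelf NMF; the stacking scheme below is a simplification of the paper's formulation:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hidden states for the same N tokens from two models (e.g., multilingual
# vs. monolingual), each of dimension d; values made non-negative for NMF.
H_multi = np.abs(np.random.randn(1000, 768))
H_mono = np.abs(np.random.randn(1000, 768))

X = np.concatenate([H_multi, H_mono], axis=1)   # joint matrix: N x 2d
model = NMF(n_components=32, init="nndsvda", max_iter=500)
W = model.fit_transform(X)                      # shared token loadings: N x k
U_multi, U_mono = np.split(model.components_, 2, axis=1)  # per-model bases

# Comparing U_multi and U_mono (e.g., by clustering) reveals which latent
# factors, such as morphosyntactic categories, the two models share.
```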

TRAMS: Training-free Memory Selection for Long-range Language Modeling

  • paper_url: http://arxiv.org/abs/2310.15494
  • repo_url: https://github.com/lwaekfjlk/trams
  • paper_authors: Haofei Yu, Cunxiang wang, Yue Zhang, Wei Bi
  • for: Improving the long-range language-modeling performance of the Transformer architecture.
  • methods: Proposes a simple metric for selecting which memory tokens participate in the attention computation, improving long-range modeling without any training.
  • results: Tested on the word-level benchmark WikiText-103 and the character-level benchmark enwik8, the method improves performance without additional training or extra parameters.
    Abstract The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Though several specific transformer architectures have been designed to tackle issues of long-range dependencies, existing methods like Transformer-XL are plagued by a high percentage of ineffective memories. In this study, we present a plug-and-play strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens participating in attention calculation based on one simple metric. This strategy allows us to keep tokens that are likely to have a high attention score with the current queries and ignore the other ones. We have tested our approach on the word-level benchmark (WikiText-103) and the character-level benchmark (enwik8), and the results indicate an improvement without having additional training or adding additional parameters.
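
A minimal sketch of training-free memory selection: score each cached key with a cheap metric and let only the top-scoring memories enter the attention computation. The key-norm metric is a plausible stand-in, not necessarily the paper's exact metric:

```python
import torch

def select_memories(keys: torch.Tensor, values: torch.Tensor, k: int):
    """keys/values: (mem_len, d). Keep the k memories most likely to
    attract high attention, scored here by key L2 norm (no training)."""
    scores = keys.norm(dim=-1)                  # one simple metric per token
    idx = scores.topk(k).indices.sort().values  # preserve temporal order
    return keys[idx], values[idx]

# Usage inside a Transformer-XL-style layer: shrink the memory before
# computing attention against the current queries.
mem_k, mem_v = torch.randn(512, 64), torch.randn(512, 64)
sel_k, sel_v = select_memories(mem_k, mem_v, k=128)
```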

CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model

  • paper_url: http://arxiv.org/abs/2310.15477
  • repo_url: https://github.com/TsinghuaC3I/CRaSh
  • paper_authors: Kaiyan Zhang, Ning Ding, Biqing Qi, Xuekai Zhu, Xinwei Long, Bowen Zhou
  • for: Investigates how Offsite-Tuning (OFT) can improve the generalization of centralized large language models (LLMs) while preserving the privacy of private instruction data.
  • methods: Conducts an empirical analysis of LLM layers and representations, and proposes CRaSh (Clustering, Removing, and Sharing), a training-free strategy, to study the properties of OFT.
  • results: Finds that LLMs exhibit a modular layer structure, with subtle changes in representations and intermediate predictions across layers; CRaSh substantially boosts OFT performance, and a loss-landscape analysis shows the resulting optima are linearly connected within the same basin.
    Abstract Instruction tuning has recently been recognized as an effective way of aligning Large Language Models (LLMs) to enhance their generalization ability across various tasks. However, when tuning publicly accessible, centralized LLMs with private instruction data, privacy concerns are inevitable. While direct transfer of parameterized modules between models is a plausible approach to address this, its implications and effectiveness need further exploration. This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators. Given the limited understanding of the underlying mechanism of OFT, we perform an empirical analysis on LLMs from the perspectives of representation and functional similarity. Interestingly, our findings reveal a unique modular structure within the layers of LLMs that appears to emerge as the model size expands. Simultaneously, we note subtle but potentially significant changes in representation and intermediate predictions across the layers. Inspired by these observations, we propose CRaSh, involving Clustering, Removing, and Sharing, a training-free strategy to derive improved emulators from LLMs. CRaSh significantly boosts performance of OFT with billions of parameters. Furthermore, we investigate the optimal solutions yielded by fine-tuning with and without full model through the lens of loss landscape. Our findings demonstrate a linear connectivity among these optima falling over the same basin, thereby highlighting the effectiveness of CRaSh and OFT. The source code is publicly available at https://github.com/TsinghuaC3I/CRaSh.
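
The three CRaSh steps can be sketched on layer-level representations: cluster similar layers, remove redundant ones, and share a representative in their place. The similarity measure and clustering method are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# layer_reps[i]: mean hidden state of layer i over a probe set, shape (d,)
layer_reps = np.random.randn(32, 4096)           # e.g., a 32-layer LLM

# 1) Clustering: group layers with similar representations.
Z = linkage(layer_reps, method="average", metric="cosine")
labels = fcluster(Z, t=8, criterion="maxclust")  # e.g., 8 layer clusters

# 2) Removing + 3) Sharing: keep one representative layer per cluster;
# the emulator reuses (shares) it in place of the removed neighbours.
kept = sorted({int(np.where(labels == c)[0][0]) for c in set(labels)})
print("emulator layers:", kept)   # indices of transformer blocks to retain
```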

Continual Event Extraction with Semantic Confusion Rectification

  • paper_url: http://arxiv.org/abs/2310.15470
  • repo_url: https://github.com/nju-websoft/SCR
  • paper_authors: Zitao Wang, Xinyi Wang, Wei Hu
  • for: Continually extracting information about newly emerging event types while avoiding forgetting.
  • methods: Proposes a continual event extraction model with semantic confusion rectification: pseudo labels are marked for each sentence to alleviate semantic confusion, pivotal knowledge is transferred between the current and previous models to enhance the understanding of event types, and other associated types are leveraged to focus the model on the semantics of long-tailed event types.
  • results: The model outperforms baseline models and is proficient on imbalanced datasets.
    Abstract We study continual event extraction, which aims to extract incessantly emerging event information while avoiding forgetting. We observe that the semantic confusion on event types stems from the annotations of the same text being updated over time. The imbalance between event types even aggravates this issue. This paper proposes a novel continual event extraction model with semantic confusion rectification. We mark pseudo labels for each sentence to alleviate semantic confusion. We transfer pivotal knowledge between current and previous models to enhance the understanding of event types. Moreover, we encourage the model to focus on the semantics of long-tailed event types by leveraging other associated types. Experimental results show that our model outperforms state-of-the-art baselines and is proficient in imbalanced datasets.
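
A compact sketch of the rectification idea: the previous model's logits act as pseudo supervision for old event types while the current model learns the new ones (the loss weighting is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def rectified_loss(cur_logits, prev_logits, gold, n_old_types, alpha=0.5):
    """cur_logits/prev_logits: (N, n_types); gold labels cover new types.
    Pseudo-labels from the previous model keep old types from drifting."""
    ce = F.cross_entropy(cur_logits, gold)
    kd = F.kl_div(                                  # distill old-type semantics
        F.log_softmax(cur_logits[:, :n_old_types], dim=-1),
        F.softmax(prev_logits[:, :n_old_types], dim=-1),
        reduction="batchmean",
    )
    return ce + alpha * kd
```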

The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks

  • paper_url: http://arxiv.org/abs/2310.15469
  • repo_url: None
  • paper_authors: Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, XiaoFeng Wang, Haixu Tang
  • for: Investigates whether large language models (LLMs) can leak personal identifiable information (PII) embedded in their training data.
  • methods: Uses OpenAI's GPT-3.5 fine-tuning interface and constructs a PII-association task to probe whether an LLM can be made to reveal private information.
  • results: Fine-tuning on a minuscule PII dataset can shift an LLM from refusing PII extraction to divulging a substantial proportion of concealed PII; the authors name this the Janus attack.
    Abstract The era post-2018 marked the advent of Large Language Models (LLMs), with innovations such as OpenAI's ChatGPT showcasing prodigious linguistic prowess. As the industry galloped toward augmenting model parameters and capitalizing on vast swaths of human language data, security and privacy challenges also emerged. Foremost among these is the potential inadvertent accrual of Personal Identifiable Information (PII) during web-based data acquisition, posing risks of unintended PII disclosure. While strategies like RLHF during training and Catastrophic Forgetting have been marshaled to control the risk of privacy infringements, recent advancements in LLMs, epitomized by OpenAI's fine-tuning interface for GPT-3.5, have reignited concerns. One may ask: can the fine-tuning of LLMs precipitate the leakage of personal information embedded within training datasets? This paper reports the first endeavor to seek the answer to the question, particularly our discovery of a new LLM exploitation avenue, called the Janus attack. In the attack, one can construct a PII association task, whereby an LLM is fine-tuned using a minuscule PII dataset, to potentially reinstate and reveal concealed PIIs. Our findings indicate that, with a trivial fine-tuning outlay, LLMs such as GPT-3.5 can transition from being impermeable to PII extraction to a state where they divulge a substantial proportion of concealed PII. This research, through its deep dive into the Janus attack vector, underscores the imperative of navigating the intricate interplay between LLM utility and privacy preservation.

Interpreting Answers to Yes-No Questions in User-Generated Content

  • paper_url: http://arxiv.org/abs/2310.15464
  • repo_url: None
  • paper_authors: Shivam Mathur, Keun Hee Park, Dhivya Chinnappa, Saketh Kotamraju, Eduardo Blanco
  • for: Interpreting answers to yes-no questions in social media.
  • methods: Presents 4,442 yes-no question-answer pairs from Twitter and analyzes the linguistic characteristics of answers interpretable as yes or no, as well as answers whose interpretation is unknown.
  • results: Large language models are far from solving the problem, even after fine-tuning and blending other corpora for the same task outside social media.
    Abstract Interpreting answers to yes-no questions in social media is difficult. Yes and no keywords are uncommon, and the few answers that include them are rarely to be interpreted as what the keywords suggest. In this paper, we present a new corpus of 4,442 yes-no question-answer pairs from Twitter. We discuss linguistic characteristics of answers whose interpretation is yes or no, as well as answers whose interpretation is unknown. We show that large language models are far from solving this problem, even after fine-tuning and blending other corpora for the same problem but outside social media.

Facilitating Self-Guided Mental Health Interventions Through Human-Language Model Interaction: A Case Study of Cognitive Restructuring

  • paper_url: http://arxiv.org/abs/2310.15461
  • repo_url: None
  • paper_authors: Ashish Sharma, Kevin Rushton, Inna Wanyin Lin, Theresa Nguyen, Tim Althoff
  • for: Explores how human-language model interaction can support self-guided mental health interventions.
  • methods: Takes cognitive restructuring, an evidence-based therapeutic technique for overcoming negative thinking, as a case study; in an IRB-approved randomized field study on a large mental health website with 15,531 participants, designs and evaluates a system that uses language models to support people through the steps of cognitive restructuring.
  • results: The system positively impacts emotional intensity for 67% of participants and helps 65% overcome negative thoughts; although adolescents report relatively worse outcomes, tailored interventions that simplify language model generations improve overall effectiveness and equity.
    Abstract Self-guided mental health interventions, such as "do-it-yourself" tools to learn and practice coping strategies, show great promise to improve access to mental health care. However, these interventions are often cognitively demanding and emotionally triggering, creating accessibility barriers that limit their wide-scale implementation and adoption. In this paper, we study how human-language model interaction can support self-guided mental health interventions. We take cognitive restructuring, an evidence-based therapeutic technique to overcome negative thinking, as a case study. In an IRB-approved randomized field study on a large mental health website with 15,531 participants, we design and evaluate a system that uses language models to support people through various steps of cognitive restructuring. Our findings reveal that our system positively impacts emotional intensity for 67% of participants and helps 65% overcome negative thoughts. Although adolescents report relatively worse outcomes, we find that tailored interventions that simplify language model generations improve overall effectiveness and equity.

K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific Ratings

  • paper_url: http://arxiv.org/abs/2310.15439
  • repo_url: https://github.com/ssu-humane/k-haters
  • paper_authors: Chaewon Park, Soohwan Kim, Kyubyong Park, Kunwoo Park
  • for: Develops a new corpus for Korean hate speech detection to improve the accuracy and reliability of existing detection models.
  • methods: Annotates roughly 192K Korean news comments with target-specific offensiveness ratings, and adopts the Cognitive Reflection Test (CRT), widely used in social science, to assess annotation quality.
  • results: Annotations from individuals with the lowest test scores tend to yield detection models that make biased predictions toward specific target groups and are less accurate; the study contributes a large-scale Korean hate speech corpus to NLP research.
    Abstract Numerous datasets have been proposed to combat the spread of online hate. Despite these efforts, a majority of these resources are English-centric, primarily focusing on overt forms of hate. This research gap calls for developing high-quality corpora in diverse languages that also encapsulate more subtle hate expressions. This study introduces K-HATERS, a new corpus for hate speech detection in Korean, comprising approximately 192K news comments with target-specific offensiveness ratings. This resource is the largest offensive language corpus in Korean and is the first to offer target-specific ratings on a three-point Likert scale, enabling the detection of hate expressions in Korean across varying degrees of offensiveness. We conduct experiments showing the effectiveness of the proposed corpus, including a comparison with existing datasets. Additionally, to address potential noise and bias in human annotations, we explore a novel idea of adopting the Cognitive Reflection Test, which is widely used in social science for assessing an individual's cognitive ability, as a proxy of labeling quality. Findings indicate that annotations from individuals with the lowest test scores tend to yield detection models that make biased predictions toward specific target groups and are less accurate. This study contributes to the NLP research on hate speech detection and resource construction. The code and dataset can be accessed at https://github.com/ssu-humane/K-HATERS.
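
The CRT-as-quality-proxy idea reduces to a simple filter over the annotation table before training; the column names and threshold below are hypothetical:

```python
import pandas as pd

# Hypothetical annotation table: one row per (comment, annotator) rating.
df = pd.DataFrame({
    "comment_id": [1, 1, 2, 2],
    "annotator_crt": [3, 0, 2, 1],   # Cognitive Reflection Test score, 0-3
    "offensiveness": [2, 0, 1, 1],   # three-point Likert rating
})

# Keep only ratings from annotators above a CRT threshold, since low-CRT
# annotations tended to yield biased, less accurate detection models.
clean = df[df["annotator_crt"] >= 2]
```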

Leveraging Large Language Models for Enhanced Product Descriptions in eCommerce

  • paper_url: http://arxiv.org/abs/2310.18357
  • repo_url: None
  • paper_authors: Jianghong Zhou, Bo Liu, Jhalak Nilesh Acharya Yao Hong, Kuang-chih Lee, Musen Wen
  • for: Enhancing eCommerce search visibility and customer engagement, and ultimately sales and customer satisfaction.
  • methods: Uses the LLAMA 2.0 7B language model to automatically generate product descriptions, fine-tuning it for domain-specific language features and eCommerce nuances to increase its utility for sales and user engagement.
  • results: The system reduces the human workload while improving search visibility and customer clicks; effectiveness is validated with multiple evaluation metrics, including NDCG, customer click-through rates, and human assessments.
    Abstract In the dynamic field of eCommerce, the quality and comprehensiveness of product descriptions are pivotal for enhancing search visibility and customer engagement. Effective product descriptions can address the 'cold start' problem, align with market trends, and ultimately lead to increased click-through rates. Traditional methods for crafting these descriptions often involve significant human effort and may lack both consistency and scalability. This paper introduces a novel methodology for automating product description generation using the LLAMA 2.0 7B language model. We train the model on a dataset of authentic product descriptions from Walmart, one of the largest eCommerce platforms. The model is then fine-tuned for domain-specific language features and eCommerce nuances to enhance its utility in sales and user engagement. We employ multiple evaluation metrics, including NDCG, customer click-through rates, and human assessments, to validate the effectiveness of our approach. Our findings reveal that the system is not only scalable but also significantly reduces the human workload involved in creating product descriptions. This study underscores the considerable potential of large language models like LLAMA 2.0 7B in automating and optimizing various facets of eCommerce platforms, offering significant business impact, including improved search functionality and increased sales.
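
A sketch of the generation step with an instruction-tuned LLaMA-2 checkpoint (the checkpoint is gated and the prompt format is an assumption; the paper additionally fine-tunes on authentic Walmart descriptions):

```python
from transformers import pipeline

# Gated checkpoint: requires accepting the Meta license on the Hub.
gen = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

prompt = ("Write a concise, search-friendly product description.\n"
          "Product: stainless steel water bottle, 750 ml, vacuum insulated.\n"
          "Description:")
out = gen(prompt, max_new_tokens=120, do_sample=True)[0]["generated_text"]
print(out)
```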

What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations

  • paper_url: http://arxiv.org/abs/2310.15431
  • repo_url: None
  • paper_authors: Kavel Rao, Liwei Jiang, Valentina Pyatkin, Yuling Gu, Niket Tandon, Nouha Dziri, Faeze Brahman, Yejin Choi
  • for: Proposes defeasible moral reasoning: providing grounded contexts that make an action more or less morally acceptable in real-life scenarios, together with commonsense rationales that justify the reasoning.
  • methods: Uses an iterative self-distillation approach that starts from a small amount of unstructured seed knowledge from GPT-3 and alternates between self-distillation from student models, targeted filtering with a critic model trained on human judgments (to boost validity) and NLI (to boost diversity), and self-imitation learning (to amplify data quality).
  • results: Yields δ-Rules-of-Thumb (delta-RoT), a high-quality dataset of 1.2M contextualizations and rationales for 115K defeasible moral actions, rated highly by human annotators 85.9% to 99.8% of the time; the final student model trained on it beats all intermediate student models by a notable margin.
    Abstract Moral or ethical judgments rely heavily on the specific contexts in which they occur. Understanding varying shades of defeasible contextualizations (i.e., additional information that strengthens or attenuates the moral acceptability of an action) is critical to accurately represent the subtlety and intricacy of grounded human moral judgment in real-life scenarios. We introduce defeasible moral reasoning: a task to provide grounded contexts that make an action more or less morally acceptable, along with commonsense rationales that justify the reasoning. To elicit high-quality task data, we take an iterative self-distillation approach that starts from a small amount of unstructured seed knowledge from GPT-3 and then alternates between (1) self-distillation from student models; (2) targeted filtering with a critic model trained by human judgment (to boost validity) and NLI (to boost diversity); (3) self-imitation learning (to amplify the desired data quality). This process yields a student model that produces defeasible contexts with improved validity, diversity, and defeasibility. From this model we distill a high-quality dataset, \delta-Rules-of-Thumb, of 1.2M entries of contextualizations and rationales for 115K defeasible moral actions rated highly by human annotators 85.9% to 99.8% of the time. Using \delta-RoT we obtain a final student model that wins over all intermediate student models by a notable margin.
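
The iterative loop can be summarized in pseudocode; every helper named here (`fine_tune`, `generate`, `critic_score`, `nli_is_novel`) is a hypothetical stand-in for the paper's components:

```python
def iterative_self_distillation(base_model, seed_data, rounds=3):
    student, data = base_model, list(seed_data)
    for _ in range(rounds):
        student = fine_tune(student, data)            # (3) self-imitation learning
        candidates = generate(student, n=100_000)     # (1) self-distillation
        data += [ex for ex in candidates
                 if critic_score(ex) > 0.5            # (2a) validity filter (critic)
                 and nli_is_novel(ex, data)]          # (2b) diversity filter (NLI)
    return student, data
```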

Beyond Sentiment: Leveraging Topic Metrics for Political Stance Classification

  • paper_url: http://arxiv.org/abs/2310.15429
  • repo_url: None
  • paper_authors: Weihong Qi
  • for: Proposes topic metrics as an alternative and complement to sentiment analysis, to more accurately reflect the latent structures and political stances within texts.
  • methods: Uses three datasets identified by Bestvater and Monroe (2023), extracts coherent topics with BERTopic, and converts the topics into dummy variables used as features for stance classification.
  • results: BERTopic improves coherence scores by 17.07% to 54.20% compared to traditional methods such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), and topic metrics outperform sentiment metrics in stance classification, improving performance by as much as 18.95%.
    Abstract Sentiment analysis, widely critiqued for capturing merely the overall tone of a corpus, falls short in accurately reflecting the latent structures and political stances within texts. This study introduces topic metrics, dummy variables converted from extracted topics, as both an alternative and complement to sentiment metrics in stance classification. By employing three datasets identified by Bestvater and Monroe (2023), this study demonstrates BERTopic's proficiency in extracting coherent topics and the effectiveness of topic metrics in stance classification. The experiment results show that BERTopic improves coherence scores by 17.07% to 54.20% when compared to traditional approaches such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), prevalent in earlier political science research. Additionally, our results indicate topic metrics outperform sentiment metrics in stance classification, increasing performance by as much as 18.95%. Our findings suggest topic metrics are especially effective for context-rich texts and corpora where stance and sentiment correlations are weak. The combination of sentiment and topic metrics achieves optimal performance in most scenarios and can further address the limitations of relying solely on sentiment as well as the low coherence score of topic metrics.
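
In practice the pipeline is short: extract topics with BERTopic, one-hot them into dummy variables (the "topic metrics"), and feed those to a stance classifier. The classifier choice is illustrative, and `docs`/`stance` stand for one of the labeled datasets:

```python
import pandas as pd
from bertopic import BERTopic
from sklearn.linear_model import LogisticRegression

def stance_with_topic_metrics(docs, stance):
    """docs: list of texts; stance: their 0/1 stance labels."""
    topics, _ = BERTopic().fit_transform(docs)             # coherent topics
    X = pd.get_dummies(pd.Series(topics), prefix="topic")  # topic dummy variables
    return LogisticRegression(max_iter=1000).fit(X, stance)
```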

The Mason-Alberta Phonetic Segmenter: A forced alignment system based on deep neural networks and interpolation

  • paper_url: http://arxiv.org/abs/2310.15425
  • repo_url: https://github.com/masonphonlab/maps_paper_code
  • paper_authors: Matthew C. Kelley, Scott James Perry, Benjamin V. Tucker
  • for: Presents a new neural-network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS), as a testbed for two possible improvements to forced alignment.
  • methods: Treats the acoustic model as a tagging task rather than a classification task, motivated by the understanding that speech segments are not truly discrete and commonly overlap, and uses an interpolation technique to allow boundaries more precise than the common 10 ms limit.
  • results: Compared with the Montreal Forced Aligner, the interpolation variant places 27.92% more boundaries within 10 ms of the target on the test set, whereas the tagging approach does not generally improve results; the authors also discuss the mismatch between acoustic-model output targets and phoneticians' conception of similarity between phones, suggesting the task or the segmentation of speech itself may need rethinking.
    Abstract Forced alignment systems automatically determine boundaries between segments in speech data, given an orthographic transcription. These tools are commonplace in phonetics to facilitate the use of speech data that would be infeasible to manually transcribe and segment. In the present paper, we describe a new neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model in a forced aligner as a tagging task, rather than a classification task, motivated by the common understanding that segments in speech are not truly discrete and commonly overlap. The second is an interpolation technique to allow boundaries more precise than the common 10 ms limit in modern forced alignment systems. We compare configurations of our system to a state-of-the-art system, the Montreal Forced Aligner. The tagging approach did not generally yield improved results over the Montreal Forced Aligner. However, a system with the interpolation technique had a 27.92% increase relative to the Montreal Forced Aligner in the amount of boundaries within 10 ms of the target on the test set. We also reflect on the task and training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones and that reconciliation of this tension may require rethinking the task and output targets or how speech itself should be segmented.
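
The interpolation idea can be sketched directly: fit a parabola through the frame-level boundary scores around the best frame and take its vertex as a sub-frame boundary, finer than the 10 ms frame step (the scoring itself is assumed given):

```python
import numpy as np

def refine_boundary(scores: np.ndarray, frame_ms: float = 10.0) -> float:
    """scores[i]: boundary evidence at frame i. Quadratic interpolation
    around the argmax yields a boundary finer than the frame step."""
    i = int(np.argmax(scores))
    offset = 0.0
    if 0 < i < len(scores) - 1:
        a, b, c = scores[i - 1], scores[i], scores[i + 1]
        denom = a - 2 * b + c
        if denom != 0:
            offset = 0.5 * (a - c) / denom   # vertex of the fitted parabola
    return (i + offset) * frame_ms           # boundary time in milliseconds

print(refine_boundary(np.array([0.1, 0.3, 0.9, 0.5, 0.2])))  # ~21 ms, between frames
```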

Let the Pretrained Language Models “Imagine” for Short Texts Topic Modeling

  • paper_url: http://arxiv.org/abs/2310.15420
  • repo_url: None
  • paper_authors: Pritom Saha Akash, Jie Huang, Kevin Chen-Chuan Chang
  • for: Discovering latent semantics in short documents by addressing the data-sparsity issue in short-text topic modeling.
  • methods: Extends short texts into longer sequences using existing pre-trained language models (PLMs), and provides a simple extension of a neural topic model that reduces the effect of noisy out-of-topic text generated by the PLMs.
  • results: Extensive experiments on multiple real-world datasets under extreme data-sparsity scenarios show the model substantially improves short-text topic modeling, generating high-quality topics that outperform state-of-the-art models.
    Abstract Topic models are one of the compelling methods for discovering latent semantics in a document collection. However, they assume that a document has sufficient co-occurrence information to be effective. In short texts, however, co-occurrence information is minimal, which results in feature sparsity in document representations. Therefore, existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics. In this paper, we take a new approach to short-text topic modeling to address the data-sparsity issue by extending short texts into longer sequences using existing pre-trained language models (PLMs). In addition, we provide a simple solution extending a neural topic model to reduce the effect of noisy out-of-topic text generated by PLMs. We observe that our model can substantially improve the performance of short-text topic modeling. Extensive experiments on multiple real-world datasets under extreme data sparsity scenarios show that our models can generate high-quality topics, outperforming state-of-the-art models.
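
A sketch of the expansion step, letting a PLM "imagine" a continuation of each short text before topic modeling; the GPT-2 checkpoint and generation settings are assumptions:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def expand(short_text: str, max_new: int = 40) -> str:
    # Extend the sparse short text into a longer sequence; the topic model
    # later downweights noisy off-topic continuations.
    out = generator(short_text, max_new_tokens=max_new, do_sample=True,
                    num_return_sequences=1)
    return out[0]["generated_text"]

docs = ["battery life too short", "great camera"]
expanded = [expand(d) for d in docs]   # fed to the neural topic model
```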