cs.CL - 2023-10-22

Domain Terminology Integration into Machine Translation: Leveraging Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14451
  • repo_url: None
  • paper_authors: Yasmin Moslem, Gianfranco Romani, Mahdi Molaei, Rejwanul Haque, John D. Kelleher, Andy Way
  • for: Improving the accuracy of machine translation (MT) to support better communication and understanding in specialised domains.
  • methods: Two experiments with large language models (LLMs): generating synthetic bilingual terminology-based data, and automatically post-editing translations produced by an MT model to incorporate pre-approved terms.
  • results: The proposed approach effectively integrates pre-approved terms into translations, raising the term-incorporation rate from 36.67% to 72.88%.
    Abstract This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms, ultimately enhancing communication and understanding in specialised domains. To this end, we conduct experiments that utilise large language models (LLMs) for two purposes: generating synthetic bilingual terminology-based data, and post-editing translations generated by an MT model through incorporating pre-approved terms. Our system employs a four-step process: (i) using an LLM to generate bilingual synthetic data based on the provided terminology, (ii) fine-tuning a generic encoder-decoder MT model, with a mix of the terminology-based synthetic data generated in the first step and a randomly sampled portion of the original generic training data, (iii) generating translations with the fine-tuned MT model, and (iv) finally, leveraging an LLM for terminology-constrained automatic post-editing of the translations that do not include the required terms. The results demonstrate the effectiveness of our proposed approach in improving the integration of pre-approved terms into translations. The number of terms incorporated into the translations of the blind dataset increases from an average of 36.67% with the generic model to an average of 72.88% by the end of the process. In other words, successful utilisation of terms nearly doubles across the three language pairs.
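The fourth step, terminology-constrained automatic post-editing, only invokes the LLM for sentences that miss required terms. A minimal sketch of that gating logic follows; the `call_llm` callable and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
def missing_terms(hypothesis: str, term_pairs: dict) -> dict:
    """Return the source->target term pairs whose target side is absent
    from the MT hypothesis (case-insensitive substring match)."""
    hyp = hypothesis.lower()
    return {src: tgt for src, tgt in term_pairs.items() if tgt.lower() not in hyp}

def post_edit(source: str, hypothesis: str, term_pairs: dict, call_llm) -> str:
    """Step (iv): ask an LLM to revise the translation only when
    pre-approved terms are missing, leaving other sentences untouched."""
    missing = missing_terms(hypothesis, term_pairs)
    if not missing:
        return hypothesis  # all required terms already present
    constraints = "; ".join(f'"{s}" -> "{t}"' for s, t in missing.items())
    prompt = (
        "Post-edit the translation so that it uses the required terminology.\n"
        f"Source: {source}\nTranslation: {hypothesis}\n"
        f"Required terms: {constraints}\nRevised translation:"
    )
    return call_llm(prompt)
```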

TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings

  • paper_url: http://arxiv.org/abs/2310.14450
  • repo_url: https://github.com/hanshanley/tata
  • paper_authors: Hans W. A. Hanley, Zakir Durumeric
  • for: Building a generalizable stance detection model that remains accurate on topics unseen during training.
  • methods: Contrastive learning on an unlabeled dataset of news articles covering a variety of topics, used to train topic-agnostic (TAG) and topic-aware (TAW) embeddings for downstream stance detection.
  • results: Combining the two embeddings, the authors achieve state-of-the-art performance on several public stance detection datasets (0.771 $F_1$-score on the zero-shot VAST dataset).
    Abstract Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage's stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variety of different topics to train topic-agnostic/TAG and topic-aware/TAW embeddings for use in downstream stance detection. Combining these embeddings in our full TATA model, we achieve state-of-the-art performance across several public stance detection datasets (0.771 $F_1$-score on the Zero-shot VAST dataset). We release our code and data at https://github.com/hanshanley/tata.
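The abstract does not spell out the contrastive objective used to train the TAG/TAW embeddings; a common choice is in-batch InfoNCE, sketched below, where paired views of a text (for instance, two passages drawn from the same unlabeled news article) act as positives and the rest of the batch as negatives. The pairing strategy is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """In-batch InfoNCE: row i of `anchor` should match row i of `positive`;
    every other row in the batch serves as a negative."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                 # (B, B) similarity logits
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```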

Text generation for dataset augmentation in security classification tasks

  • paper_url: http://arxiv.org/abs/2310.14429
  • repo_url: https://github.com/wenliangdai/multi-task-offensive-language-detection
  • paper_authors: Alexander P. Welsh, Matthew Edwards
  • for: Addressing the shortage of training data for classifiers in the security domain.
  • methods: Using natural language text generators to augment training data, evaluated on multiple security-related text classification tasks.
  • results: GPT-3 data augmentation strategies improve training under data scarcity, especially when known positive-class samples are severely limited.
    Abstract Security classifiers, designed to detect malicious content in computer systems and communications, can underperform when provided with insufficient training data. In the security domain, it is often easy to find samples of the negative (benign) class, and challenging to find enough samples of the positive (malicious) class to train an effective classifier. This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks. We describe a variety of previously-unexamined language-model fine-tuning approaches for this purpose and consider in particular the impact of disproportionate class-imbalances in the training set. Across our evaluation using three state-of-the-art classifiers designed for offensive language detection, review fraud detection, and SMS spam detection, we find that models trained with GPT-3 data augmentation strategies outperform both models trained without augmentation and models trained using basic data augmentation strategies already in common usage. In particular, we find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
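A hedged sketch of the augmentation loop: a generative LM fine-tuned on the scarce positive (malicious) class produces synthetic positives until the training set is rebalanced. GPT-2 stands in for the paper's GPT-3, and deduplication/filtering of the outputs is left out.

```python
from transformers import pipeline

# Stand-in for the paper's fine-tuned GPT-3: any causal LM fine-tuned on the
# scarce positive class can play the same role in a sketch.
generator = pipeline("text-generation", model="gpt2")

def augment_positive_class(seed_texts, n_needed, max_new_tokens=60):
    """Generate synthetic positive-class samples until the class is
    rebalanced; each known positive primes the generator in turn."""
    synthetic, i = [], 0
    while len(synthetic) < n_needed:
        seed = seed_texts[i % len(seed_texts)]
        out = generator(seed, max_new_tokens=max_new_tokens, do_sample=True,
                        num_return_sequences=1)[0]["generated_text"]
        synthetic.append(out[len(seed):].strip())  # keep only the continuation
        i += 1
    return synthetic
```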

Large Language Models are biased to overestimate profoundness

  • paper_url: http://arxiv.org/abs/2310.14422
  • repo_url: None
  • paper_authors: Eugenio Herrera-Berg, Tomás Vergara Browne, Pablo León-Villagrá, Marc-Lluís Vives, Cristian Buc Calderon
  • for: Evaluating GPT-4 and other large language models (LLMs) on judging the profoundness of mundane, motivational, and pseudo-profound statements, and examining biases induced by RLHF.
  • methods: Multiple prompting techniques, including few-shot learning prompts and chain-of-thought prompts, are used to elicit the models' profoundness ratings for the different statement types.
  • results: LLM and human ratings show a significant statement-to-statement correlation regardless of prompting technique, but LLMs systematically overestimate the profoundness of nonsensical statements, except for Tk-instruct, which uniquely underestimates profoundness. Only few-shot learning prompts bring model ratings closer to human ones, and RLHF appears to increase the bias toward overestimating profoundness.
    Abstract Recent advancements in natural language processing by large language models (LLMs), such as GPT-4, have been suggested to approach Artificial General Intelligence. And yet, it is still under dispute whether LLMs possess similar reasoning abilities to humans. This study evaluates GPT-4 and various other LLMs in judging the profoundness of mundane, motivational, and pseudo-profound statements. We found a significant statement-to-statement correlation between the LLMs and humans, irrespective of the type of statements and the prompting technique used. However, LLMs systematically overestimate the profoundness of nonsensical statements, with the exception of Tk-instruct, which uniquely underestimates the profoundness of statements. Only few-shot learning prompts, as opposed to chain-of-thought prompting, draw LLMs ratings closer to humans. Furthermore, this work provides insights into the potential biases induced by Reinforcement Learning from Human Feedback (RLHF), inducing an increase in the bias to overestimate the profoundness of statements.

REFER: An End-to-end Rationale Extraction Framework for Explanation Regularization

  • paper_url: http://arxiv.org/abs/2310.14418
  • repo_url: None
  • paper_authors: Mohammad Reza Ghasemi Madani, Pasquale Minervini
  • for: Leveraging human-annotated textual explanations, which are increasingly important in explainable NLP, for rationale extraction.
  • methods: Proposes REFER, a framework with a differentiable rationale extractor that allows back-propagation through the rationale extraction process, jointly training the task model and the extractor with human highlights.
  • results: REFER yields better faithfulness, plausibility, and downstream task accuracy than previous baselines, improving composite normalized relative gain on e-SNLI and CoS-E by 11% and 3%, respectively.
    Abstract Human-annotated textual explanations are becoming increasingly important in Explainable Natural Language Processing. Rationale extraction aims to provide faithful (i.e., reflective of the behavior of the model) and plausible (i.e., convincing to humans) explanations by highlighting the inputs that had the largest impact on the prediction without compromising the performance of the task model. In recent works, the focus of training rationale extractors was primarily on optimizing for plausibility using human highlights, while the task model was trained on jointly optimizing for task predictive accuracy and faithfulness. We propose REFER, a framework that employs a differentiable rationale extractor that allows to back-propagate through the rationale extraction process. We analyze the impact of using human highlights during training by jointly training the task model and the rationale extractor. In our experiments, REFER yields significantly better results in terms of faithfulness, plausibility, and downstream task accuracy on both in-distribution and out-of-distribution data. On both e-SNLI and CoS-E, our best setting produces better results in terms of composite normalized relative gain than the previous baselines by 11% and 3%, respectively.
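The abstract does not fix the extractor's architecture; one standard way to make rationale extraction differentiable, sketched below under that assumption, scores each token and samples a hard keep/drop mask with the straight-through Gumbel-softmax trick, so the task loss can back-propagate into the extractor.

```python
import torch
import torch.nn.functional as F

class DifferentiableRationaleExtractor(torch.nn.Module):
    """Scores each token and samples a hard keep/drop mask with the
    straight-through Gumbel-softmax trick, letting gradients flow from the
    task loss back into the extractor."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = torch.nn.Linear(hidden_size, 2)  # logits for (drop, keep)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        logits = self.scorer(token_states)                            # (B, T, 2)
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1]   # (B, T)
        return mask  # multiply into the task model's token embeddings
```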

Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14389
  • repo_url: https://github.com/honglizhan/covidet-appraisals-public
  • paper_authors: Hongli Zhan, Desmond C. Ong, Junyi Jessy Li
  • for: This paper is written to address the lack of research on the automatic prediction of cognitive appraisals in emotional experiences.
  • methods: The paper uses a dataset called CovidET-Appraisals, which assesses 24 appraisal dimensions in 241 Reddit posts, to evaluate the ability of large language models to automatically assess and explain cognitive appraisals.
  • results: The paper finds that while the best models are performant, open-sourced LLMs fall short at this task, presenting a new challenge for the future development of emotionally intelligent models.
    Abstract The emotions we experience involve complex processes; besides physiological aspects, research in psychology has studied cognitive appraisals where people assess their situations subjectively, according to their own values (Scherer, 2005). Thus, the same situation can often result in different emotional experiences. While the detection of emotion is a well-established task, there is very limited work so far on the automatic prediction of cognitive appraisals. This work fills the gap by presenting CovidET-Appraisals, the most comprehensive dataset to-date that assesses 24 appraisal dimensions, each with a natural language rationale, across 241 Reddit posts. CovidET-Appraisals presents an ideal testbed to evaluate the ability of large language models -- excelling at a wide range of NLP tasks -- to automatically assess and explain cognitive appraisals. We found that while the best models are performant, open-sourced LLMs fall short at this task, presenting a new challenge in the future development of emotionally intelligent models. We release our dataset at https://github.com/honglizhan/CovidET-Appraisals-Public.

Bi-Encoders based Species Normalization – Pairwise Sentence Learning to Rank

  • paper_url: http://arxiv.org/abs/2310.14366
  • repo_url: None
  • paper_authors: Zainab Awan, Tim Kahlke, Peter Ralph, Paul Kennedy
  • for: Proposing a deep learning approach to biomedical named-entity normalization.
  • methods: Candidate concepts are generated with the Best Matching 25 (BM25) algorithm and then re-ranked with Bidirectional Encoder Representations from Transformers (BERT), treating normalization as a pairwise learning-to-rank problem.
  • results: For species entities, the method outperforms existing approaches at linking entities to the NCBI taxonomy.
    Abstract Motivation: Biomedical named-entity normalization involves connecting biomedical entities with distinct database identifiers in order to facilitate data integration across various fields of biology. Existing systems for biomedical named entity normalization heavily rely on dictionaries, manually created rules, and high-quality representative features such as lexical or morphological characteristics. However, recent research has investigated the use of neural network-based models to reduce dependence on dictionaries, manually crafted rules, and features. Despite these advancements, the performance of these models is still limited due to the lack of sufficiently large training datasets. These models have a tendency to overfit small training corpora and exhibit poor generalization when faced with previously unseen entities, necessitating the redesign of rules and features. Contribution: We present a novel deep learning approach for named entity normalization, treating it as a pair-wise learning to rank problem. Our method utilizes the widely-used information retrieval algorithm Best Matching 25 to generate candidate concepts, followed by the application of bi-directional encoder representation from the encoder (BERT) to re-rank the candidate list. Notably, our approach eliminates the need for feature-engineering or rule creation. We conduct experiments on species entity types and evaluate our method against state-of-the-art techniques using LINNAEUS and S800 biomedical corpora. Our proposed approach surpasses existing methods in linking entities to the NCBI taxonomy. To the best of our knowledge, there is no existing neural network-based approach for species normalization in the literature.
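A sketch of the two-stage pipeline under stated assumptions: `rank_bm25` supplies the BM25 candidate generator, and a sentence-transformers cross-encoder stands in for the paper's BERT re-ranker (the checkpoint name is a placeholder, not the authors' model).

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def bm25_candidates(mention: str, concept_names: list, k: int = 20) -> list:
    """Stage 1: retrieve k candidate concepts for a species mention with BM25."""
    bm25 = BM25Okapi([name.lower().split() for name in concept_names])
    scores = bm25.get_scores(mention.lower().split())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [concept_names[i] for i in top]

# Stage 2: score (mention, candidate) pairs with a BERT cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder

def rerank(mention: str, candidates: list) -> list:
    """Pairwise learning-to-rank sketch: sort candidates by cross-encoder score."""
    scores = reranker.predict([(mention, c) for c in candidates])
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]
```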

Is ChatGPT a game changer for geocoding – a benchmark for geocoding address parsing techniques

  • paper_url: http://arxiv.org/abs/2310.14360
  • repo_url: None
  • paper_authors: Zhengcong Yin, Diya Li, Daniel W. Goldberg
  • for: Assessing the performance of the GPT-3 model on the geocoding address-parsing task.
  • methods: A benchmark dataset of low-quality address descriptions is synthesized from human input patterns mined from production geocoding logs; GPT-3, transformer-based, and LSTM-CRF models are trained and compared on it.
  • results: The Bidirectional LSTM-CRF model performs best, with the transformer-based models achieving very comparable results; GPT-3 trails in performance but shows promise with few-shot examples, leaving room for improvement through additional fine-tuning.
    Abstract The remarkable success of GPT models across various tasks, including toponymy recognition motivates us to assess the performance of the GPT-3 model in the geocoding address parsing task. To ensure that the evaluation more accurately mirrors performance in real-world scenarios with diverse user input qualities and resolve the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized based on human input patterns mining from actual input logs of a geocoding system in production. This dataset has 21 different input errors and variations; contains over 239,000 address records that are uniquely selected from streets across all U.S. 50 states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting its performance with transformer-based and LSTM-based models. The evaluation results indicate that Bidirectional LSTM-CRF model has achieved the best performance over these transformer-based models and GPT-3 model. Transformer-based models demonstrate very comparable results compared to the Bidirectional LSTM-CRF model. The GPT-3 model, though trailing in performance, showcases potential in the address parsing task with few-shot examples, exhibiting room for improvement with additional fine-tuning. We open source the code and data of this presented benchmark so that researchers can utilize it for future model development or extend it to evaluate similar tasks, such as document geocoding.

Cultural and Linguistic Diversity Improves Visual Representations

  • paper_url: http://arxiv.org/abs/2310.14356
  • repo_url: None
  • paper_authors: Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna
  • for: Examining how cultural and linguistic background shapes visual perception in image understanding.
  • methods: The semantic coverage of captions written in different languages is measured with scene graphs, embeddings, and linguistic complexity.
  • results: Multilingual caption sets have higher semantic coverage than monolingual ones, and models trained on multilingual content perform consistently well across test data in different languages.
    Abstract Computer vision often treats perception as objective, and this assumption gets reflected in the way that datasets are collected and models are trained. For instance, image descriptions in different languages are typically assumed to be translations of the same semantic content. However, work in cross-cultural psychology and linguistics has shown that individuals differ in their visual perception depending on their cultural background and the language they speak. In this paper, we demonstrate significant differences in semantic content across languages in both dataset and model-produced captions. When data is multilingual as opposed to monolingual, captions have higher semantic coverage on average, as measured by scene graph, embedding, and linguistic complexity. For example, multilingual captions have on average 21.8% more objects, 24.5% more relations, and 27.1% more attributes than a set of monolingual captions. Moreover, models trained on content from different languages perform best against test data from those languages, while those trained on multilingual content perform consistently well across all evaluation data compositions. Our research provides implications for how diverse modes of perception can improve image understanding.

The Law and NLP: Bridging Disciplinary Disconnects

  • paper_url: http://arxiv.org/abs/2310.14346
  • repo_url: None
  • paper_authors: Robert Mahari, Dominik Stammbach, Elliott Ash, Alex ‘Sandy’ Pentland
  • for: Legal practice is rooted in language, yet legal practitioners and scholars have been slow to adopt natural language processing (NLP) tools; meanwhile, the legal system faces an access-to-justice crisis that NLP could help alleviate.
  • methods: A review of recent trends in the legal NLP literature, arguing that the field is disconnected from the legal community.
  • results: The review finds limited overlap between the legal NLP community and legal academia, and argues that several of the most popular legal NLP tasks fail to address practitioners' needs. The paper proposes tasks that could bridge the disciplinary disconnect and highlights underexplored areas for legal NLP research.
    Abstract Legal practice is intrinsically rooted in the fabric of language, yet legal practitioners and scholars have been slow to adopt tools from natural language processing (NLP). At the same time, the legal system is experiencing an access to justice crisis, which could be partially alleviated with NLP. In this position paper, we argue that the slow uptake of NLP in legal practice is exacerbated by a disconnect between the needs of the legal community and the focus of NLP researchers. In a review of recent trends in the legal NLP literature, we find limited overlap between the legal NLP community and legal academia. Our interpretation is that some of the most popular legal NLP tasks fail to address the needs of legal practitioners. We discuss examples of legal NLP tasks that promise to bridge disciplinary disconnects and highlight interesting areas for legal NLP research that remain underexplored.

Social Commonsense-Guided Search Query Generation for Open-Domain Knowledge-Powered Conversations

  • paper_url: http://arxiv.org/abs/2310.14340
  • repo_url: None
  • paper_authors: Revanth Gangi Reddy, Hao Bai, Wentao Yao, Sharath Chandra Etagi Suresh, Heng Ji, ChengXiang Zhai
  • for: Improving the relevance and specificity of information retrieval in open-domain conversations, making dialog more informative and engaging.
  • methods: A social-commonsense dialog system establishes connections related to the conversation topic, which then guide instruction-driven search query generation; the framework integrates topic tracking and commonsense response generation to handle passive users.
  • results: Experiments show the approach outperforms existing query generation techniques, producing queries that are more relevant, specific, and compelling, and ultimately more engaging responses.
    Abstract Open-domain dialog involves generating search queries that help obtain relevant knowledge for holding informative conversations. However, it can be challenging to determine what information to retrieve when the user is passive and does not express a clear need or request. To tackle this issue, we present a novel approach that focuses on generating internet search queries that are guided by social commonsense. Specifically, we leverage a commonsense dialog system to establish connections related to the conversation topic, which subsequently guides our query generation. Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation. Through extensive evaluations, we show that our approach overcomes limitations of existing query generation techniques that rely solely on explicit dialog information, and produces search queries that are more relevant, specific, and compelling, ultimately resulting in more engaging responses.

DiFair: A Benchmark for Disentangled Assessment of Gender Knowledge and Bias

  • paper_url: http://arxiv.org/abs/2310.14329
  • repo_url: https://github.com/mzakizadeh/difair_public
  • paper_authors: Mahdi Zakizadeh, Kaveh Eskandari Miandoab, Mohammad Taher Pilehvar
  • for: mitigating the gender bias in pretrained language models and evaluating the impact of bias mitigation on useful gender knowledge
  • methods: using a manually curated dataset called DiFair, introducing a unified metric called gender invariance score to quantify both biased behavior and preservation of useful gender knowledge
  • results: experimental results show that debiasing techniques can ameliorate the issue of gender bias, but at the cost of lowering the model’s useful gender knowledge
    Abstract Numerous debiasing techniques have been proposed to mitigate the gender bias that is prevalent in pretrained language models. These are often evaluated on datasets that check the extent to which the model is gender-neutral in its predictions. Importantly, this evaluation protocol overlooks the possible adverse impact of bias mitigation on useful gender knowledge. To fill this gap, we propose DiFair, a manually curated dataset based on masked language modeling objectives. DiFair allows us to introduce a unified metric, gender invariance score, that not only quantifies a model's biased behavior, but also checks if useful gender knowledge is preserved. We use DiFair as a benchmark for a number of widely-used pretained language models and debiasing techniques. Experimental results corroborate previous findings on the existing gender biases, while also demonstrating that although debiasing techniques ameliorate the issue of gender bias, this improvement usually comes at the price of lowering useful gender knowledge of the model.

Towards Harmful Erotic Content Detection through Coreference-Driven Contextual Analysis

  • paper_url: http://arxiv.org/abs/2310.14325
  • repo_url: None
  • paper_authors: Inez Okulska, Emilia Wiśnios
  • for: Developing a hybrid neural and rule-based, context-aware system for detecting harmful contextual cues in erotic content.
  • methods: The system leverages coreference resolution; a dataset was compiled in collaboration with professional moderators and used to build a classifier that distinguishes harmful from non-harmful erotic content.
  • results: On Polish text, the hybrid model reaches 84% accuracy and 80% recall, while RoBERTa- and Longformer-based models without explicit coreference chains perform significantly worse, underscoring the importance of coreference resolution for detecting harmful erotic content.
    Abstract Adult content detection still poses a great challenge for automation. Existing classifiers primarily focus on distinguishing between erotic and non-erotic texts. However, they often need more nuance in assessing the potential harm. Unfortunately, the content of this nature falls beyond the reach of generative models due to its potentially harmful nature. Ethical restrictions prohibit large language models (LLMs) from analyzing and classifying harmful erotics, let alone generating them to create synthetic datasets for other neural models. In such instances where data is scarce and challenging, a thorough analysis of the structure of such texts rather than a large model may offer a viable solution. Especially given that harmful erotic narratives, despite appearing similar to harmless ones, usually reveal their harmful nature first through contextual information hidden in the non-sexual parts of the narrative. This paper introduces a hybrid neural and rule-based context-aware system that leverages coreference resolution to identify harmful contextual cues in erotic content. Collaborating with professional moderators, we compiled a dataset and developed a classifier capable of distinguishing harmful from non-harmful erotic content. Our hybrid model, tested on Polish text, demonstrates a promising accuracy of 84% and a recall of 80%. Models based on RoBERTa and Longformer without explicit usage of coreference chains achieved significantly weaker results, underscoring the importance of coreference resolution in detecting such nuanced content as harmful erotics. This approach also offers the potential for enhanced visual explainability, supporting moderators in evaluating predictions and taking necessary actions to address harmful content.

4 and 7-bit Labeling for Projective and Non-Projective Dependency Trees

  • paper_url: http://arxiv.org/abs/2310.14319
  • repo_url: None
  • paper_authors: Carlos Gómez-Rodríguez, Diego Roca, David Vilares
  • for: Proposing an encoding that represents any projective dependency tree as a sequence of 4-bit labels, one per word.
  • methods: Each word's label encodes (1) whether it is a left or right dependent, (2) whether it is the outermost left/right dependent of its parent, (3) whether it has left children, and (4) whether it has right children; a 7-bit extension adds an extra plane of arcs to cover almost full non-projectivity.
  • results: The encoding can be encoded and decoded in linear time and, on a diverse set of treebanks, the 7-bit variant achieves substantial accuracy gains over the previously best-performing sequence-labeling encodings.
    Abstract We introduce an encoding for parsing as sequence labeling that can represent any projective dependency tree as a sequence of 4-bit labels, one per word. The bits in each word's label represent (1) whether it is a right or left dependent, (2) whether it is the outermost (left/right) dependent of its parent, (3) whether it has any left children and (4) whether it has any right children. We show that this provides an injective mapping from trees to labels that can be encoded and decoded in linear time. We then define a 7-bit extension that represents an extra plane of arcs, extending the coverage to almost full non-projectivity (over 99.9% empirical arc coverage). Results on a set of diverse treebanks show that our 7-bit encoding obtains substantial accuracy gains over the previously best-performing sequence labeling encodings.
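The 4-bit encoding is mechanical enough to transcribe directly. The sketch below labels each word from its head index; treating the root as having its first two bits zeroed is an assumption about a detail the abstract leaves open.

```python
def four_bit_labels(heads: list) -> list:
    """Encode a projective dependency tree as one 4-bit label per word.
    heads[i] is the 0-based head of word i; the root has head -1.
    Bits: (1) right dependent of its head, (2) outermost dependent on its
    side of the head, (3) has left children, (4) has right children."""
    labels = []
    for i, h in enumerate(heads):
        right_dep = h != -1 and i > h
        siblings = [j for j, g in enumerate(heads)
                    if h != -1 and g == h and (j > h) == right_dep]
        outermost = h != -1 and i == (max(siblings) if right_dep else min(siblings))
        has_left = any(g == i for j, g in enumerate(heads) if j < i)
        has_right = any(g == i for j, g in enumerate(heads) if j > i)
        labels.append("".join("1" if b else "0"
                              for b in (right_dep, outermost, has_left, has_right)))
    return labels

# "the cat sleeps" with heads [1, 2, -1] (root = "sleeps"):
print(four_bit_labels([1, 2, -1]))  # ['0100', '0110', '0010']
```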

Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

  • paper_url: http://arxiv.org/abs/2310.14312
  • repo_url: None
  • paper_authors: Anthi Papadopoulou, Pierre Lison, Mark Anderson, Lilja Øvrelid, Ildikó Pilán
  • for: Proposing a two-step text sanitization approach and empirically analyzing it on two recently released datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022).
  • methods: A privacy-oriented entity recognizer, trained by combining a standard named entity recognition model with a gazetteer of person-related terms extracted from Wikidata, first detects text spans expressing identifiable personal information; the privacy risk of each detected span, alone or in combination with others, is then assessed.
  • results: Five distinct indicators of re-identification risk are presented, based respectively on language model probabilities, text span classification, sequence labelling, perturbations, and web search; a contrastive analysis highlights the benefits and limitations of each, notably in relation to the available labeled data.
    Abstract Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pil\'an et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists in assessing the privacy risk associated with each detected text span, either isolated or in combination with other text spans. We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labeled data.
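Of the five indicators, the language-model-probability one is the easiest to sketch. Assuming it scores how predictable a detected span is from its context (the exact formulation is not given in the abstract), a minimal version computes the span's average surprisal under an off-the-shelf causal LM; GPT-2 here is a stand-in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def span_surprisal(context: str, span: str) -> float:
    """Average negative log-probability of `span` given `context`. A low
    value means the span is easy to guess from its context, so masking it
    alone may not prevent re-identification."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    span_ids = tok(" " + span, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, span_ids], dim=1)
    logits = model(ids).logits[0, :-1]            # position t predicts token t+1
    logprobs = torch.log_softmax(logits, dim=-1)
    tgt = ids[0, 1:]
    token_lp = logprobs[torch.arange(tgt.numel()), tgt]
    span_lp = token_lp[-span_ids.size(1):]        # log-probs of the span tokens
    return -span_lp.mean().item()
```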

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

  • paper_url: http://arxiv.org/abs/2310.14303
  • repo_url: None
  • paper_authors: Rishabh Bhardwaj, Soujanya Poria
  • for: Evaluating the harmfulness of large language models (LLMs) through a new form of red-teaming.
  • methods: Parametric red-teaming through Unalignment: instruction-tuning the model parameters to break guardrails that are not deeply rooted in the model's behavior, using as few as 100 examples.
  • results: Unalignment bypasses the safety behavior of CHATGPT with an 88% success rate on harmful queries across two safety benchmarks, and achieves attack success rates above 91% on open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and 13B; it also exposes hidden biases, with strongly biased and opinionated responses 64% of the time.
    Abstract Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate, and applicability to specific models. In this paper, we present a new perspective on LLM safety research i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. Unalignment using as few as 100 examples can significantly bypass commonly referred to as CHATGPT, to the point where it responds with an 88% success rate to harmful queries on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B AND 13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes inherent biases in safety-aligned models such as CHATGPT and LLAMA- 2-CHAT where the model's responses are strongly biased and opinionated 64% of the time.

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

  • paper_url: http://arxiv.org/abs/2310.14278
  • repo_url: None
  • paper_authors: Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie
  • for: Improving the accuracy of conversational ASR systems, in particular by extracting relevant contextual information from previous conversational turns.
  • methods: A conversational ASR system that extends the Conformer encoder-decoder model with cross-modal conversational representations: a cross-modal extractor combines pre-trained speech and text models through a specialized encoder and a modal-level mask input, extracting richer historical speech context without explicit error propagation; conditional latent variational modules in the decoder learn conversation-level attributes such as role preference and topic coherence.
  • results: The model achieves relative accuracy improvements of 8.8% and 23% on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
    Abstract Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel Conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

CT-GAT: Cross-Task Generative Adversarial Attack based on Transferability

  • paper_url: http://arxiv.org/abs/2310.14265
  • repo_url: https://github.com/xiaoxuannlp/ct-gat
  • paper_authors: Minxuan Lv, Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu
  • for: Attacking neural network models with transferable adversarial examples, without relying on substitute models or details of the victim model.
  • methods: A sequence-to-sequence generative model named CT-GAT is trained on adversarial sample data collected from multiple tasks to acquire universal adversarial features and generate adversarial examples for different tasks.
  • results: Experiments on ten distinct datasets show that the method achieves superior attack performance at small cost.
    Abstract Neural network models are vulnerable to adversarial examples, and adversarial transferability further increases the risk of adversarial attacks. Current methods based on transferability often rely on substitute models, which can be impractical and costly in real-world scenarios due to the unavailability of training data and the victim model's structural details. In this paper, we propose a novel approach that directly constructs adversarial examples by extracting transferable features across various tasks. Our key insight is that adversarial transferability can extend across different tasks. Specifically, we train a sequence-to-sequence generative model named CT-GAT using adversarial sample data collected from multiple tasks to acquire universal adversarial features and generate adversarial examples for different tasks. We conduct experiments on ten distinct datasets, and the results demonstrate that our method achieves superior attack performance with small cost.

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

  • paper_url: http://arxiv.org/abs/2310.14262
  • repo_url: None
  • paper_authors: Ivana Kvapilíková, Ondřej Bojar
  • for: Improving the quality of machine translation systems for low-resource languages.
  • methods: Training on pseudo-parallel sentence pairs mined from monolingual corpora in addition to synthetic sentence pairs back-translated from monolingual corpora, with different training schedules.
  • results: Improvements of up to 14.5 BLEU points (English to Ukrainian) over a baseline trained on back-translated data only.
    Abstract Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any translation resources but the quality lags behind, especially in truly low-resource conditions. We propose a training strategy that relies on pseudo-parallel sentence pairs mined from monolingual corpora in addition to synthetic sentence pairs back-translated from monolingual corpora. We experiment with different training schedules and reach an improvement of up to 14.5 BLEU points (English to Ukrainian) over a baseline trained on back-translated data only.
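The abstract does not specify the mining procedure; a common recipe, sketched below as an assumption, embeds both monolingual corpora with a multilingual sentence encoder (LaBSE here) and keeps nearest-neighbour pairs above a similarity threshold as pseudo-parallel data.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# LaBSE maps sentences from many languages into one space, so cosine
# similarity is usable for mining pseudo-parallel pairs.
encoder = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pseudo_parallel(src_sents, tgt_sents, threshold=0.8):
    """Pair each source sentence with its nearest target sentence and keep
    pairs above the threshold (a simple stand-in for margin-based mining)."""
    s = encoder.encode(src_sents, normalize_embeddings=True)
    t = encoder.encode(tgt_sents, normalize_embeddings=True)
    sims = s @ t.T                        # cosine similarities (unit vectors)
    pairs = []
    for i in range(len(src_sents)):
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(sims[i, j])))
    return pairs
```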

From Static to Dynamic: A Continual Learning Framework for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14248
  • repo_url: https://github.com/elfsong/dynamind
  • paper_authors: Mingzhe Du, Anh Tuan Luu, Bin Ji, See-kiong Ng
  • for: Addressing the difficulty of training large language models (LLMs) and their limited ability to continuously assimilate new knowledge, which can lead to inaccurate outputs.
  • methods: DynaMind, a continual learning framework that incorporates memory mechanisms to assimilate new knowledge and modular operators that enhance model inference with that knowledge.
  • results: Benchmark experiments show DynaMind effectively overcomes these challenges and improves the accuracy of LLM outputs.
    Abstract The vast number of parameters in large language models (LLMs) endows them with remarkable capabilities, allowing them to excel in a variety of natural language processing tasks. However, this complexity also presents challenges, making LLMs difficult to train and inhibiting their ability to continuously assimilate new knowledge, which may lead to inaccuracies in their outputs. To mitigate these issues, this paper presents DynaMind, a novel continual learning framework designed for LLMs. DynaMind incorporates memory mechanisms to assimilate new knowledge and modular operators to enhance the model inference process with the newly assimilated knowledge, consequently improving the accuracies of LLMs' outputs. Benchmark experiments demonstrate DynaMind's effectiveness in overcoming these challenges. The code and demo of DynaMind are available on GitHub: https://github.com/Elfsong/DynaMind.

PHD: Pixel-Based Language Modeling of Historical Documents

  • paper_url: http://arxiv.org/abs/2310.18343
  • repo_url: None
  • paper_authors: Nadav Borenstein, Phillip Rust, Desmond Elliott, Isabelle Augenstein
  • for: Exploring pixel-based language modeling as an alternative to OCR for digitized historical documents.
  • methods: A pixel-based language model is trained to reconstruct masked patches of pixels instead of predicting token distributions; to cope with the scarcity of real historical scans, a novel method generates synthetic scans that resemble real historical documents, and the model (PHD) is pre-trained on a mix of synthetic scans and real newspapers from the 1700-1900 period.
  • results: Experiments show PHD is highly proficient at reconstructing masked image patches and exhibits noteworthy language understanding; the model is successfully applied to a historical QA task.
    Abstract The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans to resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
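A toy version of the synthetic-scan idea: render text with PIL and degrade it so it looks more like an aged scan. The specific degradation recipe (paper tone, rotation, blur, speckle) is illustrative, not the paper's.

```python
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def synthetic_scan(text: str, width: int = 600, height: int = 120) -> Image.Image:
    """Render text to a greyscale image and degrade it to resemble a
    historical scan: off-white paper, slight rotation, blur, speckle noise."""
    img = Image.new("L", (width, height), color=235)   # off-white "paper"
    draw = ImageDraw.Draw(img)
    draw.text((10, height // 3), text, fill=20, font=ImageFont.load_default())
    img = img.rotate(random.uniform(-1.5, 1.5), fillcolor=235)
    img = img.filter(ImageFilter.GaussianBlur(radius=0.7))
    px = img.load()
    for _ in range(int(0.002 * width * height)):       # sparse speckle noise
        px[random.randrange(width), random.randrange(height)] = random.choice((0, 255))
    return img
```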

Customising General Large Language Models for Specialised Emotion Recognition Tasks

  • paper_url: http://arxiv.org/abs/2310.14225
  • repo_url: None
  • paper_authors: Liyizhe Peng, Zixing Zhang, Tao Pang, Jing Han, Huan Zhao, Hao Chen, Björn W. Schuller
  • for: Investigating how well large language models (LLMs) perform on linguistic emotion recognition when adapted to that specific task.
  • methods: A publicly available LLM, Chat General Language Model, is customised with two modal adaptation techniques: deep prompt tuning and low-rank adaptation.
  • results: On six widely used datasets, the adapted LLM easily outperforms other state-of-the-art but specialised deep models, indicating the strong transferability and feasibility of LLMs for emotion recognition.
    Abstract The advent of large language models (LLMs) has gained tremendous attention over the past year. Previous studies have shown the astonishing performance of LLMs not only in other tasks but also in emotion recognition in terms of accuracy, universality, explanation, robustness, few/zero-shot learning, and others. Leveraging the capability of LLMs inevitably becomes an essential solution for emotion recognition. To this end, we further comprehensively investigate how LLMs perform in linguistic emotion recognition if we concentrate on this specific task. Specifically, we exemplify a publicly available and widely used LLM -- Chat General Language Model, and customise it for our target by using two different modal adaptation techniques, i.e., deep prompt tuning and low-rank adaptation. The experimental results obtained on six widely used datasets present that the adapted LLM can easily outperform other state-of-the-art but specialised deep models. This indicates the strong transferability and feasibility of LLMs in the field of emotion recognition.
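Low-rank adaptation is straightforward with the `peft` library. The sketch below attaches rank-8 LoRA adapters to a causal LM; the base model and target modules are placeholders, since the paper adapts Chat General Language Model, whose module names differ.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Freeze the base LLM and train small rank-r update matrices inside the
# attention projections. "gpt2" / "c_attn" are placeholders for this sketch.
base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```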

Manifold-Preserving Transformers are Effective for Short-Long Range Encoding

  • paper_url: http://arxiv.org/abs/2310.14206
  • repo_url: https://github.com/victor7246/transject
  • paper_authors: Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
  • for: Improving the expressive power of Transformer encoders, in particular their ability to preserve layer-wise contextual information.
  • methods: TransJect, an encoder model that guarantees a theoretical bound on layer-wise distance preservation between pairs of tokens, using a simple alternative to dot-product attention that ensures Lipschitz continuity and learns injective mappings onto manifolds with similar topology.
  • results: On several short- and long-sequence classification tasks, TransJect improves over Transformer variants by up to 6.8% and 5.9%, respectively, and performs 79% better on language modeling; analysis also suggests its mixture of experts learns more orderly, balanced sparse representations than multi-head self-attention.
    Abstract Multi-head self-attention-based Transformers have shown promise in different learning tasks. Albeit these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings to transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over the variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was incepted to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and unorderly. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.

QA-NatVer: Question Answering for Natural Logic-based Fact Verification

  • paper_url: http://arxiv.org/abs/2310.14198
  • repo_url: None
  • paper_authors: Rami Aly, Marek Strong, Andreas Vlachos
  • for: Fact verification systems assess a claim's veracity based on evidence; faithfulness, i.e. generating explanations that accurately reflect the model's reasoning, is an important design consideration.
  • methods: Question answering is used to predict natural logic operators, exploiting the generalization abilities of instruction-tuned language models and removing the need for annotated training data while retaining a deterministic inference system.
  • results: In a few-shot setting on FEVER, the approach outperforms the best baseline by 4.3 accuracy points, including a state-of-the-art pre-trained seq2seq natural logic system and a state-of-the-art prompt-based classifier; it transfers to a counterfactual dataset and a Danish verification dataset without further annotation, and a human evaluation finds its proofs more plausible, with fewer erroneous natural logic operators.
    Abstract Fact verification systems assess a claim's veracity based on evidence. An important consideration in designing them is faithfulness, i.e. generating explanations that accurately reflect the reasoning of the model. Recent works have focused on natural logic, which operates directly on natural language by capturing the semantic relation of spans between an aligned claim with its evidence via set-theoretic operators. However, these approaches rely on substantial resources for training, which are only available for high-resource languages. To this end, we propose to use question answering to predict natural logic operators, taking advantage of the generalization capabilities of instruction-tuned language models. Thus, we obviate the need for annotated training data while still relying on a deterministic inference system. In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points, including a state-of-the-art pre-trained seq2seq natural logic system, as well as a state-of-the-art prompt-based classifier. Our system demonstrates its robustness and portability, achieving competitive performance on a counterfactual dataset and surpassing all approaches without further annotation on a Danish verification dataset. A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.

An In-Context Schema Understanding Method for Knowledge Base Question Answering

  • paper_url: http://arxiv.org/abs/2310.14174
  • repo_url: None
  • paper_authors: Yantao Liu, Zixuan Li, Xiaolong Jin, Long Bai, Saiping Guan, Jiafeng Guo, Xueqi Cheng
  • for: This work aims to improve the performance of large language models on Knowledge Base Question Answering (KBQA), specifically by strengthening their understanding of knowledge base schemas so that they can serve as semantic parsers.
  • methods: Proposes an In-Context Schema Understanding (ICSU) method that uses the in-context learning mechanism to instruct LLMs to generate SPARQL queries from examples. To retrieve suitable examples from annotated question-query pairs, ICSU explores four different retrieval strategies.
  • results: Experimental results show that ICSU with any of these retrieval strategies significantly outperforms a random retrieval strategy, raising accuracy from 12% to 78.76%.
    Abstract The Knowledge Base Question Answering (KBQA) task aims to answer natural language questions based on a given knowledge base. Semantic parsing-based methods, a common approach to this task, first convert natural language questions into logical forms (e.g., SPARQL queries) and then execute them on knowledge bases to get answers. Recently, Large Language Models (LLMs) have shown strong abilities in language understanding and may be adopted as semantic parsers in such methods. However, a great challenge for LLMs in doing so is to understand the schema of knowledge bases. Therefore, in this paper, we propose an In-Context Schema Understanding (ICSU) method to facilitate the use of LLMs as semantic parsers in KBQA. Specifically, ICSU adopts the in-context learning mechanism to instruct LLMs to generate SPARQL queries from examples. To retrieve appropriate examples from annotated question-query pairs, which contain comprehensive schema information related to the questions, ICSU explores four different retrieval strategies. Experimental results on the largest KBQA benchmark, KQA Pro, show that ICSU with each of these strategies significantly outperforms ICSU with a random retrieval strategy (from 12% to 78.76% in accuracy).
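
The sketch below (assumed, not the authors' code) illustrates the in-context pipeline: retrieve annotated (question, SPARQL) pairs similar to the input question and assemble them into a few-shot prompt. Simple word-overlap retrieval stands in for the paper's four strategies, and the `EXAMPLES` pairs and prompt wording are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical annotated (question, SPARQL) pairs carrying schema information.
EXAMPLES: List[Tuple[str, str]] = [
    ("Who directed Titanic?",
     'SELECT ?p WHERE { ?film <pred:name> "Titanic" . ?film <directed_by> ?p . }'),
    ("When was Barack Obama born?",
     'SELECT ?d WHERE { ?x <pred:name> "Barack Obama" . ?x <date_of_birth> ?d . }'),
]

def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a stand-in for the paper's retrieval strategies."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_prompt(question: str, k: int = 2) -> str:
    ranked = sorted(EXAMPLES, key=lambda ex: overlap(question, ex[0]), reverse=True)[:k]
    shots = "\n\n".join(f"Question: {q}\nSPARQL: {s}" for q, s in ranked)
    return f"{shots}\n\nQuestion: {question}\nSPARQL:"

print(build_prompt("Who directed Inception?"))
# The LLM's completion would then be executed against the knowledge base.
```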

Can Language Models Laugh at YouTube Short-form Videos?

  • paper_url: http://arxiv.org/abs/2310.14159
  • repo_url: https://github.com/dayoon-ko/exfuntube
  • paper_authors: Dayoon Ko, Sangho Lee, Gunhee Kim
  • for: This work develops a dataset and a prompting method to improve large language models' (LLMs) understanding of humorous videos on social media.
  • methods: Curates ExFunTube, a dataset of 10,000 user-generated multimodal funny videos from YouTube, verified through a video filtering pipeline with GPT-3.5 and annotated with timestamps and text explanations for each funny moment.
  • results: A zero-shot video-to-text prompting method effectively improves LLMs' understanding of video humor, as demonstrated by three evaluation methods: automatic scores, rationale quality experiments, and human evaluations.
    Abstract As short-form funny videos on social networks gain popularity, it is becoming increasingly important for AI models to understand them in order to communicate better with humans. Unfortunately, previous video humor datasets target specific domains, such as speeches or sitcoms, and mostly focus on verbal cues. We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both the verbal and visual elements contributing to humor. After filtering, we annotate each video with timestamps and text explanations for its funny moments. ExFunTube is unique among existing datasets in that its videos cover a wide range of domains with various types of humor that necessitate a multimodal understanding of the content. We also develop a zero-shot video-to-text prompting method to maximize the video humor understanding of large language models (LLMs). Using three different evaluation methods (automatic scores, rationale quality experiments, and human evaluations), we show that our prompting significantly improves LLMs' ability to explain humor.
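
The following sketch (assumed, not the dataset's released code) illustrates what zero-shot video-to-text prompting can look like: the video is first verbalised as text (a transcript plus frame-level captions from off-the-shelf models), and that text is passed to an LLM with an instruction to explain the funny moment. All function names and the prompt wording here are placeholders.

```python
from typing import Callable, List

def verbalize_video(transcript: str, frame_captions: List[str]) -> str:
    """Turn the video into text: spoken words plus per-frame visual captions."""
    visual = " ".join(f"[{i}s] {c}" for i, c in enumerate(frame_captions))
    return f"Transcript: {transcript}\nVisual descriptions: {visual}"

def explain_humor(video_text: str, llm: Callable[[str], str]) -> str:
    prompt = (f"{video_text}\n\n"
              "Explain in one or two sentences why this video is funny, "
              "referring to both what is said and what is shown.")
    return llm(prompt)

# Usage with a stub in place of a real LLM:
stub = lambda p: "The caption contradicts the action on screen, creating irony."
text = verbalize_video("I am a professional chef.",
                       ["a man burns toast", "smoke fills the kitchen"])
print(explain_humor(text, stub))
```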

Orthogonal Subspace Learning for Language Model Continual Learning

  • paper_url: http://arxiv.org/abs/2310.14152
  • repo_url: None
  • paper_authors: Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, Xuanjing Huang
  • for: This paper addresses catastrophic forgetting in language models that learn multiple tasks sequentially.
  • methods: Proposes orthogonal low-rank adaptation (O-LoRA), a simple and effective method that mitigates catastrophic forgetting when learning new tasks. O-LoRA learns each task in a distinct low-rank vector subspace kept orthogonal to the others, minimizing interference between tasks.
  • results: Experiments show that, compared with existing methods, O-LoRA better preserves language models' generalization ability on unseen tasks. In addition, O-LoRA adds only marginal extra parameter costs and requires no storage of user data for replay.
    Abstract Benefiting from massive corpora and advanced hardware, large language models (LLMs) exhibit remarkable capabilities in language understanding and generation. However, their performance degrades in scenarios where multiple tasks are encountered sequentially, also known as catastrophic forgetting. In this paper, we propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference. Our method induces only marginal additional parameter costs and requires no user data storage for replay. Experimental results on continual learning benchmarks show that our method outperforms state-of-the-art methods. Furthermore, compared to previous approaches, our method excels in preserving the generalization ability of LLMs on unseen tasks.
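
The core mechanism lends itself to a short sketch (assumed, not the authors' implementation): each task gets a low-rank update, and the new task's A matrix is pushed into a subspace orthogonal to earlier tasks' subspaces by penalising the overlap between the new A and the frozen A matrices of previous tasks. The exact form of the penalty below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 128, 4                          # hidden size and LoRA rank (illustrative)

prev_As = [rng.normal(size=(r, d)) for _ in range(2)]  # frozen A matrices from earlier tasks
new_A = rng.normal(size=(r, d))                        # trainable A matrix for the new task

def orthogonality_loss(new_a: np.ndarray, old_as: list) -> float:
    """Penalise overlap between the new task's row space and each old task's row space."""
    return float(sum(np.sum((new_a @ a_old.T) ** 2) for a_old in old_as))

# During fine-tuning this penalty is added to the task loss, so gradient steps
# drive new_A toward a subspace orthogonal to all previous task subspaces.
print(orthogonality_loss(new_A, prev_As))
```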

PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain

  • paper_url: http://arxiv.org/abs/2310.14151
  • repo_url: https://github.com/michael-wzhu/PromptCBLUE
  • paper_authors: Wei Zhu, Xiaoling Wang, Huanran Zheng, Mosha Chen, Buzhou Tang
  • for: To evaluate Chinese LLMs' multi-task capabilities on a wide range of bio-medical tasks.
  • methods: The authors re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, PromptCBLUE, designed to evaluate Chinese LLMs on medical entity recognition, medical text classification, medical natural language inference, medical dialogue understanding, and medical content/dialogue generation.
  • results: The authors fine-tune 9 Chinese LLMs with different techniques and report the results.
    Abstract Biomedical language understanding benchmarks are the driving force behind artificial intelligence applications with large language model (LLM) back-ends. However, most current benchmarks (a) are limited to English, which makes it challenging to replicate many of the English-language successes in other languages, (b) focus on knowledge probing of LLMs and neglect to evaluate how LLMs apply this knowledge to a wide range of bio-medical tasks, or (c) have become publicly available corpora that are leaked to LLMs during pre-training. To facilitate research on medical LLMs, we re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, PromptCBLUE. Our benchmark is a suitable test-bed and an online platform for evaluating Chinese LLMs' multi-task capabilities on a wide range of bio-medical tasks, including medical entity recognition, medical text classification, medical natural language inference, medical dialogue understanding, and medical content/dialogue generation. To establish evaluation on these tasks, we experiment with 9 current Chinese LLMs fine-tuned with different fine-tuning techniques and report the results.
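
To show what "prompt-tuning benchmark" means in practice, here is a minimal sketch (assumed, not the benchmark's released code) of recasting a structured CBLUE-style medical NER sample as a prompt/response pair; the template wording and the sample are invented for illustration.

```python
def ner_to_prompt(text: str, entity_types: list) -> str:
    # Instruction-style prompt: "find the entities (of these types) in this medical text".
    types = "、".join(entity_types)
    return f"找出下面医学文本中的实体({types}):\n{text}\n实体:"

def entities_to_response(entities: dict) -> str:
    # Target output: one line per entity type.
    return "\n".join(f"{etype}: {', '.join(spans)}" for etype, spans in entities.items())

# Hypothetical sample: "Patient reports a three-day headache, relieved after ibuprofen."
sample_text = "患者主诉头痛三天,服用布洛芬后缓解。"
print(ner_to_prompt(sample_text, ["症状", "药物"]))        # entity types: symptom, drug
print(entities_to_response({"症状": ["头痛"], "药物": ["布洛芬"]}))
```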