cs.CL - 2023-12-06

Collaboration or Corporate Capture? Quantifying NLP’s Reliance on Industry Artifacts and Contributions

paper_url: http://arxiv.org/abs/2312.03912
repo_url: None
paper_authors: Will Aitken, Mohamed Abdalla, Karen Rudie, Catherine Stinson
for: This paper investigates the reliance on industry for NLP publications, specifically looking at the citations of industry artifacts and contributions in papers presented at EMNLP 2022.
methods: The paper surveys 100 papers published at EMNLP 2022 to determine the frequency of citations of industry artifacts and contributions.
results: The paper finds that there is a substantial reliance on industry for NLP publications, with citations of industry artifacts and contributions being at least three times greater than industry publication rates per year. The paper discusses two possible perspectives on this finding: 1) collaboration with industry is still collaboration, even in the absence of an alternative, or 2) free NLP inquiry has been captured by the motivations and research direction of private corporations.

Abstract
The advent of transformers, higher computational budgets, and big data has engendered remarkable progress in Natural Language Processing (NLP). Impressive performance of industry pre-trained models has garnered public attention in recent years and made news headlines. That these are industry models is noteworthy. Rarely, if ever, are academic institutes producing exciting new NLP models. Using these models is critical for competing on NLP benchmarks and correspondingly to stay relevant in NLP research. We surveyed 100 papers published at EMNLP 2022 to determine whether this phenomenon constitutes a reliance on industry for NLP publications. We find that there is indeed a substantial reliance. Citations of industry artifacts and contributions across categories are at least three times greater than industry publication rates per year. Quantifying this reliance does not settle how we ought to interpret the results. We discuss two possible perspectives: 1) Is collaboration with industry still collaboration in the absence of an alternative? Or 2) has free NLP inquiry been captured by the motivations and research direction of private corporations?

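Summary
The advent of transformers, higher computational budgets, and big data has driven major progress in natural language processing (NLP). In recent years, the strong performance of industry pre-trained models has attracted public attention and made news headlines. It is noteworthy that these are industry models: academic institutions rarely, if ever, produce exciting new NLP models. We surveyed 100 papers published at EMNLP 2022 to determine whether this amounts to a reliance on industry. We find that there is indeed a substantial reliance: citations of industry artifacts and contributions are at least three times greater than industry publication rates per year. Quantifying this reliance does not settle how the results should be interpreted. We discuss two possible perspectives: 1) is collaboration with industry still collaboration in the absence of an alternative, or 2) has free NLP inquiry been captured by the motivations and research direction of private corporations?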

Revisiting the Optimality of Word Lengths

paper_url: http://arxiv.org/abs/2312.03897
repo_url: https://github.com/tpimentelms/optimality-of-word-lengths
paper_authors: Tiago Pimentel, Clara Meister, Ethan Gotlieb Wilcox, Kyle Mahowald, Ryan Cotterell
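for: The paper revisits Zipf's (1935) hypothesis that wordforms are optimized to minimize utterances' communicative costs.
methods: The authors examine the channel capacity hypothesis (CCH) of Piantadosi et al. (2011), show that the earlier derivation minimizes only a lower bound on CCH's cost (CCH-lower), and propose a novel derivation under which word length should be proportional to surprisal's expectation plus its variance-to-mean ratio.
results: Across 13 languages and several experimental settings, word length is better predicted by frequency than by either CCH-lower or CCH. Moreover, when surprisal's expectation, or expectation plus variance-to-mean ratio, is estimated using better language models, word length predictions get worse. These results support Zipf's longstanding hypothesis.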

Abstract
Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs. Under the assumption that cost is given by an utterance's length, he supported this claim by showing that words' lengths are inversely correlated with their frequencies. Communicative cost, however, can be operationalized in different ways. Piantadosi et al. (2011) claim that cost should be measured as the distance between an utterance's information rate and channel capacity, which we dub the channel capacity hypothesis (CCH) here. Following this logic, they then proposed that a word's length should be proportional to the expected value of its surprisal (negative log-probability in context). In this work, we show that Piantadosi et al.'s derivation does not minimize CCH's cost, but rather a lower bound, which we term CCH-lower. We propose a novel derivation, suggesting an improved way to minimize CCH's cost. Under this method, we find that a language's word lengths should instead be proportional to the surprisal's expectation plus its variance-to-mean ratio. Experimentally, we compare these three communicative cost functions: Zipf's, CCH-lower, and CCH. Across 13 languages and several experimental settings, we find that length is better predicted by frequency than either of the other hypotheses. In fact, when surprisal's expectation, or expectation plus variance-to-mean ratio, is estimated using better language models, it leads to worse word length predictions. We take these results as evidence that Zipf's longstanding hypothesis holds.

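Summary
Zipf (1935) posited that wordforms are optimized to minimize the communicative cost of utterances. Assuming that cost is given by an utterance's length, he supported this claim by showing that word lengths are inversely correlated with word frequencies. Communicative cost, however, can be operationalized in different ways. Piantadosi et al. (2011) argue that cost should be measured as the distance between an utterance's information rate and channel capacity, which we call the channel capacity hypothesis (CCH). Following this logic, they proposed that a word's length should be proportional to the expected value of its surprisal (negative log-probability in context). We show that their derivation does not minimize CCH's cost but rather a lower bound on it, which we term CCH-lower. We propose a novel derivation that suggests an improved way to minimize CCH's cost; under it, word lengths should be proportional to surprisal's expectation plus its variance-to-mean ratio. Experimentally, we compare the three cost functions: Zipf's, CCH-lower, and CCH. Across 13 languages and several experimental settings, length is better predicted by frequency than by either of the other hypotheses. In fact, estimating surprisal's expectation, or expectation plus variance-to-mean ratio, with better language models leads to worse word length predictions. We take these results as evidence that Zipf's longstanding hypothesis holds.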
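
The three quantities compared in this entry can be made concrete with a small sketch. It assumes we already have a word's unigram probability and its surprisal (in bits) in a handful of sampled contexts; the function name and inputs are hypothetical, and using unigram surprisal as the Zipf predictor is our own simplification rather than something stated in the abstract.

```python
import math
from statistics import fmean, pvariance


def length_predictors(unigram_prob, surprisals):
    """Return three cost-based predictors of a word's length.

    unigram_prob: the word's relative frequency (assumed input).
    surprisals: the word's surprisal in bits in each sampled context.
    """
    mean = fmean(surprisals)
    variance = pvariance(surprisals, mu=mean)
    return {
        # Zipf: rarer words are longer; unigram surprisal is our stand-in.
        "zipf": -math.log2(unigram_prob),
        # CCH-lower: expected surprisal in context.
        "cch_lower": mean,
        # CCH: expectation plus variance-to-mean ratio of surprisal.
        "cch": mean + variance / mean,
    }


# Toy numbers, not data from the paper.
predictors = length_predictors(0.25, [1.0, 2.0, 3.0])
```

In an actual comparison each predictor would be correlated with observed word lengths across a lexicon; the sketch only shows how the three values differ for a single word.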

PROMISE: A Framework for Model-Driven Stateful Prompt Orchestration

paper_url: http://arxiv.org/abs/2312.03699
repo_url: None
paper_authors: Wenyuan Wu, Jasmin Heierli, Max Meisterhans, Adrian Moser, Andri Färber, Mateusz Dolata, Elena Gavagnin, Alexandre de Spindler, Gerhard Schwabe
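for: The paper presents a framework that helps developers build complex language-based interactions with information systems.
methods: PROMISE uses state machine modeling concepts to enable model-driven, dynamic prompt orchestration across hierarchically nested states and transitions, which improves control over language model behavior.
results: The authors apply PROMISE to application scenarios within health information systems and demonstrate its ability to handle complex interactions.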

Abstract
The advent of increasingly powerful language models has raised expectations for language-based interactions. However, controlling these models is a challenge, emphasizing the need to be able to investigate the feasibility and value of their application. We present PROMISE, a framework that facilitates the development of complex language-based interactions with information systems. Its use of state machine modeling concepts enables model-driven, dynamic prompt orchestration across hierarchically nested states and transitions. This improves the control of the behavior of language models and thus enables their effective and efficient use. We show the benefits of PROMISE in the context of application scenarios within health information systems and demonstrate its ability to handle complex interactions.

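Summary
Increasingly powerful language models have raised expectations for language-based interactions. Controlling these models is a challenge, however, which underlines the need to be able to investigate the feasibility and value of their application. We present PROMISE, a framework that uses state machine modeling concepts for model-driven, dynamic prompt orchestration across hierarchically nested states and transitions, improving control over the behavior of language models. We apply PROMISE in health information systems and show that it can handle complex interactions.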
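
To illustrate what prompt orchestration over hierarchically nested states might look like, here is a minimal sketch. It is not PROMISE's API: the `State` and `Orchestrator` classes, the trigger names, and the prompts are all hypothetical, and in a real system the trigger would typically be decided by a language model rather than passed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    name: str
    prompt: str
    parent: "State | None" = None
    transitions: dict = field(default_factory=dict)  # trigger -> State

    def composed_prompt(self):
        """Concatenate prompts from the outermost ancestor down to this state."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.prompt)
            node = node.parent
        return "\n".join(reversed(parts))


class Orchestrator:
    def __init__(self, initial):
        self.current = initial

    def fire(self, trigger):
        """Follow a transition defined on the current state or on an ancestor."""
        node = self.current
        while node is not None:
            if trigger in node.transitions:
                self.current = node.transitions[trigger]
                return self.current
            node = node.parent
        return self.current  # unknown trigger: stay in the current state


# Hypothetical health-assistant conversation.
intake = State("intake", "You are a health assistant.")
symptoms = State("symptoms", "Ask about symptoms.", parent=intake)
summary = State("summary", "Summarize the conversation.", parent=intake)
closing = State("closing", "Say goodbye.")
symptoms.transitions["done"] = summary
intake.transitions["abort"] = closing  # inherited by every nested state

orchestrator = Orchestrator(symptoms)
```

Nesting lets an outer state contribute shared instructions and shared exits, so each inner state only carries the prompt text specific to its step.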

Evaluating and Mitigating Discrimination in Language Model Decisions

paper_url: http://arxiv.org/abs/2312.03689
repo_url: None
paper_authors: Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli
for: The paper evaluates the potential discriminatory impact of language models (LMs) across a wide range of use cases, including hypothetical ones where they have not yet been deployed.
methods: The authors use an LM to generate a wide array of prompts that decision-makers might input into an LM, systematically vary the demographic information in each prompt, and apply this methodology to the Claude 2.0 model.
results: The authors find patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied, and demonstrate techniques that significantly decrease both through careful prompt engineering.

Abstract
As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval

Summary
We use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society. We systematically vary the demographic information in each prompt to identify patterns of both positive and negative discrimination in the Claude 2.0 model in select settings. Our findings reveal that the model exhibits both positive and negative discrimination in certain situations, highlighting the need for careful prompt engineering to mitigate these biases. While we do not endorse or permit the use of language models for automated decision-making in high-risk use cases, our work demonstrates techniques to significantly decrease both positive and negative discrimination. By anticipating, measuring, and addressing discrimination, our method enables developers and policymakers to safely deploy language models in appropriate use cases. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
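
The idea of systematically varying demographic information in a prompt can be sketched as follows. The template, the attribute values, and the gap measure are illustrative assumptions, not the released dataset or the paper's scoring method; the probabilities in the example are made up.

```python
from itertools import product

# Hypothetical decision template with demographic placeholders.
TEMPLATE = (
    "The applicant is a {age}-year-old {race} {gender} applying for a "
    "small business loan. Should the application be approved?"
)
AGES = [20, 40, 60]
GENDERS = ["female", "male", "non-binary"]
RACES = ["Asian", "Black", "white"]


def demographic_variants(template, ages, genders, races):
    """Fill the template with every combination of demographic attributes."""
    return [
        {"age": a, "gender": g, "race": r,
         "prompt": template.format(age=a, gender=g, race=r)}
        for a, g, r in product(ages, genders, races)
    ]


def approval_gaps(p_yes, baseline):
    """Difference in P(yes) between each group and a baseline group.

    Positive values favour the group; negative values disfavour it.
    """
    base = p_yes[baseline]
    return {group: p - base for group, p in p_yes.items()}


variants = demographic_variants(TEMPLATE, AGES, GENDERS, RACES)
# Made-up probabilities, only to show the shape of the comparison.
gaps = approval_gaps({"group_a": 0.7, "group_b": 0.5}, baseline="group_b")
```

Holding the scenario fixed and changing only the demographic fields is what lets any difference in the model's answers be attributed to those fields.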

Interpretability Illusions in the Generalization of Simplified Models

paper_url: http://arxiv.org/abs/2312.03656
repo_url: None
paper_authors: Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun
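for: The study examines how faithful simplified model representations are when they are used to study deep learning systems.
methods: The authors simplify Transformer models with tools such as dimensionality reduction and clustering, then compare the simplified proxies against the original models on out-of-distribution test sets.
results: Even when the simplified representations accurately approximate the full model on the training set, they are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better. This holds even when the simplified representations do not directly depend on the training distribution.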

Abstract
A common method to study deep learning systems is to use simplified model representations -- for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplifications are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution -- the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better. This finding holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and simplified proxies, and conduct further analysis to investigate which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.

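Summary
A common way to study deep learning systems is to use simplified model representations, for example using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the simplified results are faithful to the original model. We illustrate an important caveat: even if simplified representations approximate the full model accurately on the training set, they may not accurately reflect the model's behavior out of distribution. We train Transformer models on controlled datasets with systematic generalization splits, starting with the Dyck balanced-parenthesis languages. We then simplify these models with tools such as dimensionality reduction and clustering, and directly test how well the simplified proxies match the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. Where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better. This holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in computer code. We find similar generalization gaps between the original model and the simplified proxies, and analyze which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about how reliably mechanistic interpretations derived with tools like SVD can predict what a model will do in novel situations.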
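
One simple way to express the faithfulness comparison described in this entry is to measure how often a simplified proxy agrees with the original model on each test split. The sketch below assumes both models' predictions are already available as lists; the split names and toy predictions are hypothetical, and agreement rate is only one possible faithfulness measure, not necessarily the one used in the paper.

```python
def agreement(original_preds, proxy_preds):
    """Fraction of inputs on which the proxy predicts what the original does."""
    if len(original_preds) != len(proxy_preds):
        raise ValueError("prediction lists must have the same length")
    matches = sum(o == p for o, p in zip(original_preds, proxy_preds))
    return matches / len(original_preds)


def faithfulness_gaps(splits, reference="in_distribution"):
    """Agreement on the reference split minus agreement on every other split.

    splits maps a split name to (original_preds, proxy_preds).
    """
    ref = agreement(*splits[reference])
    return {
        name: ref - agreement(*preds)
        for name, preds in splits.items()
        if name != reference
    }


# Toy bracket predictions, invented for illustration.
splits = {
    "in_distribution": ([")", "]", ")", "]"], [")", "]", ")", "]"]),
    "deeper_nesting": ([")", "]", ")", "]"], [")", ")", ")", ")"]),
}
gaps = faithfulness_gaps(splits)
```

A proxy that looks perfect in distribution but shows a large gap on a held-out split is exactly the situation the abstract warns about.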

Improving Bias Mitigation through Bias Experts in Natural Language Understanding

paper_url: http://arxiv.org/abs/2312.03577
repo_url: https://github.com/jej127/bias-experts
paper_authors: Eojin Jeon, Mingyu Lee, Juhyeong Park, Yeachan Kim, Wing-Lam Mok, SangKeun Lee
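for: The paper aims to mitigate the effect of dataset biases so that models perform better on out-of-distribution data.
methods: The authors introduce binary classifiers, called bias experts, between the auxiliary model and the main model. Each bias expert is trained via the One-vs-Rest approach to improve the auxiliary model's ability to identify bias.
results: Experiments show that the proposed strategy improves the auxiliary model's bias identification ability, and that the resulting debiased model consistently outperforms the state of the art on various challenge datasets.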

Abstract
Biases in the dataset often enable the model to achieve high performance on in-distribution data, while poorly performing on out-of-distribution data. To mitigate the detrimental effect of the bias on the networks, previous works have proposed debiasing methods that down-weight the biased examples identified by an auxiliary model, which is trained with explicit bias labels. However, finding a type of bias in datasets is a costly process. Therefore, recent studies have attempted to make the auxiliary model biased without the guidance (or annotation) of bias labels, by constraining the model's training environment or the capability of the model itself. Despite the promising debiasing results of recent works, the multi-class learning objective, which has been naively used to train the auxiliary model, may harm the bias mitigation effect due to its regularization effect and competitive nature across classes. As an alternative, we propose a new debiasing framework that introduces binary classifiers between the auxiliary model and the main model, coined bias experts. Specifically, each bias expert is trained on a binary classification task derived from the multi-class classification task via the One-vs-Rest approach. Experimental results demonstrate that our proposed strategy improves the bias identification ability of the auxiliary model. Consequently, our debiased model consistently outperforms the state-of-the-art on various challenge datasets.
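
The One-vs-Rest step mentioned above can be sketched in a few lines: a multi-class label list is turned into one binary task per class, and each bias expert would then be trained on one of these tasks. The function name and the NLI-style labels are illustrative assumptions; how the experts' outputs are combined to down-weight examples is not shown here.

```python
def one_vs_rest_tasks(labels, classes=None):
    """Derive one binary label list per class from multi-class labels.

    For class c, an example is labelled 1 if its class is c and 0 otherwise.
    """
    if classes is None:
        classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical three-class natural language inference labels.
labels = ["entailment", "neutral", "contradiction", "entailment"]
tasks = one_vs_rest_tasks(labels)
```

Because each binary task has its own classifier, the classes no longer compete inside a single softmax, which is the property the abstract points to as the motivation.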


XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering

paper_url: http://arxiv.org/abs/2312.03567
repo_url: None
paper_authors: Joel Stremmel, Ardavan Saeedi, Hamid Hassanzadeh, Sanjit Batra, Jeffrey Hertzberg, Jaime Murillo, Eran Halperin
for: The paper targets extractive question answering over medical records, which lets physicians and researchers query records to design clinical studies and understand patient medical history.
methods: The paper introduces XAIQA, an approach that generates synthetic QA pairs at scale from data naturally available in electronic health records. It uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes.
results: XAIQA identifies more semantic matches and clinical abbreviations than two popular approaches that use sentence transformers to create QA pairs, and it improves the performance of GPT-4 as an extractive QA model, including on difficult questions.

Abstract
Extractive question answering (QA) systems can enable physicians and researchers to query medical records, a foundational capability for designing clinical studies and understanding patient medical history. However, building these systems typically requires expert-annotated QA pairs. Large language models (LLMs), which can perform extractive QA, depend on high quality data in their prompts, specialized for the application domain. We introduce a novel approach, XAIQA, for generating synthetic QA pairs at scale from data naturally available in electronic health records. Our method uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes. In an expert evaluation with two physicians, our method identifies $2.2\times$ more semantic matches and $3.8\times$ more clinical abbreviations than two popular approaches that use sentence transformers to create QA pairs. In an ML evaluation, adding our QA pairs improves performance of GPT-4 as an extractive QA model, including on difficult questions. In both the expert and ML evaluations, we examine trade-offs between our method and sentence transformers for QA pair generation depending on question difficulty.

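Summary
Extractive question answering (QA) systems let physicians and researchers query medical records, an important capability for designing clinical studies and understanding patient medical history. Building these systems, however, typically requires expert-annotated QA pairs. Large language models (LLMs) can perform extractive QA, but they depend on high-quality data in their prompts. We introduce XAIQA, a new approach for generating synthetic QA pairs at scale from data naturally available in electronic health records. Our method uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes. In an expert evaluation with two physicians, our method identifies $2.2\times$ more semantic matches and $3.8\times$ more clinical abbreviations than two approaches based on sentence transformers. In an ML evaluation, adding our QA pairs improves the performance of GPT-4 as an extractive QA model, including on difficult questions. In both evaluations, we examine the trade-offs between our method and sentence transformers for QA pair generation depending on question difficulty.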

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

paper_url: http://arxiv.org/abs/2312.03549
repo_url: None
paper_authors: Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
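for: The paper aims to improve the efficiency and scalability of training large language models (LLMs) across clusters with heterogeneous NICs.
methods: The paper combines carefully designed data and model parallelism strategies with a novel scheduling method that allocates distinct computational tasklets to specific groups of GPU devices based on the characteristics of their connected NICs.
results: The framework trains effectively in heterogeneous NIC environments, scales to multiple GPU clusters, and can be integrated with existing mainstream LLM frameworks.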

Abstract
Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks under heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.

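Summary
Large language models (LLMs) such as GPT-3, OPT, and LLaMA perform remarkably well across a wide range of tasks, but training them can be very expensive, often requiring tens of thousands of GPUs for months of continuous operation. This training is typically carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). Acquiring and maintaining such dedicated clusters is challenging. Current LLM training frameworks, such as Megatron-LM and Megatron-DeepSpeed, focus on optimizing training in homogeneous cluster settings. We introduce Holmes, an LLM training framework that applies carefully designed data and model parallelism strategies in a heterogeneous NIC environment. Our main technical contribution is a novel scheduling method that allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. In addition, using pipeline parallel techniques, the framework scales to multiple GPU clusters even without high-speed interconnects between nodes in distinct clusters. We ran comprehensive experiments covering various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance close to that of homogeneous RDMA-capable networks (InfiniBand or RoCE) and significantly exceeds training efficiency in a pure Ethernet environment. We also verified that our framework outperforms other mainstream LLM frameworks in training efficiency under a heterogeneous NIC environment and can be integrated with them.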
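
As a rough illustration of NIC-aware placement, the sketch below groups devices by NIC type and maps one pipeline stage to each group, so that bandwidth-hungry collective traffic stays among devices with the same NIC and only stage-to-stage traffic crosses NIC types. This mapping is our own simplifying assumption, not Holmes's scheduling algorithm, and all device names and function names are hypothetical.

```python
from collections import defaultdict


def group_by_nic(devices):
    """Group device ids by the type of NIC they are attached to.

    devices is an iterable of (device_id, nic_type) pairs.
    """
    groups = defaultdict(list)
    for device_id, nic_type in devices:
        groups[nic_type].append(device_id)
    return dict(groups)


def assign_pipeline_stages(groups):
    """Map one pipeline stage to each NIC group (a simplifying assumption)."""
    return {
        stage: {"nic": nic, "devices": devs}
        for stage, (nic, devs) in enumerate(sorted(groups.items()))
    }


# Hypothetical four-GPU setup split across two NIC types.
devices = [
    ("gpu0", "infiniband"), ("gpu1", "infiniband"),
    ("gpu2", "ethernet"), ("gpu3", "ethernet"),
]
groups = group_by_nic(devices)
stages = assign_pipeline_stages(groups)
```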

Sig-Networks Toolkit: Signature Networks for Longitudinal Language Modelling

paper_url: http://arxiv.org/abs/2312.03523
repo_url: None
paper_authors: Talia Tseriotou, Ryan Sze-Yin Chan, Adam Tsakalidis, Iman Munire Bilal, Elena Kochkina, Terry Lyons, Maria Liakata
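for: The paper presents Sig-Networks, an open-source, pip installable toolkit for longitudinal language modelling.
methods: The toolkit centers on signature-based neural network models, which have recently shown success in temporal tasks. The paper applies and extends published research to provide a full suite of signature-based models whose components can be used as PyTorch building blocks in future architectures. Sig-Networks supports task-agnostic dataset plug-in, simple pre-processing for sequential data, parameter flexibility, and automated tuning across a range of models.
results: The paper evaluates signature networks on three NLP tasks (counselling conversations, rumour stance switch, and mood changes in social media threads), reaches state-of-the-art performance on all three, and provides guidance for future tasks.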

Abstract
We present an open-source, pip installable toolkit, Sig-Networks, the first of its kind for longitudinal language modelling. A central focus is the incorporation of Signature-based Neural Network models, which have recently shown success in temporal tasks. We apply and extend published research providing a full suite of signature-based models. Their components can be used as PyTorch building blocks in future architectures. Sig-Networks enables task-agnostic dataset plug-in, seamless pre-processing for sequential data, parameter flexibility, automated tuning across a range of models. We examine signature networks under three different NLP tasks of varying temporal granularity: counselling conversations, rumour stance switch and mood changes in social media threads, showing SOTA performance in all three, and provide guidance for future tasks. We release the Toolkit as a PyTorch package with an introductory video, Git repositories for preprocessing and modelling including sample notebooks on the modeled NLP tasks.

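Summary
We present Sig-Networks, an open-source, pip installable toolkit that is the first of its kind for longitudinal language modelling. A central focus is the incorporation of signature-based neural network models, which have shown success in temporal tasks. We apply and extend published research to provide a full suite of signature-based models, whose components can be used as PyTorch building blocks in future architectures. Sig-Networks supports task-agnostic dataset plug-in, pre-processing for sequential data, parameter flexibility, and automated model tuning. We evaluate on three NLP tasks (counselling conversations, rumour stance switch, and mood changes in social media threads) and reach state-of-the-art performance on all three. We release the toolkit as a PyTorch package, together with an introductory video, Git repositories, and sample notebooks.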

Exploring Answer Information Methods for Question Generation with Transformers

paper_url: http://arxiv.org/abs/2312.03483
repo_url: None
paper_authors: Talha Chafekar, Aafiya Hussain, Grishma Sharma, Deepak Sharma
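for: The study explores how different methods of providing the target answer as input affect question generation with transformers; earlier experiments of this kind were mostly carried out on RNN-based models.
methods: The study uses three methods and their combinations: answer prompting, a custom product method using answer embeddings and encoder outputs, choosing sentences from the input paragraph that contain answer-related information, and a separate cross-attention block in the decoder that attends to the answer.
results: Answer prompting without any additional modes obtains the best scores on the ROUGE and METEOR metrics. The authors also use a custom metric to count how many generated questions have the same answer as the one used to generate them.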

Abstract
There has been a lot of work in question generation where different methods to provide target answers as input have been employed. This experimentation has been mostly carried out for RNN based models. We use three different methods and their combinations for incorporating answer information and explore their effect on several automatic evaluation metrics. The methods that are used are answer prompting, using a custom product method using answer embeddings and encoder outputs, choosing sentences from the input paragraph that have answer related information, and using a separate cross-attention block in the decoder which attends to the answer. We observe that answer prompting without any additional modes obtains the best scores across ROUGE and METEOR. Additionally, we use a custom metric to calculate how many of the generated questions have the same answer as the answer which is used to generate them.

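Summary
Much work in question generation has used different methods to provide the target answer as input, and these experiments have mostly targeted RNN-based models. We use three different methods and their combinations to incorporate answer information and evaluate their effect on several automatic evaluation metrics. The methods are answer prompting, a custom product method using answer embeddings and encoder outputs, choosing sentences from the input paragraph that contain answer-related information, and a separate cross-attention block in the decoder that attends to the answer. We find that answer prompting without any additional modes obtains the best ROUGE and METEOR scores. In addition, we use a custom metric to calculate how many of the generated questions have the same answer as the answer used to generate them.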
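
Answer prompting and the answer-match metric can be sketched as follows. The input format, the normalisation, and the `qa_model` callable are all assumptions for illustration; the abstract does not specify the exact prompt layout or how answers are compared.

```python
def answer_prompt(answer, paragraph):
    """Prepend the target answer to the paragraph (assumed input format)."""
    return f"answer: {answer} context: {paragraph}"


def _normalise(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def answer_match_rate(examples, qa_model):
    """Fraction of generated questions whose answer matches the target answer.

    examples: list of (generated_question, paragraph, target_answer).
    qa_model: any callable (question, paragraph) -> predicted answer string.
    """
    if not examples:
        return 0.0
    hits = sum(
        _normalise(qa_model(question, paragraph)) == _normalise(target)
        for question, paragraph, target in examples
    )
    return hits / len(examples)


# Stand-in QA model backed by a lookup table, for illustration only.
canned = {"Who wrote Hamlet?": "Shakespeare", "Where is Paris?": "Italy"}
examples = [
    ("Who wrote Hamlet?", "Hamlet was written by Shakespeare.", "shakespeare"),
    ("Where is Paris?", "Paris is the capital of France.", "France"),
]
rate = answer_match_rate(examples, lambda q, p: canned[q])
```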

AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing Evaluation Suite

paper_url: http://arxiv.org/abs/2312.03480
repo_url: https://github.com/jgroschwitz/grapes
paper_authors: Jonas Groschwitz, Shay B. Cohen, Lucia Donatelli, Meaghan Fowlie
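for: The study develops the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for testing how well current Abstract Meaning Representation (AMR) parsers handle a range of linguistic phenomena.
methods: The authors build a challenge set with accompanying evaluation metrics, covering 36 categories, and use it to test current AMR parsers.
results: Current AMR parsers do well on some phenomena but still make frequent errors on others, particularly on node labels and graph structure.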

Abstract
We present the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for Abstract Meaning Representation (AMR) parsing with accompanying evaluation metrics. AMR parsers now obtain high scores on the standard AMR evaluation metric Smatch, close to or even above reported inter-annotator agreement. But that does not mean that AMR parsing is solved; in fact, human evaluation in previous work indicates that current parsers still quite frequently make errors on node labels or graph structure that substantially distort sentence meaning. Here, we provide an evaluation suite that tests AMR parsers on a range of phenomena of practical, technical, and linguistic interest. Our 36 categories range from seen and unseen labels, to structural generalization, to coreference. GrAPES reveals in depth the abilities and shortcomings of current AMR parsers.

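Summary
We present the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for Abstract Meaning Representation (AMR) parsing with accompanying evaluation metrics. AMR parsers now obtain high scores on the standard Smatch metric, close to or even above reported inter-annotator agreement. That does not mean AMR parsing is solved: human evaluation in previous work indicates that current parsers still quite frequently make errors on node labels or graph structure that substantially distort sentence meaning. Here we provide an evaluation suite that tests AMR parsers across 36 categories, ranging from seen and unseen labels, to structural generalization, to coreference. GrAPES reveals in depth the abilities and shortcomings of current AMR parsers.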

DBCopilot: Scaling Natural Language Querying to Massive Databases

paper_url: http://arxiv.org/abs/2312.03463
repo_url: https://github.com/tshu-w/dbcopilot
paper_authors: Tianshu Wang, Hongyu Lin, Xianpei Han, Le Sun, Xiaoyang Chen, Hao Wang, Zhenyu Zeng
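for: The paper addresses the scalability problems that existing text-to-SQL methods face with massive, dynamically changing databases.
methods: The paper uses a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, using a lightweight sequence-to-sequence neural network-based router to formulate database connections and navigate natural language questions through databases and tables.
results: Experiments show that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, and that its router can be learned and adapted over massive databases automatically.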

Abstract
Text-to-SQL simplifies database interactions by enabling non-experts to convert their natural language (NL) questions into Structured Query Language (SQL) queries. While recent advances in large language models (LLMs) have improved the zero-shot text-to-SQL paradigm, existing methods face scalability challenges when dealing with massive, dynamically changing databases. This paper introduces DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, leveraging a lightweight sequence-to-sequence neural network-based router to formulate database connections and navigate natural language questions through databases and tables. The routed schemas and questions are then fed into LLMs for efficient SQL generation. Furthermore, DBCopilot also introduced a reverse schema-to-question generation paradigm, which can learn and adapt the router over massive databases automatically without requiring manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, providing a significant advancement in handling large-scale schemas.

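Summary
Text-to-SQL simplifies database interactions by letting non-experts convert natural language (NL) questions into Structured Query Language (SQL) queries. Recent advances in large language models (LLMs) have improved the zero-shot text-to-SQL paradigm, but existing methods face scalability challenges with massive, dynamically changing databases. This paper introduces DBCopilot, a framework that addresses these challenges with a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, using a lightweight sequence-to-sequence neural network-based router to formulate database connections and navigate natural language questions through databases and tables. The routed schemas and questions are then fed into LLMs for efficient SQL generation. DBCopilot also introduces a reverse schema-to-question generation paradigm, which can learn and adapt the router over massive databases automatically, without manual intervention. Experimental results show that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks and a significant advance in handling large-scale schemas.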
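
The two-stage decoupling described in this entry (schema routing, then SQL generation) can be sketched as below. The router here is a plain token-overlap heuristic standing in for the paper's sequence-to-sequence router, the prompt layout is our own, and the catalog and function names are hypothetical; no LLM is actually called.

```python
import re


def route_schema(question, catalog):
    """Pick the database and tables that best match the question.

    catalog maps database -> {table: [columns]}. Token overlap is a
    stand-in for a learned router.
    """
    tokens = set(re.findall(r"[a-z0-9_]+", question.lower()))

    def table_score(table, columns):
        return len(tokens & ({table} | set(columns)))

    def db_score(db):
        return sum(table_score(t, c) for t, c in catalog[db].items())

    best_db = max(catalog, key=db_score)
    tables = [t for t, c in catalog[best_db].items() if table_score(t, c) > 0]
    return best_db, tables


def build_sql_prompt(question, database, tables, catalog):
    """Assemble the routed schema and the question into an LLM prompt."""
    lines = [f"Database: {database}"]
    for table in tables:
        lines.append(f"Table {table}({', '.join(catalog[database][table])})")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


catalog = {
    "sales": {
        "orders": ["order_id", "customer", "total"],
        "customers": ["customer", "city"],
    },
    "hr": {"employees": ["name", "salary"]},
}
question = "What is the total of all orders per customer?"
database, tables = route_schema(question, catalog)
prompt = build_sql_prompt(question, database, tables, catalog)
```

Routing first keeps the prompt small: only the selected tables are shown to the SQL generator, regardless of how many databases the catalog contains.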

Think from Words(TFW): Initiating Human-Like Cognition in Large Language Models Through Think from Words for Japanese Text-level Classification

paper_url: http://arxiv.org/abs/2312.03458
repo_url: None
paper_authors: Chengguang Gan, Qinghao Zhang, Tatsunori Mori
for: This paper aims to improve the text comprehension of Large Language Models (LLMs) by bridging the gap between LLM and human-like thinking processes, specifically in the domain of Japanese text.
methods: The paper proposes two methods, “Think from Words” (TFW) and “TFW with Extra word-level information” (TFW Extra), which initiate the comprehension process at the word level and incorporate additional word-level data to enhance LLMs’ text comprehension.
results: The paper employs text classification on six Japanese datasets to assess the effectiveness of TFW and investigate the impact of various word-level information types on LLMs’ text comprehension, providing insights into their potential to cause misinterpretations and errors in the overall comprehension of the final text.

Abstract
The proliferation of Large Language Models (LLMs) has spurred extensive research into LLM-related Prompt investigations, such as Instruction Learning (IL), In-context Learning (ICL), and Chain-of-Thought (CoT). These approaches aim to improve LLMs' responses by enabling them to provide concise statements or examples for deeper contemplation when addressing questions. However, independent thinking by LLMs can introduce variability in their thought processes, leading to potential inaccuracies. In response, our study seeks to bridge the gap between LLM and human-like thinking processes, recognizing that text comprehension begins with understanding individual words. To tackle this challenge, we have expanded the CoT method to cater to a specific domain. Our approach, known as "Think from Words" (TFW), initiates the comprehension process at the word level and then extends it to encompass the entire text. We also propose "TFW with Extra word-level information" (TFW Extra), augmenting comprehension with additional word-level data. To assess our methods, we employ text classification on six Japanese datasets comprising text-level and word-level elements. Our findings not only validate the effectiveness of TFW but also shed light on the impact of various word-level information types on LLMs' text comprehension, offering insights into their potential to cause misinterpretations and errors in the overall comprehension of the final text.

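Summary
The proliferation of large language models (LLMs) has driven research on LLM prompting, including Instruction Learning (IL), In-context Learning (ICL), and Chain-of-Thought (CoT). These methods aim to improve LLM responses by supplying concise statements or examples that prompt deeper reasoning. However, variability in an LLM's independent reasoning can lead to inaccurate answers. The study therefore seeks to bring LLM reasoning closer to human thinking, on the premise that text comprehension begins with understanding individual words. To this end it extends the CoT method with "Think from Words" (TFW), which starts comprehension at the word level and then extends it to the whole text. It also proposes "TFW with Extra word-level information" (TFW Extra), which strengthens comprehension with additional word-level data. Both methods are evaluated on text classification over six Japanese datasets containing text-level and word-level elements. The findings confirm the effectiveness of TFW and show how different types of word-level information affect LLMs' text comprehension, including how they can cause misinterpretations and errors.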

Comparative Analysis of Multilingual Text Classification & Identification through Deep Learning and Embedding Visualization

paper_url: http://arxiv.org/abs/2312.03789
repo_url: None
paper_authors: Arinjay Wyawhare
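for: This study compares multilingual text classification methods using deep learning and embedding visualization.
methods: The study tests LangDetect, LangId, FastText, and Sentence Transformer on a dataset covering 17 languages.
results: FastText shows clearer clustering in 2D visualization, and the FastText multi-layer perceptron model achieves strong accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model.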

Abstract
This research conducts a comparative study on multilingual text classification methods, utilizing deep learning and embedding visualization. The study employs LangDetect, LangId, FastText, and Sentence Transformer on a dataset encompassing 17 languages. It explores dimensionality's impact on clustering, revealing FastText's clearer clustering in 2D visualization due to its extensive multilingual corpus training. Notably, the FastText multi-layer perceptron model achieved remarkable accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model. The study underscores the effectiveness of these techniques in multilingual text classification, emphasizing the importance of large multilingual corpora for training embeddings. It lays the groundwork for future research and assists practitioners in developing language detection and classification systems. Additionally, it includes the comparison of multi-layer perceptron, LSTM, and Convolution models for classification.

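Summary
This research compares multilingual text classification methods using deep learning and embedding visualization. It tests LangDetect, LangId, FastText, and Sentence Transformer on a dataset covering 17 languages. In 2D visualization, FastText produces clearer clusters, which the study attributes to its training on an extensive multilingual corpus. The FastText multi-layer perceptron model also achieves high accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model. The study demonstrates the effectiveness of these techniques for multilingual text classification and stresses the importance of large multilingual corpora for training embeddings. It provides a basis for future research and helps practitioners build language detection and classification systems. It also compares multi-layer perceptron, LSTM, and convolutional models for classification.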

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

paper_url: http://arxiv.org/abs/2312.03788
repo_url: None
paper_authors: Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng
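for: This paper proposes an accurate and efficient 4-bit quantization method so that large language models (LLMs) can be deployed on hardware with limited compute and memory resources.
methods: The paper proposes SmoothQuant+, an accurate and efficient 4-bit weight-only post-training quantization method that requires no additional training and preserves LLM accuracy. SmoothQuant+ smooths activation outliers channel by channel and adjusts the corresponding weights so that the quantized model stays mathematically equivalent to the original before group-wise 4-bit weight quantization is applied.
results: With SmoothQuant+, the Code Llama-34B model can be quantized and deployed on one A100 40GB GPU without accuracy loss. Throughput increases by 1.9 to 4.0 times compared to the FP16 model deployed on two A100 40GB GPUs, and per-token latency is only 68% of that FP16 deployment. The authors describe this as the state-of-the-art 4-bit weight quantization for LLMs, to their knowledge.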

Abstract
Large language models (LLMs) have shown remarkable capabilities in various tasks. However, their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit post-training quantization (PTQ) has achieved some success in LLMs, reducing the memory footprint by approximately 75% compared to FP16 models, albeit with some accuracy loss. In this paper, we propose SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ method that requires no additional training and, for the first time, achieves lossless accuracy for LLMs. Because the loss from weight quantization is amplified by activation outliers, SmoothQuant+ smooths the activation outliers by channel before quantization, adjusts the corresponding weights to preserve mathematical equivalence, and then performs group-wise 4-bit weight quantization for linear layers. We have integrated SmoothQuant+ into the vLLM framework, an advanced high-throughput inference engine developed specifically for LLMs, and equipped it with efficient W4A16 CUDA kernels, so that vLLM can seamlessly support SmoothQuant+ 4-bit weight quantization. Our results show that, with SmoothQuant+, the Code Llama-34B model can be quantized and deployed on an A100 40GB GPU, achieving lossless accuracy and a throughput increase of 1.9 to 4.0 times compared to the FP16 model deployed on two A100 40GB GPUs. Moreover, the latency per token is only 68% of that of the FP16 model deployed on two A100 40GB GPUs. To our knowledge, this is the state-of-the-art 4-bit weight quantization for LLMs.

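Summary
Large language models (LLMs) perform well across many tasks, but their size and the resulting compute and memory requirements make deployment difficult. 4-bit post-training quantization (PTQ) has had some success with LLMs, cutting the memory footprint by roughly 75%, though with some loss of accuracy. This paper proposes SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ method that needs no additional training and achieves lossless accuracy for LLMs. Because activation outliers amplify the loss from weight quantization, SmoothQuant+ smooths the activation outliers channel by channel, adjusts the corresponding weights to keep the computation mathematically equivalent, and then applies group-wise 4-bit quantization to linear layers. The authors integrate SmoothQuant+ into the high-throughput vLLM framework with efficient W4A16 CUDA kernels, so that vLLM supports SmoothQuant+ 4-bit weight quantization seamlessly. With SmoothQuant+, Code Llama-34B can be quantized and deployed on one A100 40GB GPU with lossless accuracy, a throughput increase of 1.9 to 4.0 times, and per-token latency that is 68% of that of the FP16 model deployed on two A100 40GB GPUs. The authors describe this as the state-of-the-art 4-bit weight quantization for LLMs, to their knowledge.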
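
The smoothing step in the abstract rests on a simple identity: dividing activation channel j by a scale s_j and multiplying the matching weight row by s_j leaves the output of the linear layer unchanged. The "approximately 75%" memory figure likewise follows from storing 4 bits instead of 16 per weight, ignoring the per-group scale and offset. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the scale rule s_j = max|X_j|^alpha, the `alpha` value, the group size, and all function names are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j.

    x has shape (tokens, in_channels); w has shape (in_channels, out_channels).
    s_j = max|x[:, j]| ** alpha is an assumed choice of smoothing scale.
    """
    scales = []
    for j in range(len(w)):
        peak = max(abs(row[j]) for row in x)
        scales.append(max(peak, 1e-8) ** alpha)
    x_s = [[v / scales[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * scales[j] for v in row] for j, row in enumerate(w)]
    return x_s, w_s


def quantize_group(values):
    """Asymmetric 4-bit quantization of one group: integer codes in [0, 15]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 15 or 1.0  # fall back to 1.0 when the group is constant
    codes = [round((v - lo) / step) for v in values]
    return codes, step, lo


def dequantize_group(codes, step, lo):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * step + lo for c in codes]


def quantize_column_groupwise(column, group_size):
    """Quantize, then dequantize, a weight column one group at a time."""
    out = []
    for start in range(0, len(column), group_size):
        codes, step, lo = quantize_group(column[start:start + group_size])
        out.extend(dequantize_group(codes, step, lo))
    return out
```

Each group keeps its own `step` and `lo`, so a single outlier weight only coarsens the grid of its own group rather than the whole column.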

Compressed Context Memory For Online Language Model Interaction

paper_url: http://arxiv.org/abs/2312.03414
repo_url: https://github.com/snu-mllab/context-memory
paper_authors: Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song
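for: This study proposes a context compression method for Transformer language models in online scenarios such as ChatGPT. As the context grows, attention requires more memory and computation, which lowers the language model's throughput.
methods: The authors propose a compressed context memory system that compresses the growing context by integrating a lightweight conditional LoRA into the language model's forward pass. With the compressed context memory, the language model performs inference with reduced memory and attention operations.
results: Through evaluations on conversation, personalization, and multi-task learning, the authors demonstrate that their approach reaches the performance level of a full-context model with $5\times$ smaller context memory space. Code is available at https://github.com/snu-mllab/context-memory.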

Abstract
This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process simply involves integrating a lightweight conditional LoRA into the language model's forward pass during inference. Based on the compressed context memory, the language model can perform inference with reduced memory and attention operations. Through evaluations on conversation, personalization, and multi-task learning, we demonstrate that our approach achieves the performance level of a full context model with $5\times$ smaller context memory space. Codes are available at https://github.com/snu-mllab/context-memory.

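Summary
This paper presents a context compression method for Transformer language models in online scenarios such as ChatGPT, where the context keeps expanding. As the context grows, attention requires more memory and computation, which lowers the language model's throughput. To address this, the authors propose a compressed context memory system that compresses the context by inserting a lightweight conditional LoRA into the language model's forward pass. With the compressed context memory, the language model performs inference with reduced memory and attention operations. Evaluations on conversation, personalization, and multi-task learning show that the approach matches the performance level of a full-context model with 5 times smaller context memory space. Code is available at https://github.com/snu-mllab/context-memory.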

A Text-to-Text Model for Multilingual Offensive Language Identification

paper_url: http://arxiv.org/abs/2312.03379
repo_url: None
paper_authors: Tharindu Ranasinghe, Marcos Zampieri
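for: This study develops a transformer-based language model for identifying offensive content on social media (e.g. hate speech, cyberbullying, and cyberaggression).
methods: The study uses text-to-text transformers (T5), trained on two large offensive language identification datasets, SOLID and CCTK. It investigates the effect of combining the two datasets and of selecting an optimal threshold for the semi-supervised instances in SOLID.
results: The pre-trained T5 model outperforms other transformer-based models such as fBERT and HateBERT on multiple English benchmarks. The authors also train the first multilingual pre-trained model for offensive language identification and evaluate it on six languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish), reaching a new state of the art on those datasets.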

Abstract
The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g. hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with encoder-decoder architecture for offensive language identification with text-to-text transformers (T5) trained on two large offensive language identification datasets; SOLID and CCTK. We investigate the effectiveness of combining two datasets and selecting an optimal threshold in semi-supervised instances in SOLID in the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state-of-the-art on all the above datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.

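Summary
Offensive content on social media is a growing concern for companies and government organizations. Recently, transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance in detecting offensive content. However, most of these models are limited by their encoder-only architecture, which restricts the number and types of labels in downstream tasks. To address these limitations, this study presents the first pre-trained model for offensive language identification built on text-to-text transformers (T5). The authors train T5 on two large offensive language identification datasets (SOLID and CCTK) and investigate the effect of selecting an optimal threshold for the semi-supervised instances. The pre-trained T5 model outperforms other transformer-based models, such as fBERT and HateBERT, on multiple English benchmarks. Following a similar approach, the authors also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results show that this multilingual model achieves a new state of the art on all of these datasets, demonstrating its usefulness in multilingual scenarios. The proposed T5-based models will be made freely available to the community.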

Lazy-k: Decoding for Constrained Token Classification

paper_url: http://arxiv.org/abs/2312.03367
repo_url: https://github.com/arthurdevnl/lazyk
paper_authors: Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jérôme Brachat, Jean-Marc Ogier
for: The paper aims to improve the performance of probabilistic models in structured prediction.
methods: The paper combines probabilistic models with constrained decoding approaches, specifically in the context of token classification for information extraction.
results: The paper shows that constrained decoding approaches can significantly improve the models' performances, especially when using smaller models. Additionally, the Lazy-$k$ approach proposed in the paper allows for more flexibility between decoding time and accuracy.

Abstract
We explore the possibility of improving probabilistic models in structured prediction. Specifically, we combine the models with constrained decoding approaches in the context of token classification for information extraction. The decoding methods search for constraint-satisfying label-assignments while maximizing the total probability. To do this, we evaluate several existing approaches, as well as propose a novel decoding method called Lazy-$k$. Our findings demonstrate that constrained decoding approaches can significantly improve the models' performances, especially when using smaller models. The Lazy-$k$ approach allows for more flexibility between decoding time and accuracy. The code for using Lazy-$k$ decoding can be found here: https://github.com/ArthurDevNL/lazyk.

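Summary
The authors explore how to improve probabilistic models in structured prediction. Specifically, they combine the models with constrained decoding methods for token classification in information extraction. The decoding methods search for label assignments that satisfy the constraints while maximizing the total probability. The authors evaluate several existing approaches and propose a new decoding method called Lazy-$k$. The findings show that constrained decoding can clearly improve model performance, especially with smaller models. The Lazy-$k$ approach allows more flexibility between decoding time and accuracy. Code for Lazy-$k$ decoding is available at https://github.com/ArthurDevNL/lazyk.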
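
The search problem described above (find the most probable label assignment that satisfies a constraint) can be illustrated with a generic best-first enumeration. The sketch below is not taken from the Lazy-$k$ repository. It assumes independent, strictly positive per-token label probabilities, and the names `constrained_decode`, `is_valid`, and `max_expansions` are hypothetical. Assignments are visited in order of decreasing joint probability, and the first one that passes the constraint check is returned.

```python
import heapq
import math


def constrained_decode(probs, is_valid, max_expansions=10_000):
    """Return the most probable label sequence accepted by is_valid.

    probs: one dict per token mapping label -> probability (assumed > 0).
    is_valid: callable taking a list of labels and returning True or False.
    max_expansions: search budget (an assumed knob trading time for accuracy).
    """
    # Rank each token's labels from most to least probable.
    ranked = [sorted(p.items(), key=lambda kv: kv[1], reverse=True) for p in probs]

    def score(state):
        # Log-probability of the assignment picked out by the rank indices.
        return sum(math.log(ranked[i][r][1]) for i, r in enumerate(state))

    start = (0,) * len(ranked)
    heap = [(-score(start), start)]
    seen = {start}
    for _ in range(max_expansions):
        if not heap:
            break
        neg_score, state = heapq.heappop(heap)
        labels = [ranked[i][r][0] for i, r in enumerate(state)]
        if is_valid(labels):
            return labels, math.exp(-neg_score)
        # Successors: move exactly one token to its next-best label.
        for i, r in enumerate(state):
            if r + 1 < len(ranked[i]):
                nxt = state[:i] + (r + 1,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))
    return None, 0.0
```

Because every successor has a probability no higher than its parent, the heap yields assignments in non-increasing probability order, so the first valid assignment found is the best valid one within the budget.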

KhabarChin: Automatic Detection of Important News in the Persian Language

paper_url: http://arxiv.org/abs/2312.03361
repo_url: None
paper_authors: Hamed Hematian Hemati, Arash Lagzian, Moein Salimi Sartakhti, Hamid Beigy, Ehsaneddin Asgari
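for: This study addresses the detection of important news, to help a large share of society stay informed and make decisions efficiently.
methods: The study uses natural language processing (NLP) methods to automate news detection. It introduces a new benchmark dataset (Khabarchin) for detecting important news in Persian.
results: The authors annotate 7,869 Persian news articles to create the dataset. They face two challenges, high annotator disagreement and class imbalance, and provide solutions for both. They propose several learning-based models, from conventional machine learning to state-of-the-art transformer models, for this task. They also introduce a second task, detecting important sentences within news articles, to handle important information in long texts.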

Abstract
Being aware of important news is crucial for staying informed and making well-informed decisions efficiently. Natural Language Processing (NLP) approaches can significantly automate this process. This paper introduces the detection of important news, in a previously unexplored area, and presents a new benchmarking dataset (Khabarchin) for detecting important news in the Persian language. We define important news articles as those deemed significant for a considerable portion of society, capable of influencing their mindset or decision-making. The news articles are obtained from seven different prominent Persian news agencies, resulting in the annotation of 7,869 samples and the creation of the dataset. Two challenges of high disagreement and imbalance between classes were faced, and solutions were provided for them. We also propose several learning-based models, ranging from conventional machine learning to state-of-the-art transformer models, to tackle this task. Furthermore, we introduce the second task of important sentence detection in news articles, as they often come with a significant contextual length that makes it challenging for readers to identify important information. We identify these sentences in a weakly supervised manner.

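Summary
Being aware of important news is crucial for getting information quickly and making informed decisions. Natural language processing (NLP) methods can help automate this process. This paper introduces the detection of important news, a previously unexplored area. The authors define important news articles as those that matter to a considerable portion of society and can influence its mindset or decision-making. The articles come from seven prominent Persian news agencies, giving 7,869 annotated samples that form the dataset. Two challenges, high annotator disagreement and class imbalance, were encountered and addressed. The authors also propose several learning-based models, from conventional machine learning to state-of-the-art transformer models, for this task. In addition, they introduce a second task, detecting important sentences in news articles, because articles often have a long context that makes it hard for readers to find the important information. These sentences are identified in a weakly supervised manner.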

Topic and genre in dialogue

paper_url: http://arxiv.org/abs/2312.03342
repo_url: None
paper_authors: Amandine Decker, Ellen Breitholtz, Christine Howes, Staffan Larsson
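for: This paper argues that topic plays a fundamental role in conversation and that topic and genre need to be separated and orthogonally defined in order to build reliable, controllable, and customizable dialogue systems.
methods: The summary describes the work as drawing on topic analysis and classification together with dialogue analysis and modeling.
results: The paper argues that defining topic and genre separately would make dialogue systems modular, reliable, and controllable.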

Abstract
In this paper we argue that topic plays a fundamental role in conversations, and that the concept is needed in addition to that of genre to define interactions. In particular, the concepts of genre and topic need to be separated and orthogonally defined. This would enable modular, reliable and controllable flexible-domain dialogue systems.

Summary
In this paper the authors argue that topic plays a fundamental role in conversation and that the concept needs to be separated from genre and defined orthogonally to it. This would enable modular, reliable, and controllable flexible-domain dialogue systems.

Measuring Misogyny in Natural Language Generation: Preliminary Results from a Case Study on two Reddit Communities

paper_url: http://arxiv.org/abs/2312.03330
repo_url: None
paper_authors: Aaron J. Snoswell, Lucinda Nelson, Hao Xue, Flora D. Salim, Nicolas Suzor, Jean Burgess
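for: This paper considers how to measure misogyny in natural language generation, and in particular the shortcomings of generic 'toxicity' classifiers for identifying misogynistic language.
methods: The authors use data from two well-characterised 'Incel' communities on Reddit to construct two training corpora, which they use to fine-tune two language models. They then apply an open source 'toxicity' classifier to generations from the two models.
results: The generic 'toxicity' classifier cannot distinguish meaningfully between generations from the two models, whereas a misogyny-specific lexicon proposed by feminist subject-matter experts is sensitive enough to reveal the known differences between the two communities. These preliminary results highlight the limitations of a generic approach to evaluating harms and the need for careful benchmark design and selection in natural language evaluation.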

Abstract
Generic 'toxicity' classifiers continue to be used for evaluating the potential for harm in natural language generation, despite mounting evidence of their shortcomings. We consider the challenge of measuring misogyny in natural language generation, and argue that generic 'toxicity' classifiers are inadequate for this task. We use data from two well-characterised 'Incel' communities on Reddit that differ primarily in their degrees of misogyny to construct a pair of training corpora which we use to fine-tune two language models. We show that an open source 'toxicity' classifier is unable to distinguish meaningfully between generations from these models. We contrast this with a misogyny-specific lexicon recently proposed by feminist subject-matter experts, demonstrating that, despite the limitations of simple lexicon-based approaches, this shows promise as a benchmark to evaluate language models for misogyny, and that it is sensitive enough to reveal the known differences in these Reddit communities. Our preliminary findings highlight the limitations of a generic approach to evaluating harms, and further emphasise the need for careful benchmark design and selection in natural language evaluation.

Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation

paper_url: http://arxiv.org/abs/2312.03312
repo_url: None
paper_authors: Wonjun Lee, Gary Geunbae Lee, Yunsu Kim
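for: This research aims to improve two-pass cross-lingual transfer learning in order to raise speech recognition accuracy in low-resource languages.
methods: The research optimizes two stages: the phoneme recognition model and the phoneme-to-grapheme translation model. It also introduces a global phoneme noise generator that simulates realistic ASR noise during phoneme-to-grapheme training, reducing error propagation.
results: Experiments show significant reductions in Word Error Rate (WER) for low-resource languages, indicating that the approach is effective. The work may advance two-pass ASR systems and offers the potential for improved cross-lingual transfer learning.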

Abstract
This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. Our approach optimizes these two stages to improve speech recognition across languages. We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics, thus improving recognition accuracy. Additionally, we introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation. Experiments on the CommonVoice 12.0 dataset show significant reductions in Word Error Rate (WER) for low-resource languages, highlighting the effectiveness of our approach. This research contributes to the advancements of two-pass ASR systems in low-resource languages, offering the potential for improved cross-lingual transfer learning.

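Summary
This research optimizes two-pass cross-lingual transfer learning to improve speech recognition accuracy. The approach optimizes both stages, phoneme recognition and phoneme-to-grapheme translation. Phoneme vocabulary coverage is optimized by merging phonemes that share articulatory characteristics, which improves recognition accuracy. In addition, a global phoneme noise generator supplies realistic ASR noise during phoneme-to-grapheme training to reduce error propagation. Experiments on the CommonVoice 12.0 dataset show significant reductions in Word Error Rate (WER) for low-resource languages, highlighting the effectiveness of the approach. The research contributes to the advancement of two-pass ASR systems in low-resource languages and offers the potential for improved cross-lingual transfer learning.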
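
The idea of corrupting clean phoneme sequences so that the phoneme-to-grapheme model sees recognizer-like errors during training can be sketched as follows. The abstract does not describe how the authors' global phoneme noise generator works, so everything here is an assumption for illustration: the function name `add_phoneme_noise`, the uniform choice among substitution, deletion, and insertion, and the single `rate` parameter.

```python
import random


def add_phoneme_noise(phonemes, inventory, rate=0.1, rng=None):
    """Return a copy of `phonemes` with random ASR-style edit errors.

    inventory: phoneme symbols that may be substituted or inserted.
    rate: probability that any given phoneme is corrupted (assumed knob).
    rng: optional random.Random instance for reproducible noise.
    """
    rng = rng if rng is not None else random.Random()
    noisy = []
    for ph in phonemes:
        if rng.random() >= rate:
            noisy.append(ph)  # keep the phoneme unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            noisy.append(rng.choice(inventory))
        elif op == "insert":
            noisy.extend([ph, rng.choice(inventory)])
        # "delete": emit nothing for this phoneme
    return noisy
```

A more realistic generator would sample substitutions from a confusion matrix estimated on actual recognizer output rather than uniformly from the inventory.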

Rethinking E-Commerce Search

paper_url: http://arxiv.org/abs/2312.03217
repo_url: https://github.com/jacklinedesouza/STRATEGIES-OF-DIGITAL-MARKETING-AND-CONTENT-MARKETING
paper_authors: Haixun Wang, Taesik Na
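for: This paper proposes a new approach to e-commerce search and recommendation that makes better use of unstructured data such as customer reviews and web articles.
methods: The paper proposes converting structured data into textual data and then using natural language processing (large language models) for search and recommendation.
results: The summary states that the approach makes better use of unstructured data and could improve the accuracy and effectiveness of e-commerce search and recommendation; the abstract presents this as a vision rather than a measured result.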

Abstract
E-commerce search and recommendation usually operate on structured data such as product catalogs and taxonomies. However, creating better search and recommendation systems often requires a large variety of unstructured data including customer reviews and articles on the web. Traditionally, the solution has always been converting unstructured data into structured data through information extraction, and conducting search over the structured data. However, this is a costly approach that often has low quality. In this paper, we envision a solution that does entirely the opposite. Instead of converting unstructured data (web pages, customer reviews, etc) to structured data, we instead convert structured data (product inventory, catalogs, taxonomies, etc) into textual data, which can be easily integrated into the text corpus that trains LLMs. Then, search and recommendation can be performed through a Q/A mechanism through an LLM instead of using traditional information retrieval methods over structured data.

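Summary
E-commerce search and recommendation usually operate on structured data such as product catalogs and taxonomies. However, building better search and recommendation systems often requires a large amount of unstructured data, including customer reviews and articles on the web. Traditionally, the solution has been to convert unstructured data into structured data through information extraction and then search over the structured data. This approach is usually costly and of low quality. The paper envisions the opposite solution: converting structured data (product inventory, catalogs, taxonomies, etc.) into textual data that can easily be integrated into the text corpus used to train large language models (LLMs). Search and recommendation can then be performed through a Q/A mechanism with an LLM instead of traditional information retrieval methods.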
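
The core move, turning a structured catalog entry into plain text that can sit in a text corpus, can be sketched in a few lines. The abstract does not specify a serialization format, so the function `record_to_text`, the field names, and the output layout below are hypothetical.

```python
def record_to_text(record):
    """Flatten one catalog record (a dict) into a single descriptive sentence.

    The "name" field leads the sentence; other non-empty fields follow as
    "field: value" pairs, with underscores in field names shown as spaces.
    """
    name = record.get("name", "This product")
    parts = []
    for key, value in record.items():
        if key == "name" or value in (None, "", []):
            continue  # skip the title field and empty attributes
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{key.replace('_', ' ')}: {value}")
    return f"{name}. " + "; ".join(parts) + "." if parts else f"{name}."


example = {
    "name": "Trail Mug",
    "category": ["Kitchen", "Drinkware"],
    "price_usd": 12.5,
    "color": "",
}
print(record_to_text(example))
```

Keeping the field names in the text preserves the attribute semantics that a taxonomy would otherwise carry, at the cost of less natural prose.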

Detecting Rumor Veracity with Only Textual Information by Double-Channel Structure

paper_url: http://arxiv.org/abs/2312.03195
repo_url: None
paper_authors: Alex Kim, Sangwon Yoon
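for: This paper proposes a double-channel structure for determining the ex-ante veracity of rumors on social media.
methods: The paper uses two algorithms: a lie detection algorithm for informed rumors and a thread-reply agreement detection algorithm for uninformed rumors.
results: On the SemEval 2019 Task 7 dataset, which requires ex-ante threefold classification (true, false, or unverifiable) of social media rumors, the model obtains a macro-F1 score of 0.4027, outperforming all baseline models and the second-place winner (Gorrell et al., 2019). The paper also shows that the double-channel structure outperforms single-channel structures that apply either lie detection or agreement detection to all posts.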

Abstract
Kyle (1985) proposes two types of rumors: informed rumors, which are based on some private information, and uninformed rumors, which are not based on any information (i.e. bluffing). Also, prior studies find that when people have a credible source of information, they are likely to use a more confident textual tone when spreading rumors. Motivated by these theoretical findings, we propose a double-channel structure to determine the ex-ante veracity of rumors on social media. Our ultimate goal is to classify each rumor into the true, false, or unverifiable category. We first assign each text to either the certain (informed rumor) or uncertain (uninformed rumor) category. Then, we apply a lie detection algorithm to informed rumors and a thread-reply agreement detection algorithm to uninformed rumors. Using the dataset of SemEval 2019 Task 7, which requires ex-ante threefold classification (true, false, or unverifiable) of social media rumors, our model yields a macro-F1 score of 0.4027, outperforming all the baseline models and the second-place winner (Gorrell et al., 2019). Furthermore, we empirically validate that the double-channel structure outperforms single-channel structures which apply either the lie detection or the agreement detection algorithm to all posts.

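Summary
Kyle (1985) proposes two types of rumors: informed rumors and uninformed rumors (i.e. bluffing). Prior studies also find that people with a credible source of information are more likely to use a confident textual tone when spreading rumors. Based on these theoretical findings, the authors propose a double-channel structure to determine the ex-ante veracity of rumors on social media. Each text is first assigned to the certain (informed rumor) or uncertain (uninformed rumor) category. A lie detection algorithm is then applied to informed rumors and a thread-reply agreement detection algorithm to uninformed rumors. On the SemEval 2019 Task 7 dataset, the model obtains a macro-F1 score of 0.4027 on the ex-ante threefold classification task (true, false, or unverifiable), outperforming all baseline models and the second-place winner (Gorrell et al., 2019). The authors also validate empirically that the double-channel structure outperforms single-channel structures that apply either lie detection or thread-reply agreement detection to all posts.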
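
The routing logic of the double-channel structure and the macro-F1 metric used to report the result can be sketched as follows. The three classifiers are passed in as plain callables because the abstract does not describe how they are built; the names `classify_rumor`, `is_certain`, `lie_detector`, and `agreement_detector` are hypothetical. The `macro_f1` function follows the standard definition, the unweighted mean of per-class F1 scores.

```python
def classify_rumor(text, is_certain, lie_detector, agreement_detector):
    """Route a post to one of two channels and return that channel's label.

    is_certain: callable returning True for confident (informed) posts.
    lie_detector / agreement_detector: callables returning a veracity label.
    """
    channel = lie_detector if is_certain(text) else agreement_detector
    return channel(text)


def macro_f1(gold, predicted, labels=("true", "false", "unverifiable")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)
```

Because macro-F1 weights the three classes equally, a model that ignores a rare class is penalized heavily, which is one reason scores on this kind of task can look low in absolute terms.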

Corporate Bankruptcy Prediction with Domain-Adapted BERT

paper_url: http://arxiv.org/abs/2312.03194
repo_url: None
paper_authors: Alex Kim, Sangwon Yoon
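for: This study applies a BERT model to corporate disclosure data to predict impending bankruptcies. Prior research mainly focuses on developing more sophisticated prediction methods with financial variables; this study instead focuses on improving the quality of the input data.
methods: The authors use BERT to perform sentiment analysis on the text of MD&A disclosures. They find that BERT outperforms dictionary-based and Word2Vec-based predictions in terms of adjusted R-square in logistic regression, k-nearest neighbor (kNN-5), and linear-kernel support vector machine (SVM).
results: Self-learning with confidence-based filtering is used to adapt the model to 10-K corporate disclosure data. The model reaches an accuracy of 91.56%, and the domain adaptation procedure is shown to improve prediction accuracy significantly.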

Abstract
This study applies BERT, a representative contextualized language model, to corporate disclosure data to predict impending bankruptcies. Prior literature on bankruptcy prediction mainly focuses on developing more sophisticated prediction methodologies with financial variables. However, in our study, we focus on improving the quality of the input dataset. Specifically, we employ the BERT model to perform sentiment analysis on MD&A disclosures. We show that BERT outperforms dictionary-based predictions and Word2Vec-based predictions in terms of adjusted R-square in logistic regression, k-nearest neighbor (kNN-5), and linear kernel support vector machine (SVM). Further, instead of pre-training the BERT model from scratch, we apply self-learning with confidence-based filtering to corporate disclosure data (10-K). We achieve an accuracy rate of 91.56% and demonstrate that the domain adaptation procedure brings a significant improvement in prediction accuracy.

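Summary
This study uses BERT, a representative language model, to predict corporate bankruptcy. Prior literature on bankruptcy prediction mainly focuses on developing more sophisticated prediction methodologies, whereas this study focuses on improving the quality of the input data. Specifically, the authors use BERT for sentiment analysis and show that it outperforms dictionary-based and Word2Vec-based predictions. They also apply self-learning with confidence-based filtering to 10-K report data, reaching a prediction accuracy of 91.56%. The results indicate that the domain adaptation procedure brings a significant improvement.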
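
Self-learning with confidence-based filtering generally means labeling unlabeled text with the current model and keeping only the predictions the model is confident about as extra training data. The sketch below shows one such filtering round under stated assumptions: the classifier is any callable returning a label-to-probability dict, and the name `self_learning_round` and the 0.9 threshold are hypothetical, since the abstract does not give the authors' settings.

```python
def self_learning_round(predict_proba, unlabeled_texts, threshold=0.9):
    """Return (text, pseudo_label) pairs the model is confident about.

    predict_proba: callable mapping a text to a dict of label -> probability.
    threshold: minimum top-class probability for a pseudo-label to be kept.
    """
    kept = []
    for text in unlabeled_texts:
        probs = predict_proba(text)
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            kept.append((text, label))
    return kept
```

In a full loop, the kept pairs would be added to the training set, the model fine-tuned again, and the round repeated until few new examples pass the filter.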