cs.CL - 2023-07-02

SSP: Self-Supervised Post-training for Conversational Search

paper_url: http://arxiv.org/abs/2307.00569
repo_url: https://github.com/morecry/ssp
paper_authors: Quan Tu, Shen Gao, Xiaolong Wu, Zhao Cao, Ji-Rong Wen, Rui Yan
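for: Improving the understanding of dialogue structure and contextual semantics in conversational search.
methods: Proposes three self-supervised tasks for post-training conversational search models.
results: Improves several existing conversational search methods on two benchmark datasets, CAsT-19 and CAsT-20.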

Abstract
Conversational search has been regarded as the next-generation search paradigm. Constrained by data scarcity, most existing methods distill a well-trained ad-hoc retriever into a conversational retriever. However, these methods, which usually initialize parameters by query reformulation to discover contextualized dependencies, have trouble understanding dialogue structure information and struggle with contextual semantic vanishing. In this paper, we propose Self-Supervised Post-training (SSP), a new post-training paradigm with three self-supervised tasks that efficiently initializes the conversational search model to enhance its understanding of dialogue structure and contextual semantics. Furthermore, SSP can be plugged into most existing conversational models to boost their performance. To verify the effectiveness of our proposed method, we apply the conversational encoder post-trained by SSP to the conversational search task on two benchmark datasets: CAsT-19 and CAsT-20. Extensive experiments show that SSP can boost the performance of several existing conversational search methods. Our source code is available at https://github.com/morecry/SSP.

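Summary
Conversational search is regarded as the next-generation search paradigm. Because training data is scarce, most existing methods distill a well-trained ad-hoc retriever into a conversational retriever. These methods typically initialize parameters through query reformulation to capture contextualized dependencies, but they have trouble understanding dialogue structure and suffer from contextual semantic vanishing. The paper proposes Self-Supervised Post-training (SSP), a post-training paradigm with three self-supervised tasks that efficiently initializes the conversational search model and strengthens its understanding of dialogue structure and contextual semantics. SSP can be plugged into most existing conversational models. Applied to conversational encoders on CAsT-19 and CAsT-20, it improves the performance of several existing conversational search methods. The source code is available at https://github.com/morecry/SSP.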

TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition

paper_url: http://arxiv.org/abs/2307.00526
repo_url: None
paper_authors: Mingxue Xu, Yao Lei Xu, Danilo P. Mandic
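for: Addresses the parameter and storage cost of high-dimensional token embeddings in Large Language Models (LLMs).
methods: Proposes an approach based on the Tensor-Train Decomposition (TTD) that treats each token embedding as a Matrix Product State (MPS), which can be computed efficiently in a distributed manner.
results: On GPT-2, the embedding layer can be compressed by a factor of up to 38.40; at a compression factor of 3.31, the compressed model even outperforms the original GPT-2.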

Abstract
High-dimensional token embeddings underpin Large Language Models (LLMs), as they can capture subtle semantic information and significantly enhance the modelling of complex language patterns. However, the associated high dimensionality also introduces a considerable number of model parameters and prohibitively high storage requirements. To address this issue, this work proposes an approach based on the Tensor-Train Decomposition (TTD), where each token embedding is treated as a Matrix Product State (MPS) that can be efficiently computed in a distributed manner. The experimental results on GPT-2 demonstrate that, through our approach, the embedding layer can be compressed by a factor of up to 38.40, and that at a compression factor of 3.31 the compressed model even performs better than the original GPT-2 model.

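Summary
High-dimensional token embeddings play an important role in large language models (LLMs) because they capture subtle semantic information and help model complex language patterns. The high dimensionality, however, also brings a large number of model parameters and heavy storage requirements. To address this, the work proposes an approach based on the Tensor-Train Decomposition (TTD) in which each token embedding is treated as a Matrix Product State (MPS) that can be computed efficiently in a distributed manner. Experiments on GPT-2 show that the embedding layer can be compressed by a factor of up to 38.40, and that at a compression factor of 3.31 the compressed model even outperforms the original GPT-2.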
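
To make the compression factor concrete, the following sketch counts the parameters of a tensor-train (MPS) representation of a single embedding vector. The reshaping of a 768-dimensional vector into 8 x 8 x 12 and the ranks (2, 2) are hypothetical choices for illustration, not the configuration reported in the paper; only the standard core shape (r_{k-1}, n_k, r_k) is assumed.

```python
from math import prod


def tt_param_count(mode_sizes, ranks):
    """Number of parameters in a tensor-train (MPS) form of one vector.

    mode_sizes: factors n_1..n_d whose product is the vector length.
    ranks: internal TT ranks r_1..r_{d-1}; the boundary ranks are 1.
    """
    if len(ranks) != len(mode_sizes) - 1:
        raise ValueError("need exactly len(mode_sizes) - 1 internal ranks")
    full_ranks = [1, *ranks, 1]
    # Core k has shape (r_{k-1}, n_k, r_k).
    return sum(
        full_ranks[k] * n * full_ranks[k + 1]
        for k, n in enumerate(mode_sizes)
    )


def compression_factor(mode_sizes, ranks):
    """Dense vector length divided by the TT parameter count."""
    return prod(mode_sizes) / tt_param_count(mode_sizes, ranks)


# Hypothetical setting: 768 = 8 * 8 * 12 with internal ranks (2, 2).
mode_sizes = [8, 8, 12]
ranks = [2, 2]
print(tt_param_count(mode_sizes, ranks))                 # 72
print(round(compression_factor(mode_sizes, ranks), 2))   # 10.67
```

Raising the ranks lowers the compression factor and generally lets the cores approximate the original vector more closely, which is the trade-off behind the two operating points quoted in the abstract.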

Large Language Models Enable Few-Shot Clustering

paper_url: http://arxiv.org/abs/2307.00524
repo_url: None
paper_authors: Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, Graham Neubig
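for: Query-efficient, few-shot semi-supervised text clustering, using a large language model (LLM) to amplify an expert's guidance.
methods: Explores three stages at which an LLM can be incorporated into clustering: before clustering (improving input features), during clustering (providing constraints to the clusterer), and after clustering (LLM post-correction).
results: Incorporating LLMs in the first two stages routinely yields significant improvements in cluster quality, and LLMs let users trade off cost against accuracy to produce the desired clusters.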

Abstract
Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs for post-correction). We find that incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.

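Summary
Unlike traditional unsupervised clustering, semi-supervised clustering lets users give the data meaningful structure, which helps the clustering algorithm match the user's intent. Existing semi-supervised approaches need a significant amount of expert feedback to improve the clusters. The paper asks whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering, and finds that LLMs are surprisingly effective at improving clustering. It explores three stages at which LLMs can be used: before clustering (improving input features), during clustering (providing constraints to the clusterer), and after clustering (LLM post-correction). Incorporating LLMs in the first two stages routinely yields significant improvements in cluster quality, and LLMs let a user trade off cost against accuracy to produce the desired clusters. The code and LLM prompts are released for public use.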
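
The "during clustering" stage can be illustrated with pairwise constraints. The sketch below is a simplified illustration rather than the paper's algorithm: `llm_same_cluster` is a hypothetical, offline stand-in for an LLM judgement, and must-link answers are merged with a union-find structure while cannot-link answers are only recorded.

```python
def llm_same_cluster(text_a, text_b):
    """Hypothetical stand-in for an LLM judgement on a pair of texts.

    A real system would prompt an LLM. Here two texts count as belonging
    together when their first words match, so the example runs offline.
    """
    return text_a.split()[0].lower() == text_b.split()[0].lower()


def find(parent, i):
    # Follow parent pointers to the root, halving the path on the way.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_with_constraints(texts, pairs, oracle):
    """Merge items joined by must-link answers; record cannot-link pairs."""
    parent = list(range(len(texts)))
    cannot_link = []
    for i, j in pairs:
        if oracle(texts[i], texts[j]):
            parent[find(parent, i)] = find(parent, j)
        else:
            cannot_link.append((i, j))
    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values()), cannot_link


texts = [
    "Python list sorting",
    "python dict lookup",
    "Rust borrow checker",
    "rust lifetimes",
]
# A small query budget: only three pairs are sent to the oracle.
pairs = [(0, 1), (1, 2), (2, 3)]
groups, cannot_link = cluster_with_constraints(texts, pairs, llm_same_cluster)
print(groups)        # [[0, 1], [2, 3]]
print(cannot_link)   # [(1, 2)]
```

The number of pairs sent to the oracle is the cost knob: more pairs mean more LLM calls and more constraints for the clusterer to respect.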

Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data

paper_url: http://arxiv.org/abs/2307.00456
repo_url: https://github.com/xinzhel/unlearnable_texts
paper_authors: Xinzhe Li, Ming Liu, Shang Gao
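for: Addresses the ethical concerns raised by the use of unauthorized public data in deep learning models.
methods: Extends the bi-level optimization approach of Huang et al. (2021) to generate unlearnable text with a gradient-based search technique. That approach has practical limitations: it requires batches of instances and knowledge of the model architecture, which ordinary users with limited access to their own data do not have, and unlearnable noise can alter the text's semantics even under semantic-preserving constraints. The authors therefore extract simple patterns from the generated unlearnable text.
results: The extracted patterns keep the data unlearnable for unknown models. They are not instance- or dataset-specific, so users can readily apply them to text classification and question-answering tasks. Code for generating unlearnable text and for assessing unlearnable noise is open-sourced.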

Abstract
This paper addresses the ethical concerns arising from the use of unauthorized public data in deep learning models and proposes a novel solution. Specifically, building on the work of Huang et al. (2021), we extend their bi-level optimization approach to generate unlearnable text using a gradient-based search technique. However, although effective, this approach faces practical limitations, including the requirement of batches of instances and model architecture knowledge that is not readily accessible to ordinary users with limited access to their own data. Furthermore, even with semantic-preserving constraints, unlearnable noise can alter the text's semantics. To address these challenges, we extract simple patterns from unlearnable text produced by bi-level optimization and demonstrate that the data remains unlearnable for unknown models. Additionally, these patterns are not instance- or dataset-specific, allowing users to readily apply them to text classification and question-answering tasks, even if only a small proportion of users implement them on their public content. We also open-source code to generate unlearnable text and assess unlearnable noise to benefit the public and future studies.

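Summary
The paper addresses the ethical concerns raised by using unauthorized public data in deep learning models. It extends the bi-level optimization approach of Huang et al. (2021) to generate unlearnable text with a gradient-based search, then notes the practical limits of that approach: it needs batches of instances and model architecture knowledge, and the noise can alter the text's semantics. To address these challenges, the authors extract simple patterns from unlearnable text produced by bi-level optimization and show that the data remains unlearnable for unknown models. The patterns are not instance- or dataset-specific, so users can readily apply them to text classification and question-answering tasks, even if only a small proportion of users implement them on their public content. Code to generate unlearnable text and assess unlearnable noise is open-sourced.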

Don’t Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

paper_url: http://arxiv.org/abs/2307.00453
repo_url: None
paper_authors: Anshu Bhatia, Sanchit Sinha, Saket Dingliwal, Karthik Gopalakrishnan, Sravan Bodapati, Katrin Kirchhoff
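for: Adapting self-supervised speech representations to atypical, non-native accented speaker populations.
methods: Trains accent-specific residual adapters for parameter-efficient self-supervised adaptation, with the HuBERT model as a baseline.
results: Strong word error rate reductions (WERR) over HuBERT-large for all 4 accents: a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% when the entire encoder is accent-adapted.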

Abstract
Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks. However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way via training accent-specific residual adapters. We experiment with 4 accents and choose automatic speech recognition (ASR) as the downstream task of interest. We obtain strong word error rate reductions (WERR) over HuBERT-large for all 4 accents, with a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% if the entire encoder is accent-adapted. While our experiments utilize HuBERT and ASR as the downstream task, our proposed approach is both model and task-agnostic.

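Summary
Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully to several downstream tasks. Such representations may, however, be skewed toward the canonical characteristics of those corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, the work proposes parameter-efficient self-supervised adaptation of speech representations by training accent-specific residual adapters. Experiments cover 4 accents, with automatic speech recognition (ASR) as the downstream task. The approach obtains strong word error rate reductions (WERR) over HuBERT-large on all 4 accents, with a mean WERR of 22.7% using accent-specific adapters and 25.1% when the entire encoder is accent-adapted. Although the experiments use HuBERT and ASR, the proposed approach is both model- and task-agnostic.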
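
The abstract does not spell out the adapter architecture, so the sketch below assumes the common bottleneck design (down-projection, nonlinearity, up-projection, skip connection) and uses toy weights. It also shows WERR as the relative reduction in word error rate; the WER values passed to `werr` are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(matrix, v):
    # matrix is a list of rows; returns the product matrix @ v.
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def residual_adapter(hidden, w_down, w_up):
    """Bottleneck adapter with a skip connection: h + W_up relu(W_down h)."""
    bottleneck = relu(matvec(w_down, hidden))
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


def werr(wer_baseline, wer_adapted):
    """Relative word error rate reduction, in percent."""
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline


hidden = [0.5, -1.0, 2.0, 0.0]            # toy 4-dim frame representation
w_down = [[1.0, 0.0, 0.0, 0.0],           # 4 -> 2 bottleneck
          [0.0, 0.0, 1.0, 0.0]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]  # zero init: adapter is the identity
w_up = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.2], [0.0, 0.0]]

print(residual_adapter(hidden, w_down, w_up_zero))  # [0.5, -1.0, 2.0, 0.0]
print(residual_adapter(hidden, w_down, w_up))
print(werr(20.0, 15.0))                             # 25.0
```

Only `w_down` and `w_up` would be trained per accent while the encoder stays frozen, which is what makes the adapter route parameter-efficient compared with adapting the entire encoder.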

A Dual-Stream Recurrence-Attention Network with Global-Local Awareness for Emotion Recognition in Textual Dialogue

paper_url: http://arxiv.org/abs/2307.00449
repo_url: None
paper_authors: Jiang Li, Xiaoping Wang, Zhigang Zeng
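for: Proposes a straightforward Dual-stream Recurrence-Attention Network (DualRAN) for the Emotion Recognition in Conversation (ERC) task.
methods: Combines a Recurrent Neural Network (RNN) with a Multi-head ATtention network (MAT) in a new dual-stream structure that models both the global and the local context of a conversation.
results: The proposed model outperforms all baselines on four widely used benchmark datasets, and ablation studies demonstrate the effectiveness of each component.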

Abstract
In real-world dialogue systems, the ability to understand the user's emotions and interact anthropomorphically is of great significance. Emotion Recognition in Conversation (ERC) is one of the key ways to accomplish this goal and has attracted growing attention. How to model the context in a conversation is a central aspect and a major challenge of ERC tasks. Most existing approaches are generally unable to capture both global and local contextual information efficiently, and their network structures are too complex to design. For this reason, in this work, we propose a straightforward Dual-stream Recurrence-Attention Network (DualRAN) based on Recurrent Neural Network (RNN) and Multi-head ATtention network (MAT). The proposed model eschews the complex network structure of current methods and focuses on combining recurrence-based methods with attention-based methods. DualRAN is a dual-stream structure mainly consisting of local- and global-aware modules, modeling a conversation from distinct perspectives. To achieve the local-aware module, we extend the structure of RNN, thus enhancing the expressive capability of the network. In addition, we develop two single-stream network variants for DualRAN, i.e., SingleRANv1 and SingleRANv2. We conduct extensive experiments on four widely used benchmark datasets, and the results reveal that the proposed model outshines all baselines. Ablation studies further demonstrate the effectiveness of each component.

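Summary
In real-world dialogue systems, understanding the user's emotions and interacting in a human-like way is of great importance. Emotion Recognition in Conversation (ERC) is one of the key ways to achieve this and has attracted growing attention. Modeling conversational context is both central to ERC and a major challenge. Most existing approaches cannot capture global and local contextual information efficiently, and their network structures are complex and hard to design. The work therefore proposes a straightforward Dual-stream Recurrence-Attention Network (DualRAN) based on a Recurrent Neural Network (RNN) and a Multi-head ATtention network (MAT). The model avoids the complex structures of current methods and instead combines recurrence-based and attention-based methods. DualRAN is a dual-stream structure consisting mainly of local-aware and global-aware modules that model a conversation from distinct perspectives; the local-aware module extends the RNN structure to increase the network's expressive capability. Two single-stream variants, SingleRANv1 and SingleRANv2, are also developed. Extensive experiments on four widely used benchmark datasets show that the model outperforms all baselines, and ablation studies demonstrate the effectiveness of each component.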

Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin

paper_url: http://arxiv.org/abs/2307.00382
repo_url: https://github.com/muhammed-saeed/clat
paper_authors: Pin-Jie Lin, Muhammed Saeed, Ernie Chang, Merel Scholman
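for: Improving text classification and translation for Nigerian Pidgin (Naija), a low-resource language, by collecting a large-scale parallel English-Pidgin corpus and proposing a cross-lingual adaptive training framework.
methods: Uses English pre-trained language models as a stronger prior, and combines continual and task adaptive training with orthographic data augmentation and back-translation.
results: English pre-trained language models outperform multilingual language models on English-Pidgin tasks by up to 2.38 BLEU; augmenting orthographic data and using task adaptive training with back-translation have a significant impact on model performance.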

Abstract
Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we aim to improve both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus, and we further propose a framework of cross-lingual adaptive training that includes both continual and task adaptive training so as to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with up to 2.38 BLEU improvements, and demonstrate that augmenting orthographic data and using task adaptive training with back-translation can have a significant impact on model performance.

Summary
Building effective spoken language processing systems for low-resource languages is difficult because parallel data is lacking and resources for fine-tuning models are limited. The work targets text classification and translation for Nigerian Pidgin (Naija): it collects a large-scale parallel English-Pidgin corpus and proposes a cross-lingual adaptive training framework that combines continual and task adaptive training to adapt a base pre-trained model to low-resource languages. English pre-trained language models prove to be a stronger prior than multilingual language models on English-Pidgin tasks, with up to 2.38 BLEU improvements, and augmenting orthographic data and using task adaptive training with back-translation can have a significant impact on model performance.

Effective Matching of Patients to Clinical Trials using Entity Extraction and Neural Re-ranking

paper_url: http://arxiv.org/abs/2307.00381
repo_url: https://github.com/ProjectDossier/patient-trial-matching
paper_authors: Wojciech Kusa, Óscar E. Mendoza, Petr Knoth, Gabriella Pasi, Allan Hanbury
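for: Addresses inadequate patient recruitment in clinical trials (CTs) by improving CT retrieval under the patient-to-trials paradigm.
methods: A pipeline with two key components: a data enrichment technique that enhances both queries and documents during the first retrieval stage, and a Transformer-based re-ranking schema. Enrichment applies named entity recognition and negation detection to patient descriptions and to the eligibility sections of trials. The re-ranker is trained in two steps: the first matches patient information with the descriptive sections of trials, and the second matches patient information with the eligibility criteria section.
results: The inclusion criteria section of a CT strongly influences the relevance score in lexical models, and enriching queries and documents improves the retrieval of relevant trials. The re-ranking strategy consistently enhances CT retrieval, improving precision at retrieving eligible trials by 15%, and shows promising effectiveness compared to larger neural models even with limited training data.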

Abstract
Clinical trials (CTs) often fail due to inadequate patient recruitment. This paper tackles the challenges of CT retrieval by presenting an approach that addresses the patient-to-trials paradigm. Our approach involves two key components in a pipeline-based model: (i) a data enrichment technique for enhancing both queries and documents during the first retrieval stage, and (ii) a novel re-ranking schema that uses a Transformer network in a setup adapted to this task by leveraging the structure of the CT documents. We use named entity recognition and negation detection in both patient description and the eligibility section of CTs. We further classify patient descriptions and CT eligibility criteria into current, past, and family medical conditions. This extracted information is used to boost the importance of disease and drug mentions in both query and index for lexical retrieval. Furthermore, we propose a two-step training schema for the Transformer network used to re-rank the results from the lexical retrieval. The first step focuses on matching patient information with the descriptive sections of trials, while the second step aims to determine eligibility by matching patient information with the criteria section. Our findings indicate that the inclusion criteria section of the CT has a great influence on the relevance score in lexical models, and that the enrichment techniques for queries and documents improve the retrieval of relevant trials. The re-ranking strategy, based on our training schema, consistently enhances CT retrieval and shows improved performance by 15% in terms of precision at retrieving eligible trials. The results of our experiments suggest the benefit of making use of extracted entities. Moreover, our proposed re-ranking schema shows promising effectiveness compared to larger neural models, even with limited training data.

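Summary
Clinical trials (CTs) often fail because of inadequate patient recruitment. The paper tackles CT retrieval with a pipeline-based model that has two key components: (i) a data enrichment technique that enhances both queries and documents, and (ii) a new re-ranking schema based on a Transformer network. Named entity recognition and negation detection are applied to patient descriptions and to the eligibility sections of trials, and both are further classified into current, past, and family medical conditions. The extracted information is used to boost the importance of disease and drug mentions in the query and the index. The Transformer re-ranker is trained in two steps: the first matches patient information with the descriptive sections of trials, and the second matches it with the criteria section. The findings indicate that the inclusion criteria section has a great influence on the relevance score in lexical models and that the enrichment techniques improve the retrieval of relevant trials. The re-ranking strategy consistently enhances CT retrieval and shows promising effectiveness compared to larger neural models, even with limited training data.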
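
The enrichment step, boosting disease and drug mentions and respecting negation, can be sketched as a weighted lexical query. This is a toy illustration under stated assumptions: the entity and negation sets are written by hand (a real pipeline would obtain them from named entity recognition and negation detection), the boost value of 3.0 is arbitrary, and the scoring function is a simplified stand-in for a lexical ranker such as BM25.

```python
from collections import Counter


def build_weighted_query(text, entities, negated, boost=3.0):
    """Weight query terms: entities boosted, negated terms dropped."""
    weights = {}
    for term in text.lower().split():
        if term in negated:
            continue  # a negated condition should not attract matching trials
        weights[term] = boost if term in entities else 1.0
    return weights


def lexical_score(weighted_query, document):
    """Sum of query weight times term frequency in the document."""
    counts = Counter(document.lower().split())
    return sum(w * counts[t] for t, w in weighted_query.items())


patient = "adult with asthma taking salbutamol no diabetes"
entities = {"asthma", "salbutamol"}   # assumed output of entity extraction
negated = {"diabetes"}                # assumed output of negation detection

query = build_weighted_query(patient, entities, negated)
trial_a = "asthma trial of salbutamol in adult patients"
trial_b = "diabetes trial in adult patients with obesity"
print(lexical_score(query, trial_a))  # 7.0
print(lexical_score(query, trial_b))  # 2.0
```

Without the negation handling, the second trial would pick up a match on "diabetes" even though the patient description rules that condition out.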

Revisiting Sample Size Determination in Natural Language Understanding

paper_url: http://arxiv.org/abs/2307.00374
repo_url: https://github.com/pjlintw/sample-size
paper_authors: Ernie Chang, Muhammad Hassan Rashid, Pin-Jie Lin, Changsheng Zhao, Vera Demberg, Yangyang Shi, Vikas Chandra
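for: Predicting model performance in order to reduce data annotation budgets.
methods: Predicts the maximum achievable model performance from a small number of training samples, with ablation studies on four language understanding tasks.
results: Forecasts model performance within a small margin of mean absolute error (~0.9%) using only 10% of the data.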

Abstract
Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budgets for annotation. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on a small number of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small margin of mean absolute error (~0.9%) with only 10% of the data.

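Summary
Knowing exactly how many data points must be labeled to reach a given model performance is a highly beneficial step towards reducing annotation budgets. It applies to both active learning and traditional data annotation, and is especially useful in low-resource scenarios, yet it remains a largely under-explored area of NLP research. The authors explore various techniques for estimating the training sample size needed to reach a targeted performance value, and derive a simple yet effective approach that predicts the maximum achievable model performance from a small number of training samples. The prediction serves as an early indicator of data quality and sample size during annotation. Ablation studies on four language understanding tasks show that the approach forecasts model performance within a mean absolute error of about 0.9% using only 10% of the data.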
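
The abstract does not give the functional form used for the forecast, so the sketch below is not the authors' method. It shows one simple way to extrapolate a learning curve: fit score = a - b / n to a few pilot measurements by least squares, where the intercept a estimates the maximum achievable performance. The pilot sizes and scores are hypothetical.

```python
def fit_inverse_curve(sizes, scores):
    """Least-squares fit of score = a - b / n; returns (a, b).

    The model is linear in x = 1 / n, so simple linear regression applies.
    """
    xs = [1.0 / n for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    a = mean_y - slope * mean_x  # intercept: the score as n grows without bound
    return a, -slope


def predict(a, b, n):
    """Forecast the score at training set size n."""
    return a - b / n


# Hypothetical pilot measurements on small subsets of the training data.
sizes = [100, 200, 400, 800]
scores = [0.70, 0.78, 0.83, 0.85]
a, b = fit_inverse_curve(sizes, scores)
print(round(a, 3), round(predict(a, b, 8000), 3))
```

Comparing `predict(a, b, n)` against the score measured once the full set is labeled gives the absolute error for one task; averaging that over tasks yields a mean absolute error of the kind the abstract reports.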