results: Compared with a baseline approach without feedback, the method reduces review effort by between 17.85% and 59.04%, given a fixed recall target.
Abstract
In a number of information retrieval applications (e.g., patent search, literature review, due diligence), preventing false negatives is more important than preventing false positives. However, approaches designed to reduce review effort (like "technology assisted review") can create false negatives, since they are often based on active learning systems that exclude documents automatically based on user feedback. Therefore, this research proposes a more recall-oriented approach to reducing review effort. More specifically, review effort is reduced by iteratively re-ranking the relevance rankings based on user feedback, a technique also referred to as relevance feedback. In our proposed method, the relevance rankings are produced by a BERT-based dense-vector search and the relevance feedback is based on cumulatively summing the queried and selected embeddings. Our results show that this method can reduce review effort between 17.85% and 59.04%, compared to a baseline approach (of no feedback), given a fixed recall target.
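The feedback loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-d vectors stand in for BERT embeddings, cosine similarity is an assumed scoring function, and the names `rank` and `review_with_feedback` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs, exclude=()):
    """Return document ids not in `exclude`, most similar to the query first."""
    skip = set(exclude)
    candidates = [d for d in doc_vecs if d not in skip]
    return sorted(candidates, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)

def review_with_feedback(query_vec, doc_vecs, is_relevant, n_relevant_needed):
    """Review the top-ranked unreviewed document until enough relevant ones are found.

    After each document the reviewer marks as relevant, its embedding is added to
    the running query vector and the remaining documents are re-ranked.
    No document is excluded automatically; only the order changes.
    """
    state = list(query_vec)
    reviewed, found = [], 0
    while found < n_relevant_needed and len(reviewed) < len(doc_vecs):
        top = rank(state, doc_vecs, exclude=reviewed)[0]
        reviewed.append(top)
        if is_relevant(top):
            found += 1
            # Cumulative sum of the query and the selected embeddings.
            state = [s + x for s, x in zip(state, doc_vecs[top])]
    return reviewed

# Toy 2-d "embeddings" (assumed data); d1 and d2 are the relevant documents.
docs = {"d1": [0.6, 0.8], "d2": [0.0, 1.0], "d3": [0.5, -0.866]}
relevant = {"d1", "d2"}
baseline_order = rank([1.0, 0.0], docs)
feedback_order = review_with_feedback([1.0, 0.0], docs, relevant.__contains__, 2)
```

In this toy setup the static ranking places the irrelevant `d3` ahead of `d2`, so three documents must be reviewed to find both relevant ones, while the feedback loop reaches them after two.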
Solving the Right Problem is Key for Translational NLP: A Case Study in UMLS Vocabulary Insertion
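results: With a new problem formulation and datasets, the proposed model outperforms all strong baselines built by re-purposing existing solutions and provides measurable improvements to the editors who carry out the task.
Abstract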
As the immense opportunities enabled by large language models become more apparent, NLP systems will be increasingly expected to excel in real-world settings. However, in many instances, powerful models alone will not yield translational NLP solutions, especially if the formulated problem is not well aligned with the real-world task. In this work, we study the case of UMLS vocabulary insertion, an important real-world task in which hundreds of thousands of new terms, referred to as atoms, are added to the UMLS, one of the most comprehensive open-source biomedical knowledge bases. Previous work aimed to develop an automated NLP system to make this time-consuming, costly, and error-prone task more efficient. Nevertheless, practical progress in this direction has been difficult to achieve due to a problem formulation and evaluation gap between research output and the real-world task. In order to address this gap, we introduce a new formulation for UMLS vocabulary insertion which mirrors the real-world task, datasets which faithfully represent it and several strong baselines we developed through re-purposing existing solutions. Additionally, we propose an effective rule-enhanced biomedical language model which enables important new model behavior, outperforms all strong baselines and provides measurable qualitative improvements to editors who carry out the UVI task. We hope this case study provides insight into the considerable importance of problem formulation for the success of translational NLP solutions.
Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching
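results: Compared with baselines, finetuning self-supervised representations and adding n-gram language models reduces absolute word error rates by up to 20%. This suggests that finetuning self-supervised representations is a better-performing and viable solution when training data is limited.
Abstract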
While many speakers of low-resource languages regularly code-switch between their languages and other regional languages or English, datasets of codeswitched speech are too small to train bespoke acoustic models from scratch or do language model rescoring. Here we propose finetuning self-supervised speech representations such as wav2vec 2.0 XLSR to recognize code-switched data. We find that finetuning self-supervised multilingual representations and augmenting them with n-gram language models trained from transcripts reduces absolute word error rates by up to 20% compared to baselines of hybrid models trained from scratch on code-switched data. Our findings suggest that in circumstances with limited training data finetuning self-supervised representations is a better performing and viable solution.
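For readers unfamiliar with the metric, word error rate can be computed as the word-level edit distance divided by the reference length. The sketch below is a generic illustration with a hypothetical function name `wer`; it is not the evaluation code used in the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete a reference word
                           cur[j - 1] + 1,           # insert a hypothesis word
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1] / len(ref)

baseline_wer = wer("a b c d", "a x c")    # one substitution and one deletion
improved_wer = wer("a b c d", "a b c x")  # one substitution
absolute_reduction = baseline_wer - improved_wer
```

An absolute reduction is the plain difference between the two rates, so a drop from 0.50 to 0.25 is 25 points absolute, even though it is a 50% relative reduction.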
Automatically Finding and Categorizing Replication Studies
results: Replication studies could be identified from text content at a higher rate than chance (AUROC = 0.886), and successful replication studies could be distinguished from failed ones at a higher rate than chance (AUROC = 0.664).
Abstract
In many fields of experimental science, papers that failed to replicate continue to be cited as a result of the poor discoverability of replication studies. As a first step to creating a system that automatically finds replication studies for a given paper, 334 replication studies and 344 replicated studies were collected. Replication studies could be identified in the dataset based on text content at a higher rate than chance (AUROC = 0.886). Additionally, successful replication studies could be distinguished from failed replication studies at a higher rate than chance (AUROC = 0.664).
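The AUROC values above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with 0.5 corresponding to chance. A minimal sketch of that pairwise definition follows; the function name `auroc` and the toy labels and scores are assumptions for illustration only.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = replication study, 0 = other paper.
example = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
chance = auroc([1, 1, 0, 0], [0.3, 0.3, 0.3, 0.3])
```

The pairwise form is quadratic in the number of examples, which is acceptable for a dataset of a few hundred papers; a rank-based formulation would be preferable for larger collections.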
Detection of developmental language disorder in Cypriot Greek children using a machine learning neural network algorithm
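results: The neural network model achieved high classification values (between 0.92 and 0.98) in distinguishing children with DLD from healthy controls, indicating high accuracy in detecting DLD. In addition, the variable importance analysis showed that children's language production skills had a greater impact on model performance than perception skills.
Abstract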
Children with developmental language disorder (DLD) encounter difficulties in acquiring various language structures. Early identification and intervention are crucial to prevent negative long-term outcomes impacting the academic, social, and emotional development of children. The study aims to develop an automated method for the identification of DLD using artificial intelligence, specifically a neural network machine learning algorithm. This protocol is applied for the first time in Cypriot Greek children, which is generally considered underresearched in the context of DLD. The neural network model was trained using perceptual and production data elicited from children with DLD and healthy controls. The k-fold technique was used to crossvalidate the algorithm. The performance of the model was evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC/AUC curve to assess its ability to make accurate predictions on a set of unseen data. The results demonstrated high classification values for all metrics (between 0.92 and 0.98), indicating the high accuracy of the neural model in classifying children with DLD. Additionally, the variable importance analysis revealed that the language production skills of children had a more significant impact on the performance of the model compared to perception skills. Neural networks represent powerful tools for detecting DLD, providing early and quick assessments of the disorder, and having the potential to improve clinical outcomes.
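The k-fold cross-validation step mentioned in the abstract can be illustrated with a small index splitter. This is a generic sketch under stated assumptions: the function name `k_fold_indices` is hypothetical, folds are contiguous, and the data is assumed to have been shuffled beforehand; the study's actual fold count and splitting procedure are not given in the abstract.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
```

Each example appears in exactly one test fold, so every reported metric is computed on data the model did not see while training on the other folds.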
nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis
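for: This work is an entry to the shared task on sentiment analysis of Bangla social media posts at the first workshop on Bangla Language Processing (BLP).
methods: A transfer learning approach combined with data augmentation.
results: The best system achieved a micro F1 score of 0.71 and ranked 12th among 30 participating teams.
Abstract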
In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The main objective of this task is to identify the polarity of social media content using a Bangla dataset annotated with positive, neutral, and negative labels provided by the shared task organizers. Our best system for this task is a transfer learning approach with data augmentation which achieved a micro F1 score of 0.71. Our best system ranked 12th among 30 teams that participated in the competition.
nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla
results: The best approach achieved a macro F1 score of 0.74 on the VITD task and ranked 6th out of 27 teams.
Abstract
In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify the violent threats, that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using back translation and multilinguality which ranked 6th out of 27 teams with a macro F1 score of 0.74.
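The two nlpBDpatriots entries report different averages: micro F1 for the sentiment task and macro F1 for this one. The sketch below shows how the two differ on a toy three-class example; the function name `f1_scores` and the labels are assumptions, not the shared tasks' scoring scripts.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

micro, macro = f1_scores(
    ["pos", "pos", "pos", "neg", "neu"],
    ["pos", "pos", "neg", "neg", "pos"],
)
```

Because macro F1 weights every class equally, the completely missed `neu` class pulls it well below micro F1 here, which is why macro averaging is often preferred when rare classes matter.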
Offensive Language Identification in Transliterated and Code-Mixed Bangla
for: This work addresses offensive content identification in social media, particularly in multilingual societies where transliteration and code-mixing are common.
results: English pre-trained transformer-based models achieved the best performance on the TB-OLID dataset.
Abstract
Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT achieve the best performance on this dataset.
Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains
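results: The study finds that current large language models show limitations and inaccuracies in high-risk domains, and that further improvement and a more human-centric approach are needed to enhance their safety and factual reliability.
Abstract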
High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when evaluating in high-risk domains. This underscores the essential nature of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field toward the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.
Vector-Quantized Prompt Learning for Paraphrase Generation
paper_authors: Haotian Luo, Yixin Liu, Peidong Liu, Xianggen Liu
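for: Improving the diversity and semantic preservation of natural language generation models, specifically for paraphrase generation.
methods: Pre-trained models whose generation is controlled by instance-dependent prompts.
results: New state-of-the-art results on three benchmark datasets: Quora, Wikianswers, and MSCOCO.
Abstract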
Deep generative modeling of natural languages has achieved many successes, such as producing fluent sentences and translating from one language into another. However, the development of generative modeling techniques for paraphrase generation still lags behind largely due to the challenges in addressing the complex conflicts between expression diversity and semantic preservation. This paper proposes to generate diverse and high-quality paraphrases by exploiting the pre-trained models with instance-dependent prompts. To learn generalizable prompts, we assume that the number of abstract transforming patterns of paraphrase generation (governed by prompts) is finite and usually not large. Therefore, we present vector-quantized prompts as the cues to control the generation of pre-trained models. Extensive experiments demonstrate that the proposed method achieves new state-of-art results on three benchmark datasets, including Quora, Wikianswers, and MSCOCO. We will release all the code upon acceptance.
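The core operation behind vector quantization is mapping a continuous vector to the nearest entry of a finite codebook, which matches the abstract's assumption that the number of transforming patterns is finite. The sketch below shows only that lookup step; the function name `quantize`, the toy codebook, and the use of squared Euclidean distance are assumptions, and the paper's actual prompt encoder and training procedure are not described in the abstract.

```python
def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to `vec`."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))
    return idx, codebook[idx]

# Hypothetical codebook of three prompt vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
idx_a, code_a = quantize([0.9, 1.2], codebook)
idx_b, code_b = quantize([1.8, 0.1], codebook)
```

Every instance-dependent vector is replaced by one of a small set of shared codes, so the prompts that condition the pre-trained model come from a fixed, learnable vocabulary.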
Faster Minimum Bayes Risk Decoding with Confidence-based Pruning
for: Improving the efficiency of Minimum Bayes risk (MBR) decoding in conditional language generation problems, specifically neural machine translation.
methods: An MBR decoding algorithm that gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling.
results: In experiments on three language pairs, using chrF++ and COMET as utility/evaluation metrics, the method requires fewer samples and drastically reduces the number of calls to the utility function compared to standard MBR while being statistically indistinguishable in terms of accuracy.
Abstract
Minimum Bayes risk (MBR) decoding outputs the hypothesis with the highest expected utility over the model distribution for some utility function. It has been shown to improve accuracy over beam search in conditional language generation problems and especially neural machine translation, in both human and automatic evaluations. However, the standard sampling-based algorithm for MBR is substantially more computationally expensive than beam search, requiring a large number of samples as well as a quadratic number of calls to the utility function, limiting its applicability. We describe an algorithm for MBR which gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling. Our method requires fewer samples and drastically reduces the number of calls to the utility function compared to standard MBR while being statistically indistinguishable in terms of accuracy. We demonstrate the effectiveness of our approach in experiments on three language pairs, using chrF++ and COMET as utility/evaluation metrics.
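The general idea of growing the sample pool while pruning unlikely hypotheses can be sketched as below. This is an illustration under stated assumptions, not the paper's algorithm: the sample-size schedule, the bootstrap win-rate threshold `min_win`, the function name `mbr_with_pruning`, and the word-overlap utility are all hypothetical choices made for the example.

```python
import random

def mbr_with_pruning(hyps, draw_sample, utility,
                     schedule=(4, 8, 16), n_boot=200, min_win=0.05, seed=0):
    """Pick the hypothesis with the highest estimated expected utility.

    Returns (best_hypothesis, number_of_utility_calls).
    """
    rng = random.Random(seed)
    alive = list(range(len(hyps)))
    samples = []
    scores = {i: [] for i in alive}  # scores[i][j] = utility(hyps[i], samples[j])
    calls = 0
    for size in schedule:
        if len(alive) == 1:
            break  # nothing left to compare, stop spending utility calls
        while len(samples) < size:
            samples.append(draw_sample(rng))
        for i in alive:
            for j in range(len(scores[i]), len(samples)):
                scores[i].append(utility(hyps[i], samples[j]))
                calls += 1
        # Bootstrap: resample the pool and count how often each hypothesis wins.
        wins = {i: 0 for i in alive}
        for _ in range(n_boot):
            idx = [rng.randrange(len(samples)) for _ in samples]
            best = max(alive, key=lambda i: sum(scores[i][j] for j in idx))
            wins[best] += 1
        survivors = [i for i in alive if wins[i] / n_boot >= min_win]
        alive = survivors or [max(wins, key=wins.get)]
    best = max(alive, key=lambda i: sum(scores[i]) / len(scores[i]))
    return hyps[best], calls

def overlap(hyp, ref):
    """Toy utility: Jaccard overlap of word sets (stand-in for chrF++ or COMET)."""
    a, b = set(hyp.split()), set(ref.split())
    return len(a & b) / len(a | b)

hyps = ["the cat sat", "a cat sat", "dogs bark"]
best, calls = mbr_with_pruning(hyps, lambda rng: "the cat sat", overlap)
```

In this toy run the two weaker hypotheses are pruned after the first four samples, so only 12 utility calls are made instead of the 48 that scoring all three hypotheses against 16 samples would require.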
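results: Experiments show that the debiasing framework effectively reduces the bias of code search models while also improving the overall ranking performance of code search.
Abstract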
Code search engine is an essential tool in software development. Many code search methods have sprung up, focusing on the overall ranking performance of code search. In this paper, we study code search from another perspective by analyzing the bias of code search models. Biased code search engines provide poor user experience, even though they show promising overall performance. Due to different development conventions (e.g., prefer long queries or abbreviations), some programmers will find the engine useful, while others may find it hard to get desirable search results. To mitigate biases, we develop a general debiasing framework that employs reranking to calibrate search results. It can be easily plugged into existing engines and handle new code search biases discovered in the future. Experiments show that our framework can effectively reduce biases. Meanwhile, the overall ranking performance of code search gets improved after debiasing.