results: The study finds that augmenting the translation training data with generated dialogues improves the accuracy of translating user requests into dataflow expressions.
Abstract
We demonstrate task-oriented dialogue generation within the dataflow dialogue paradigm. We show an example of agenda driven dialogue generation for the MultiWOZ domain, and an example of generation without an agenda for the SMCalFlow domain, where we show an improvement in the accuracy of the translation of user requests to dataflow expressions when the generated dialogues are used to augment the translation training dataset.
Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
paper_authors: Mousumi Akter, Shubhra Kanti Karmaker Santu
for: This paper aims to address the limitations of the existing automated evaluation metric ROUGE, which lacks semantic awareness and does not consider the ranking quality of the summarizer.
methods: The authors propose a redundancy-aware extension of the Sem-nCG metric, which is both rank and semantic aware, and demonstrate how it can be used to evaluate model summaries against multiple references. They also explore different ways of incorporating redundancy into the original metric through extensive experiments.
results: The authors show that the new redundancy-aware metric exhibits a higher correlation with human judgments than the original Sem-nCG metric for both single and multiple reference scenarios.
Abstract
While very popular for evaluating the extractive summarization task, the ROUGE metric has long been criticized for its lack of semantic awareness and its ignorance about the ranking quality of the summarizer. Previous research has addressed these issues by proposing a gain-based automated metric called Sem-nCG, which is both rank and semantic aware. However, Sem-nCG does not consider the amount of redundancy present in a model-generated summary and currently does not support evaluation with multiple reference summaries. Unfortunately, addressing both these limitations simultaneously is not trivial. Therefore, in this paper, we propose a redundancy-aware Sem-nCG metric and demonstrate how this new metric can be used to evaluate model summaries against multiple references. We also explore different ways of incorporating redundancy into the original metric through extensive experiments. Experimental results demonstrate that the new redundancy-aware metric exhibits a higher correlation with human judgments than the original Sem-nCG metric for both single and multiple reference scenarios.
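As a rough illustration of the rank-aware gain idea behind Sem-nCG (the paper's exact semantic gain and redundancy formulation is not reproduced here), the sketch below computes a normalized cumulative gain over the summarizer's ranked sentences and applies an assumed redundancy discount based on pairwise similarity; the `gains` and `pairwise_sim` inputs are placeholders for whatever semantic scorer is used.

```python
import numpy as np

def ncg_at_k(gains, k):
    """Normalized cumulative gain: sum of the top-k predicted gains
    divided by the best achievable sum (ideal ordering)."""
    gains = np.asarray(gains, dtype=float)
    cg = gains[:k].sum()
    ideal = np.sort(gains)[::-1][:k].sum()
    return cg / ideal if ideal > 0 else 0.0

def redundancy_penalized_ncg(gains, pairwise_sim, k, alpha=0.5):
    """Illustrative redundancy-aware variant: each selected sentence's gain is
    discounted by its maximum similarity to previously selected sentences.
    pairwise_sim[i][j] is an assumed semantic similarity in [0, 1]."""
    adjusted = []
    for i, g in enumerate(gains[:k]):
        max_sim = max((pairwise_sim[i][j] for j in range(i)), default=0.0)
        adjusted.append(g * (1.0 - alpha * max_sim))
    ideal = np.sort(np.asarray(gains, dtype=float))[::-1][:k].sum()
    return sum(adjusted) / ideal if ideal > 0 else 0.0

# Example: gains for sentences in the order the summarizer ranked them.
gains = [0.9, 0.7, 0.65, 0.2]
sim = [[1.0, 0.8, 0.1, 0.0],
       [0.8, 1.0, 0.2, 0.0],
       [0.1, 0.2, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
print(ncg_at_k(gains, k=3), redundancy_penalized_ncg(gains, sim, k=3))
```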
Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
results: Our proposed model achieves comparable or better results than SOTA models on the Voice Bank + DEMAND dataset, while using significantly fewer parameters (0.58M).
Abstract
Speech enhancement is a demanding task in automated speech processing pipelines, focusing on separating clean speech from noisy channels. Transformer based models have recently bested RNN and CNN models in speech enhancement, however at the same time they are much more computationally expensive and require much more high quality training data, which is always hard to come by. In this paper, we present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity, which we have termed Spectrum Attention Fusion. We carefully construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features. Our proposed model is able to achieve comparable or better results against SOTA models but with significantly smaller parameters (0.58M) on the Voice Bank + DEMAND dataset.
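The abstract does not specify the exact Spectrum Attention Fusion block, so the PyTorch sketch below is only a guess at the general shape of the idea: a lightweight convolutional module that mixes spectral features as a drop-in replacement for a self-attention layer. The layer sizes, kernel width, and residual/normalization choices here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpectralFusionBlock(nn.Module):
    """Hypothetical stand-in for a self-attention layer: a depthwise temporal
    convolution followed by a pointwise mix across frequency bins."""
    def __init__(self, n_freq_bins, kernel_size=7):
        super().__init__()
        self.temporal = nn.Conv1d(n_freq_bins, n_freq_bins, kernel_size,
                                  padding=kernel_size // 2, groups=n_freq_bins)
        self.freq_mix = nn.Conv1d(n_freq_bins, n_freq_bins, kernel_size=1)
        self.norm = nn.LayerNorm(n_freq_bins)
        self.act = nn.GELU()

    def forward(self, x):           # x: (batch, time, freq)
        y = x.transpose(1, 2)       # (batch, freq, time) for Conv1d
        y = self.act(self.freq_mix(self.temporal(y)))
        y = y.transpose(1, 2)
        return self.norm(x + y)     # residual connection, as in a Transformer block

block = SpectralFusionBlock(n_freq_bins=257)
spec = torch.randn(4, 100, 257)     # a batch of noisy magnitude spectrograms
print(block(spec).shape)            # torch.Size([4, 100, 257])
```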
results: Experimental results show that all three datasets are of high quality and reliability and can be used for multilingual processing tasks involving English and Sinhala.
Abstract
Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, in the cases where one of the considered language pairs is a low-resource language, the existing top-down parallel data such as corpora are lacking in both tally and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction where finer granular pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks in the order of aligning sentence or paragraph text corpora used for Machine Translation (MT). Even though more approachable than generating and aligning a massive corpus for a low-resource language, for the same reason of apathy from larger research entities, even these finer granular data sets are lacking for some low-resource languages. We have observed that there is no free and open dictionary data set for the low-resource language, Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages. In this paper, we explain the dataset creation pipeline as well as the experimental results of the tests we have carried out to verify the quality of the data sets. The data sets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict.
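As a small illustration of the mid-level task the abstract mentions (supervised cross-lingual word-embedding alignment driven by a dictionary), the sketch below learns an orthogonal Procrustes mapping from English vectors to Sinhala vectors using dictionary pairs. The tab-separated file format and the embedding lookups are assumptions for illustration, not the released scripts.

```python
import numpy as np

def load_pairs(path):
    """Assumes one tab-separated 'english<TAB>sinhala' pair per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t")[:2] for line in f if "\t" in line]

def procrustes(src_vecs, tgt_vecs):
    """Orthogonal map W minimizing ||src @ W - tgt||_F (SVD solution)."""
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)
    return (u @ vt).T

def align(pairs, en_emb, si_emb):
    """en_emb / si_emb are assumed dicts word -> np.ndarray (e.g. FastText vectors)."""
    kept = [(e, s) for e, s in pairs if e in en_emb and s in si_emb]
    X = np.stack([en_emb[e] for e, _ in kept])
    Y = np.stack([si_emb[s] for _, s in kept])
    return procrustes(X, Y)   # map an English vector v into Sinhala space with v @ W
```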
Learning to Paraphrase Sentences to Different Complexity Levels
results: Training on these three datasets, we achieve state-of-the-art performance on the ASSET simplification benchmark and surpass previous work on sentence-level targeting tasks. We also report how several large language models perform on these tasks in a zero-shot setting.
Abstract
While sentence simplification is an active research topic in NLP, its adjacent tasks of sentence complexification and same-level paraphrasing are not. To train models on all three tasks, we present two new unsupervised datasets. We compare these datasets, one labeled by a weak classifier and the other by a rule-based approach, with a single supervised dataset. Using these three datasets for training, we perform extensive experiments on both multitasking and prompting strategies. Compared to other systems trained on unsupervised parallel data, models trained on our weak classifier labeled dataset achieve state-of-the-art performance on the ASSET simplification benchmark. Our models also outperform previous work on sentence level targeting. Finally, we establish how a handful of Large Language Models perform on these tasks under a zero-shot setting.
ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation
results: Experimental results show that the two sampling strategies improve efficiency and reduce memory consumption when training sequence generation models with RL. ESRL is further evaluated in RL from human feedback (RLHF) for training a large language model, where it outperforms all baselines in both training efficiency and memory consumption.
Abstract
Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences. This is a computational challenge as presented by the practice of sequence generation problems, such as machine translation, where we often deal with a large action space (e.g., a vocabulary) and a long action sequence (e.g., a translation). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve the sampling efficiency during training sequence generation models via RL. We experiment with our approaches on the traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) through training a large language model using the reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods.
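The abstract does not spell out the two-stage or dynamic sampling procedures, so the sketch below only shows the plain REINFORCE-style baseline that ESRL improves on: sample candidate sequences from the model, score them with a sequence-level reward such as sentence BLEU, and weight the log-likelihoods by the baseline-subtracted reward. The `model.sample` API and `reward_fn` are placeholders, not the authors' code.

```python
import torch

def reinforce_step(model, src_batch, reward_fn, optimizer, num_samples=4):
    """One sampling-based RL update. model.sample is assumed to return token
    ids and per-sequence log-probabilities; reward_fn returns a sequence-level
    score such as sentence BLEU or a learned reward model's output."""
    optimizer.zero_grad()
    samples, log_probs = model.sample(src_batch, num_samples)   # assumed API
    rewards = torch.tensor([reward_fn(src_batch, s) for s in samples],
                           dtype=log_probs.dtype, device=log_probs.device)
    baseline = rewards.mean()                    # simple variance reduction
    loss = -((rewards - baseline) * log_probs).mean()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```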
Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition
methods: The paper proposes a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER. EMO-DNA has two novel components: contrastive emotion decoupling and dual-level emotion alignment.
results: Experimental results show that EMO-DNA outperforms previous methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
Abstract
Cross-corpus speech emotion recognition (SER) seeks to generalize the ability of inferring speech emotion from a well-labeled corpus to an unlabeled one, which is a rather challenging task due to the significant discrepancy between two corpora. Existing methods, typically based on unsupervised domain adaptation (UDA), struggle to learn corpus-invariant features by global distribution alignment, but unfortunately, the resulting features are mixed with corpus-specific features or not class-discriminative. To tackle these challenges, we propose a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER, a novel UDA method to learn emotion-relevant corpus-invariant features. The novelties of EMO-DNA are two-fold: contrastive emotion decoupling and dual-level emotion alignment. On one hand, our contrastive emotion decoupling achieves decoupling learning via a contrastive decoupling loss to strengthen the separability of emotion-relevant features from corpus-specific ones. On the other hand, our dual-level emotion alignment introduces an adaptive threshold pseudo-labeling to select confident target samples for class-level alignment, and performs corpus-level alignment to jointly guide model for learning class-discriminative corpus-invariant features across corpora. Extensive experimental results demonstrate the superior performance of EMO-DNA over the state-of-the-art methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
From Fake to Hyperpartisan News Detection Using Domain Adaptation
results: Experiments show that these techniques improve performance, and that combining clustering and topic modeling algorithms with UDA yields further gains.
Abstract
Unsupervised Domain Adaptation (UDA) is a popular technique that aims to reduce the domain shift between two data distributions. It was successfully applied in computer vision and natural language processing. In the current work, we explore the effects of various unsupervised domain adaptation techniques between two text classification tasks: fake and hyperpartisan news detection. We investigate the knowledge transfer from fake to hyperpartisan news detection without involving target labels during training. Thus, we evaluate UDA, cluster alignment with a teacher, and cross-domain contrastive learning. Extensive experiments show that these techniques improve performance, while including data augmentation further enhances the results. In addition, we combine clustering and topic modeling algorithms with UDA, resulting in improved performances compared to the initial UDA setup.
Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology
results: Initial findings show that LLMs substantially improve matching; while still imperfect, they can serve as a preliminary step for triaging patient-trial candidates before human review. The study also identifies important growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitations and accuracy, especially when extracting patient information from longitudinal medical records.
Abstract
Clinical trial matching is a key process in health delivery and discovery. In practice, it is plagued by overwhelming unstructured data and unscalable manual processing. In this paper, we conduct a systematic study on scaling clinical trial matching using large language models (LLMs), with oncology as the focus area. Our study is grounded in a clinical trial matching system currently in test deployment at a large U.S. health network. Initial findings are promising: out of box, cutting-edge LLMs, such as GPT-4, can already structure elaborate eligibility criteria of clinical trials and extract complex matching logic (e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially outperform prior strong baselines and may serve as a preliminary solution to help triage patient-trial candidates with humans in the loop. Our study also reveals a few significant growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitation and accuracy, especially in structuring patient information from longitudinal medical records.
You talk what you read: Understanding News Comment Behavior by Dispositional and Situational Attribution
results: The resulting dispositional and situational attributions give a better understanding of user focus and opinions, and are validated in reader-aware news summarization and news aspect-opinion forecasting applications.
Abstract
Many news comment mining studies are based on the assumption that comment is explicitly linked to the corresponding news. In this paper, we observed that users' comments are also heavily influenced by their individual characteristics embodied by the interaction history. Therefore, we position to understand news comment behavior by considering both the dispositional factors from news interaction history, and the situational factors from corresponding news. A three-part encoder-decoder framework is proposed to model the generative process of news comment. The resultant dispositional and situational attribution contributes to understanding user focus and opinions, which are validated in applications of reader-aware news summarization and news aspect-opinion forecasting.
Speaker Diarization of Scripted Audiovisual Content
results: Evaluated on a 66-show test set, our approach achieves a 51.7% relative improvement on our metrics over two unsupervised baseline models.
Abstract
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must be structured into a sequence of dialogue lines each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach to leverage production scripts used during the shooting process, to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66 show test set.
Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques
results: Results show that the ensemble model overcomes the limitations of traditional methods and achieves higher emotion recognition accuracy.
Abstract
Traditional approaches in speech emotion recognition, such as LSTM, CNN, RNN, SVM, and MLP, have limitations such as difficulty capturing long-term dependencies in sequential data, capturing the temporal dynamics, and struggling to capture complex patterns and relationships in multimodal data. This research addresses these shortcomings by proposing an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data and the HuBERT transformer for analyzing audio signals. We found that GCNs excel at capturing Long-term contextual dependencies and relationships within textual data by leveraging graph-based representations of text and thus detecting the contextual meaning and semantic relationships between words. On the other hand, HuBERT utilizes self-attention mechanisms to capture long-range dependencies, enabling the modeling of temporal dynamics present in speech and capturing subtle nuances and variations that contribute to emotion recognition. By combining GCN and HuBERT, our ensemble model can leverage the strengths of both approaches. This allows for the simultaneous analysis of multimodal data, and the fusion of these modalities enables the extraction of complementary information, enhancing the discriminative power of the emotion recognition system. The results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
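To make the ensemble idea concrete, the sketch below fuses a text embedding (e.g. pooled from a GCN over the transcript) with an audio embedding (e.g. pooled HuBERT features) in a small classifier. The embedding dimensions, fusion strategy, and number of emotion classes are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late fusion of precomputed text (GCN) and audio (HuBERT) embeddings."""
    def __init__(self, text_dim=256, audio_dim=768, hidden=256, n_emotions=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, text_emb, audio_emb):
        return self.fuse(torch.cat([text_emb, audio_emb], dim=-1))

clf = FusionClassifier()
logits = clf(torch.randn(8, 256), torch.randn(8, 768))   # batch of 8 utterances
print(logits.shape)                                       # torch.Size([8, 7])
```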
Tweet Insights: A Visualization Platform to Extract Temporal Insights from Twitter
results: The dataset can be used to explore and characterize temporal shifts in language, including complementary information such as changes in sentiment and topic association over time.
Abstract
This paper introduces a large collection of time series data derived from Twitter, postprocessed using word embedding techniques, as well as specialized fine-tuned language models. This data comprises the past five years and captures changes in n-gram frequency, similarity, sentiment and topic distribution. The interface built on top of this data enables temporal analysis for detecting and characterizing shifts in meaning, including complementary information to trending metrics, such as sentiment and topic association over time. We release an online demo for easy experimentation, and we share code and the underlying aggregated data for future work. In this paper, we also discuss three case studies unlocked thanks to our platform, showcasing its potential for temporal linguistic analysis.
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP
results: Experiments on four types of backdoor attacks and four distinct datasets show that the proposed method surpasses baseline methods (STRIP, RAP, and ONION) in precision and recall.
Abstract
Backdoor attacks have emerged as a prominent threat to natural language processing (NLP) models, where the presence of specific triggers in the input can lead poisoned models to misclassify these inputs to predetermined target classes. Current detection mechanisms are limited by their inability to address more covert backdoor strategies, such as style-based attacks. In this work, we propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) are not supposed to fundamentally alter the underlying semantic meanings of poisoned samples as they want to stay stealthy. Based on this observation, we hypothesize that while the model's predictions for paraphrased clean samples should remain stable, predictions for poisoned samples should revert to their true labels upon the mutations applied to triggers during the paraphrasing process. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem. We adopt fuzzing, a technique commonly used for unearthing software vulnerabilities, to discover optimal paraphrase prompts that can effectively eliminate triggers while concurrently maintaining input semantics. Experiments on 4 types of backdoor attacks, including the subtle style backdoors, and 4 distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall.
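The core test-time check is easy to state: if paraphrasing an input flips the model's prediction, treat the input as suspicious. The sketch below assumes a `paraphrase` function (e.g. backed by an LLM prompt discovered via fuzzing) and a `classify` function; both are placeholders for illustration, not the paper's released code.

```python
def is_poisoned(text, classify, paraphrase, n_rewrites=3):
    """Flag an input whose label is unstable under meaning-preserving rewrites.
    A trigger-carrying sample is expected to revert to its true label once the
    paraphrase drops the (semantically irrelevant) trigger tokens."""
    original = classify(text)
    flipped = sum(classify(paraphrase(text)) != original for _ in range(n_rewrites))
    return flipped > n_rewrites // 2    # majority of rewrites disagree

# Usage sketch (model_predict and llm_paraphrase are hypothetical callables):
# suspicious = [x for x in test_inputs if is_poisoned(x, model_predict, llm_paraphrase)]
```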
Chinese Financial Text Emotion Mining: GCGTS – A Character Relationship-based Approach for Simultaneous Aspect-Opinion Pair Extraction
results: Compared with the previous SDRN and GTS models, the proposed GCGTS model achieves significantly better performance on Chinese financial texts, offering a new and effective approach to AOPE.
Abstract
Aspect-Opinion Pair Extraction (AOPE) from Chinese financial texts is a specialized task in fine-grained text sentiment analysis. The main objective is to extract aspect terms and opinion terms simultaneously from a diverse range of financial texts. Previous studies have mainly focused on developing grid annotation schemes within grid-based models to facilitate this extraction process. However, these methods often rely on character-level (token-level) feature encoding, which may overlook the logical relationships between Chinese characters within words. To address this limitation, we propose a novel method called Graph-based Character-level Grid Tagging Scheme (GCGTS). The GCGTS method explicitly incorporates syntactic structure using Graph Convolutional Networks (GCN) and unifies the encoding of characters within the same syntactic semantic unit (Chinese word level). Additionally, we introduce an image convolutional structure into the grid model to better capture the local relationships between characters within evaluation units. This innovative structure reduces the excessive reliance on pre-trained language models and emphasizes the modeling of structure and local relationships, thereby improving the performance of the model on Chinese financial texts. Through comparative experiments with advanced models such as Synchronous Double-channel Recurrent Network (SDRN) and Grid Tagging Scheme (GTS), the proposed GCGTS model demonstrates significant improvements in performance.
Prompt2Gaussia: Uncertain Prompt-learning for Script Event Prediction
methods: Uses public pre-trained language models as knowledge bases, automatically mines script-related knowledge via prompt learning, and represents uncertainty with Gaussian distributions
results: Outperforms prior baselines by 1.46% and 1.05% on two benchmarks, respectively
Abstract
Script Event Prediction (SEP) aims to predict the subsequent event for a given event chain from a candidate list. Prior research has achieved great success by integrating external knowledge to enhance the semantics, but it is laborious to acquire the appropriate knowledge resources and retrieve the script-related knowledge. In this paper, we regard public pre-trained language models as knowledge bases and automatically mine the script-related knowledge via prompt-learning. Still, the scenario-diversity and label-ambiguity in scripts make it uncertain to construct the most functional prompt and label token in prompt learning, i.e., prompt-uncertainty and verbalizer-uncertainty. Considering the innate ability of Gaussian distribution to express uncertainty, we deploy the prompt tokens and label tokens as random variables following Gaussian distributions, where a prompt estimator and a verbalizer estimator are proposed to estimate their probabilistic representations instead of deterministic representations. We take the lead to explore prompt-learning in SEP and provide a fresh perspective to enrich the script semantics. Our method is evaluated on the most widely used benchmark and a newly proposed large-scale one. Experiments show that our method, which benefits from knowledge evoked from pre-trained language models, outperforms prior baselines by 1.46% and 1.05% on two benchmarks, respectively.
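A minimal way to treat prompt tokens as "random variables following Gaussian distributions" is the reparameterization trick: learn a mean and log-variance per soft-prompt token and sample an embedding at each forward pass. The sketch below shows only that piece, with dimensions chosen arbitrarily; the paper's prompt and verbalizer estimators are not reproduced here.

```python
import torch
import torch.nn as nn

class GaussianPrompt(nn.Module):
    """Learnable Gaussian soft prompt: each of the n_tokens embeddings is
    sampled as mu + sigma * eps (reparameterization), so uncertainty about
    the right prompt is represented explicitly."""
    def __init__(self, n_tokens=8, dim=768):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.log_var = nn.Parameter(torch.zeros(n_tokens, dim))

    def forward(self, batch_size):
        std = torch.exp(0.5 * self.log_var)
        eps = torch.randn(batch_size, *self.mu.shape, device=self.mu.device)
        return self.mu + std * eps           # (batch, n_tokens, dim)

prompt = GaussianPrompt()
soft_tokens = prompt(batch_size=4)           # prepend to the input embeddings
print(soft_tokens.shape)                     # torch.Size([4, 8, 768])
```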
Causality Guided Disentanglement for Cross-Platform Hate Speech Detection
results: Experimental results show that the model detects hate speech across four different platforms more effectively than existing state-of-the-art methods.
Abstract
Social media platforms, despite their value in promoting open discourse, are often exploited to spread harmful content. Current deep learning and natural language processing models used for detecting this harmful content overly rely on domain-specific terms affecting their capabilities to adapt to generalizable hate speech detection. This is because they tend to focus too narrowly on particular linguistic signals or the use of certain categories of words. Another significant challenge arises when platforms lack high-quality annotated data for training, leading to a need for cross-platform models that can adapt to different distribution shifts. Our research introduces a cross-platform hate speech detection model capable of being trained on one platform's data and generalizing to multiple unseen platforms. To achieve good generalizability across platforms, one way is to disentangle the input representations into invariant and platform-dependent features. We also argue that learning causal relationships, which remain constant across diverse environments, can significantly aid in understanding invariant representations in hate speech. By disentangling input into platform-dependent features (useful for predicting hate targets) and platform-independent features (used to predict the presence of hate), we learn invariant representations resistant to distribution shifts. These features are then used to predict hate speech across unseen platforms. Our extensive experiments across four platforms highlight our model's enhanced efficacy compared to existing state-of-the-art methods in detecting generalized hate speech.
Seasonality Based Reranking of E-commerce Autocomplete Using Natural Language Queries
results: The study finds that incorporating seasonality signals into the autocomplete ranking model can improve the relevance of suggestions and business metrics.
Abstract
Query autocomplete (QAC), also known as typeahead, suggests a list of complete queries as the user types a prefix in the search box. It is one of the key features of modern search engines, especially in e-commerce. One of the goals of typeahead is to suggest relevant queries to users which are seasonally important. In this paper we propose a neural network based natural language processing (NLP) algorithm to incorporate seasonality as a signal and present an end-to-end evaluation of the QAC ranking model. Incorporating seasonality into the autocomplete ranking model can improve autocomplete relevance and business metrics.
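The abstract does not detail the ranking model, so the snippet below only illustrates the general idea of folding a seasonality signal into an autocomplete ranker: blend a base relevance score with a query-season affinity score. The weight `w` and both scoring functions are placeholders, not the proposed neural model.

```python
def rerank(candidates, base_score, seasonal_score, month, w=0.3):
    """candidates: completed queries for the typed prefix.
    base_score(q): lexical/behavioral relevance (assumed callable).
    seasonal_score(q, month): how strongly query q is associated with the
    current month (assumed callable, e.g. from historical demand curves)."""
    scored = [((1 - w) * base_score(q) + w * seasonal_score(q, month), q)
              for q in candidates]
    return [q for _, q in sorted(scored, reverse=True)]
```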
Efficient Sentiment Analysis: A Resource-Aware Evaluation of Feature Extraction Techniques, Ensembling, and Deep Learning Models
paper_authors: Mahammed Kamruzzaman, Gene Louis Kim
for: This paper evaluates document-level sentiment analysis models with attention to the resource costs that matter for model deployment.
methods: The paper compares different feature extraction techniques, ensembling, task-specific deep learning models, and domain-independent large language models.
results: The study finds that a fine-tuned large language model achieves the best accuracy, but some configurations provide huge (up to 24,283*) resource savings for only a marginal (<1%) loss in accuracy. For smaller datasets, the differences in accuracy shrink while the difference in resource consumption grows.
Abstract
While reaching for NLP systems that maximize accuracy, other important metrics of system performance are often overlooked. Prior models are easily forgotten despite their possible suitability in settings where large computing resources are unavailable or relatively more costly. In this paper, we perform a broad comparative evaluation of document-level sentiment analysis models with a focus on resource costs that are important for the feasibility of model deployment and general climate consciousness. Our experiments consider different feature extraction techniques, the effect of ensembling, task-specific deep learning modeling, and domain-independent large language models (LLMs). We find that while a fine-tuned LLM achieves the best accuracy, some alternate configurations provide huge (up to 24,283*) resource savings for a marginal (<1%) loss in accuracy. Furthermore, we find that for smaller datasets, the differences in accuracy shrink while the difference in resource consumption grows further.
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
for: The paper is written to improve the sample efficiency of language models.
methods: The paper uses an ensemble of a GPT-2 and small LLaMA models, as well as distillation techniques.
results: The distilled LLaMA model exceeds the performance of both of its teachers and a similar model trained without distillation.
Abstract
We present our proposed solution to the BabyLM challenge [arXiv:2301.11796], whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
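A minimal sketch of a distillation objective of this kind, assuming the teacher ensemble is combined by averaging logits and distilled with the usual temperature-scaled KL term plus the standard cross-entropy; the specific loss weights, temperature, and combination rule used for Baby Llama are not claimed here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=2.0, alpha=0.5):
    """alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher || student),
    with the teacher distribution taken as the mean of the ensemble's logits.
    Shapes: logits are (batch * seq_len, vocab), labels are (batch * seq_len,)."""
    teacher_logits = torch.stack(teacher_logits_list).mean(dim=0)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kd
```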
Federated Representation Learning for Automatic Speech Recognition
for: This paper is written for learning Automatic Speech Recognition (ASR) representations while preserving data privacy using Federated Learning (FL) and Self-supervised Learning (SSL).
methods: The paper uses the Contrastive Predictive Coding framework with FedSGD to pre-train an LSTM encoder on unlabeled speech data from Libri-Light, simulating non-IID speaker-siloed data distributions.
results: The pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. The federated pre-trained models are also adapted to a new language, French, and show a 20% (WER) improvement over no pre-training.
Abstract
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new language, French, and show a 20% (WER) improvement over no pre-training.
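FedSGD itself is simple to sketch: each (speaker-siloed) client computes gradients on its local unlabeled audio, and the server averages them into a single update. The code below is a generic illustration with placeholder client batches and loss function, not the paper's training setup.

```python
import torch

def fedsgd_round(global_model, client_batches, loss_fn, lr=1e-4):
    """One FedSGD round: average per-client gradients, apply one server step.
    client_batches: one local batch per client/silo (assumed).
    loss_fn(model, batch): e.g. a CPC-style contrastive loss (assumed)."""
    params = [p for p in global_model.parameters() if p.requires_grad]
    grad_sums = [torch.zeros_like(p) for p in params]

    for batch in client_batches:
        global_model.zero_grad()
        loss_fn(global_model, batch).backward()
        for g_sum, p in zip(grad_sums, params):
            g_sum += p.grad.detach()

    with torch.no_grad():                    # server update = mean client gradient
        for p, g_sum in zip(params, grad_sums):
            p -= lr * g_sum / len(client_batches)
```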
Bengali Fake Reviews: A Benchmark Dataset and Detection System
paper_authors: G. M. Shahariar, Md. Tanvir Rouf Shawon, Faisal Muhammad Shah, Mohammad Shafiul Alam, Md. Shahriar Mahbub
for: This paper aims to identify fake reviews in the Bengali language, which is an under-explored research area in the field of fake review detection.
methods: The authors propose a unique pipeline to translate English words to their corresponding Bengali meaning and back transliterate Romanized Bengali to Bengali. They also use multiple deep learning and pre-trained transformer language models to develop a reliable detection system.
results: The proposed ensemble model achieved a weighted F1-score of 0.9843 on 13390 reviews, including 1339 actual fake reviews and 5356 augmented fake reviews generated with the nlpaug library. The model achieved a 0.9558 weighted F1-score when the fake reviews were augmented using the bnaug library.
Abstract
The proliferation of fake reviews on various online platforms has created a major concern for both consumers and businesses. Such reviews can deceive customers and cause damage to the reputation of products or services, making it crucial to identify them. Although the detection of fake reviews has been extensively studied in English language, detecting fake reviews in non-English languages such as Bengali is still a relatively unexplored research area. This paper introduces the Bengali Fake Review Detection (BFRD) dataset, the first publicly available dataset for identifying fake reviews in Bengali. The dataset consists of 7710 non-fake and 1339 fake food-related reviews collected from social media posts. To convert non-Bengali words in a review, a unique pipeline has been proposed that translates English words to their corresponding Bengali meaning and also back transliterates Romanized Bengali to Bengali. We have conducted rigorous experimentation using multiple deep learning and pre-trained transformer language models to develop a reliable detection system. Finally, we propose a weighted ensemble model that combines four pre-trained transformers: BanglaBERT, BanglaBERT Base, BanglaBERT Large, and BanglaBERT Generator . According to the experiment results, the proposed ensemble model obtained a weighted F1-score of 0.9843 on 13390 reviews, including 1339 actual fake reviews and 5356 augmented fake reviews generated with the nlpaug library. The remaining 6695 reviews were randomly selected from the 7710 non-fake instances. The model achieved a 0.9558 weighted F1-score when the fake reviews were augmented using the bnaug library.
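The weighted ensemble can be illustrated as soft voting over the four fine-tuned transformers' class probabilities. The weights and the dummy probability matrices below are placeholders for illustration, not the authors' released configuration.

```python
import numpy as np

def weighted_ensemble(prob_matrices, weights):
    """prob_matrices: list of (n_reviews, 2) arrays of [genuine, fake] probabilities,
    one per model (e.g. BanglaBERT, BanglaBERT Base, BanglaBERT Large,
    BanglaBERT Generator). weights: one non-negative weight per model,
    e.g. its validation F1 (assumed)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    combined = sum(w * p for w, p in zip(weights, prob_matrices))
    return combined.argmax(axis=1)           # 1 = predicted fake

# Usage sketch with dummy numbers for two reviews and four models:
probs = [np.array([[0.7, 0.3], [0.2, 0.8]]) for _ in range(4)]
print(weighted_ensemble(probs, weights=[0.96, 0.95, 0.97, 0.94]))
```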
Athena 2.0: Discourse and User Modeling in Open Domain Dialogue
results: The paper shows that Athena 2.0 can hold coherent conversations on a range of popular topics and can personalize the conversation to individual users.
Abstract
Conversational agents are consistently growing in popularity and many people interact with them every day. While many conversational agents act as personal assistants, they can have many different goals. Some are task-oriented, such as providing customer support for a bank or making a reservation. Others are designed to be empathetic and to form emotional connections with the user. The Alexa Prize Challenge aims to create a socialbot, which allows the user to engage in coherent conversations, on a range of popular topics that will interest the user. Here we describe Athena 2.0, UCSC's conversational agent for Amazon's Socialbot Grand Challenge 4. Athena 2.0 utilizes a novel knowledge-grounded discourse model that tracks the entity links that Athena introduces into the dialogue, and uses them to constrain named-entity recognition and linking, and coreference resolution. Athena 2.0 also relies on a user model to personalize topic selection and other aspects of the conversation to individual users.
Tag Prediction of Competitive Programming Problems using Deep Learning Techniques
results: Experimental results show that the MLP model achieves the highest accuracy, 78.0%.
Abstract
In the past decade, the amount of research being done in the fields of machine learning and deep learning, predominantly in the area of natural language processing (NLP), has risen dramatically. A well-liked method for developing programming abilities like logic building and problem solving is competitive programming. It can be tough for novices and even veteran programmers to traverse the wide collection of questions due to the massive number of accessible questions and the variety of themes, levels of difficulty, and questions offered. In order to help programmers find questions that are appropriate for their knowledge and interests, there is a need for an automated method. This can be done using automated tagging of the questions using Text Classification. Text classification is one of the important tasks widely researched in the field of Natural Language Processing. In this paper, we present a way to use text classification techniques to determine the domain of a competitive programming problem. A variety of models, including LSTM, GRU, and MLP, are implemented. The dataset has been scraped from Codeforces, a major competitive programming website. A total of 2400 problems were scraped and preprocessed, which we used as a dataset for our training and testing of models. The maximum accuracy reached using our model is 78.0% by MLP (Multi Layer Perceptron).
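A minimal version of such a tag-prediction pipeline, assuming problem statements and a single domain label per problem (the actual Codeforces scrape and preprocessing are not reproduced); scikit-learn's MLPClassifier stands in here for the MLP described in the paper, and the toy statements are invented examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# statements: list of problem texts; tags: one domain label per problem (assumed).
statements = ["find the shortest path between two nodes",
              "count subsets summing to k"]
tags = ["graphs", "dp"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),      # bag-of-words features
    MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=300, random_state=0),
)
model.fit(statements, tags)
print(model.predict(["minimum spanning tree of a weighted graph"]))
```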
Wider and Deeper LLM Networks are Fairer LLM Evaluators
results: The study finds that wider and deeper networks lead to fairer evaluations. In addition, WideDeep speeds up the evaluation process and reaches a 93% agreement level with humans.
Abstract
Measuring the quality of responses generated by LLMs is a challenging task, particularly when it comes to evaluating whether the response is aligned with human preference. A novel approach involves using the LLM itself to make evaluation and stabilizing the results through multiple independent evaluations, similar to a single-layer narrow LLM network. This network consists of a fixed number of neurons, with each neuron being the same LLM. In this paper, we draw upon the extensive research on deep neural networks to explore whether deeper and wider networks can lead to fairer evaluations. Specifically, inspired by the observation that different neurons in a neural network are responsible for detecting different concepts, we first adaptively generate as many neuron roles as possible for each evaluation sample. Each perspective corresponds to the role of a specific LLM neuron in the first layer. In subsequent layers, we follow the idea that higher layers in deep networks are responsible for more comprehensive features, each layer receives representations from all neurons in the previous layer, integrating the locally learned evaluation information to obtain a more comprehensive evaluation result. Interestingly, this network design resembles the process of academic paper reviewing. To validate the effectiveness of our method, we construct the largest and most diverse English evaluation benchmark LLMEval$^2$ for LLM evaluators, comprising 15 tasks, 8 abilities, and 2,553 samples. Experimental results demonstrate that a wider network (involving many reviewers) with 2 layers (one round of discussion) performs the best, improving kappa correlation coefficient from 0.28 to 0.34. We also leverage WideDeep to aid in the assessment of Chinese LLMs, which has accelerated the evaluation time by 4.6 times, resulting in a 60% cost saving. WideDeep achieves a remarkable 93% agreement level among humans.
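A stripped-down sketch of the "wide then deep" evaluation network, assuming a hypothetical `llm(prompt)` callable that returns text: the first layer asks several role-conditioned "neurons" for independent judgments, and the second layer asks one call to integrate them. The role names, prompts, and aggregation are illustrative only.

```python
def wide_deep_evaluate(question, answer_a, answer_b, llm, roles):
    """Layer 1: one judgment per role ('neuron'); layer 2: integrate them."""
    first_layer = []
    for role in roles:          # e.g. "factuality reviewer", "clarity reviewer"
        prompt = (f"You are a {role}. Question: {question}\n"
                  f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
                  "Which answer is better and why?")
        first_layer.append(llm(prompt))

    summary = "\n".join(f"Reviewer {i + 1}: {opinion}"
                        for i, opinion in enumerate(first_layer))
    final_prompt = ("Integrate the reviews below into one verdict; "
                    "reply with exactly 'A' or 'B'.\n" + summary)
    return llm(final_prompt).strip()
```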
Curricular Transfer Learning for Sentence Encoded Tasks
results: In our experiments, we obtain a considerable improvement over other known pre-training approaches on the MultiWoZ task.
Abstract
Fine-tuning language models in a downstream task is the standard approach for many state-of-the-art methodologies in the field of NLP. However, when the distribution between the source task and target task drifts, \textit{e.g.}, conversational environments, these gains tend to be diminished. This article proposes a sequence of pre-training steps (a curriculum) guided by "data hacking" and grammar analysis that allows further gradual adaptation between pre-training distributions. In our experiments, we acquire a considerable improvement from our method compared to other known pre-training approaches for the MultiWoZ task.
XNLP: An Interactive Demonstration System for Universal Structured NLP
results: The system advances the XNLP demonstration platform in multiple aspects, including universal XNLP modeling, high performance, interpretability, scalability, and interactivity, providing researchers with a unified platform for exploring diverse XNLP tasks.
Abstract
Structured Natural Language Processing (XNLP) is an important subset of NLP that entails understanding the underlying semantic or syntactic structure of texts, which serves as a foundational component for many downstream applications. Despite certain recent efforts to explore universal solutions for specific categories of XNLP tasks, a comprehensive and effective approach for unifying all XNLP tasks long remains underdeveloped. In the meanwhile, while XNLP demonstration systems are vital for researchers exploring various XNLP tasks, existing platforms can be limited to, e.g., supporting few XNLP tasks, lacking interactivity and universalness. To this end, we propose an advanced XNLP demonstration platform, where we propose leveraging LLM to achieve universal XNLP, with one model for all with high generalizability. Overall, our system advances in multiple aspects, including universal XNLP modeling, high performance, interpretability, scalability, and interactivity, providing a unified platform for exploring diverse XNLP tasks in the community. XNLP is online: https://xnlp.haofei.vip