cs.CL - 2023-09-17

Augmenting text for spoken language understanding with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09390
  • repo_url: None
  • paper_authors: Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer
  • for: This work addresses the challenge of training robust spoken semantic parsing models for existing and new application domains, which normally requires speech-transcript-semantic parse triplets that are expensive to obtain.
  • methods: The study examines ways to use transcript-semantic parse data without corresponding speech (unpaired text): Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared for generating speech representations from existing textual corpora, and Large Language Models (LLMs) such as Llama 2.0 are prompted with examples and intent-co-occurring words to generate unpaired text when no suitable corpora exist (a prompt-construction sketch follows below).
  • results: On the STOP dataset, unpaired text drawn from existing textual corpora improves absolute Exact Match (EM) by 2% for existing domains and 30% for new domains; LLM-generated text used with JAT and TTS improves EM by 1.4% and 2.6% absolute, respectively.
    Abstract Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data or extending to new domains requires corresponding triplets of speech-transcript-semantic parse data, which is expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data (unpaired text) without corresponding speech. First, when unpaired text is drawn from existing textual corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways to generate speech representations for unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we consider the setting when unpaired text is not available in existing textual corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains respectively.
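The abstract describes prompting an LLM with example utterances and words that co-occur with an intent to obtain unpaired text. The sketch below shows one plausible way to assemble such a prompt; the function name, field layout, and wording are illustrative assumptions, not the paper's actual prompts.

```python
def build_unpaired_text_prompt(domain: str, intent: str,
                               co_occurring_words: list[str],
                               example_utterances: list[str],
                               n: int = 10) -> str:
    """Assemble a prompt asking an LLM (e.g., Llama 2.0) to generate new
    utterances for a target intent. A sketch under assumptions; the paper's
    exact prompt format is not reproduced here."""
    examples = "\n".join(f"- {u}" for u in example_utterances)
    words = ", ".join(co_occurring_words)
    return (
        f"You write user requests for a voice assistant in the '{domain}' domain.\n"
        f"Intent: {intent}\n"
        f"Words that often co-occur with this intent: {words}\n"
        f"Example requests:\n{examples}\n"
        f"Write {n} new, varied requests for this intent, one per line."
    )
```

The generated utterances would then be given speech representations via JAT or TTS before being used for spoken semantic parsing training.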

Mitigating Shortcuts in Language Models with Soft Label Encoding

  • paper_url: http://arxiv.org/abs/2309.09380
  • repo_url: None
  • paper_authors: Zirui He, Huiqi Deng, Haiyan Zhao, Ninghao Liu, Mengnan Du
  • for: This work aims to reduce large language models' reliance on spurious correlations (shortcuts) in natural language understanding (NLU) tasks.
  • methods: The authors propose a simple yet effective debiasing framework named Soft Label Encoding (SoftLE). A teacher model trained with hard labels first estimates each sample's degree of reliance on shortcuts; a dummy class is then added to encode this shortcut degree and used to smooth the other dimensions of the ground-truth label into a soft label, which in turn trains a more robust student model (a sketch of the encoding step follows below).
  • results: Extensive experiments on two NLU benchmark tasks show that SoftLE significantly improves out-of-distribution generalization while maintaining satisfactory in-distribution accuracy.
    Abstract Recent research has shown that large language models rely on spurious correlations in the data for natural language understanding (NLU) tasks. In this work, we aim to answer the following research question: Can we reduce spurious correlations by modifying the ground truth labels of the training data? Specifically, we propose a simple yet effective debiasing framework, named Soft Label Encoding (SoftLE). We first train a teacher model with hard labels to determine each sample's degree of relying on shortcuts. We then add one dummy class to encode the shortcut degree, which is used to smooth other dimensions in the ground truth label to generate soft labels. This new ground truth label is used to train a more robust student model. Extensive experiments on two NLU benchmark tasks demonstrate that SoftLE significantly improves out-of-distribution generalization while maintaining satisfactory in-distribution accuracy.
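To make the label-encoding step concrete, here is a minimal sketch of how a SoftLE-style soft label with an extra dummy class might be built from a teacher model's output. Using the teacher's confidence on the gold class as the shortcut degree, and the particular smoothing formula, are assumptions; the paper's exact definitions may differ.

```python
import numpy as np

def soft_label_encode(hard_label: int, teacher_probs: np.ndarray,
                      num_classes: int) -> np.ndarray:
    """Build a soft label with one extra dummy class encoding the shortcut degree.

    Assumption: the teacher's confidence on the gold class is used as a proxy
    for how much the sample can be solved via shortcuts.
    """
    shortcut_degree = float(teacher_probs[hard_label])

    soft = np.zeros(num_classes + 1)
    soft[num_classes] = shortcut_degree        # dummy class stores the shortcut degree
    soft[hard_label] = 1.0 - shortcut_degree   # remaining mass stays on the gold class
    return soft

# Example: a 3-class task where the teacher is 90% confident on gold class 1.
print(soft_label_encode(1, np.array([0.05, 0.90, 0.05]), num_classes=3))
# -> [0.  0.1 0.  0.9]
```

The student model is then trained against these soft labels instead of the original one-hot targets.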

Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

  • paper_url: http://arxiv.org/abs/2309.09369
  • repo_url: None
  • paper_authors: Kung-Hsiang Huang, Philippe Laban, Alexander R. Fabbri, Prafulla Kumar Choubey, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
  • for: This work proposes a new multi-document news summarization task: summarizing the diverse information dispersed across multiple articles covering the same event.
  • methods: The authors design a data collection schema and curate DiverseSumm, a dataset of 245 news stories, each comprising 10 news articles and paired with a human-validated reference. They also analyze position and verbosity biases of Large Language Model (LLM)-based metrics for evaluating summary coverage and faithfulness, and how these metrics correlate with human assessments.
  • results: Despite LLMs' strong single-document summarization abilities, the proposed task remains challenging for them, mainly because of limited coverage: GPT-4 covers less than 40% of the diverse information on average.
    Abstract Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, to our knowledge, the summarization of diverse information dispersed across multiple articles about an event has not been previously investigated. The latter imposes a different set of challenges for a summarization model. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Moreover, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of the summaries, as well as their correlation with human assessments. We applied our findings to study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover less than 40% of the diverse information on average.

Language models are susceptible to incorrect patient self-diagnosis in medical applications

  • paper_url: http://arxiv.org/abs/2309.09362
  • repo_url: None
  • paper_authors: Rojin Ziaei, Samuel Schmidgall
  • for: This work examines the reliability of large language models (LLMs) in healthcare, in particular how they hold up under the complexity of real patient-doctor interactions.
  • methods: A variety of LLMs are presented with multiple-choice questions from United States medical board exams that have been modified to include patient self-diagnostic reports, simulating patients who attempt to diagnose their own conditions.
  • results: When a patient proposes incorrect, bias-validating information, the diagnostic accuracy of the LLMs drops dramatically, revealing a high susceptibility to errors introduced by self-diagnosis.
    Abstract Large language models (LLMs) are becoming increasingly relevant as a potential tool for healthcare, aiding communication between clinicians, researchers, and patients. However, traditional evaluations of LLMs on medical exam questions do not reflect the complexity of real patient-doctor interactions. An example of this complexity is the introduction of patient self-diagnosis, where a patient attempts to diagnose their own medical conditions from various sources. While the patient sometimes arrives at an accurate conclusion, they more often are led toward misdiagnosis due to the patient's over-emphasis on bias validating information. In this work we present a variety of LLMs with multiple-choice questions from United States medical board exams which are modified to include self-diagnostic reports from patients. Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drop dramatically, revealing a high susceptibility to errors in self-diagnosis.

Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading

  • paper_url: http://arxiv.org/abs/2309.09338
  • repo_url: None
  • paper_authors: Gerd Kortemeyer
  • for: This paper investigates the performance of the pre-trained GPT-4 LLM on Automated Short Answer Grading (ASAG).
  • methods: GPT-4 is evaluated on the standard 2-way and 3-way SciEntsBank and Beetle benchmarks, grading the alignment of student answers with a reference answer and, additionally, with the reference answer withheld; it is compared against hand-engineered models and specially trained LLMs (a sketch of the grading setup follows below).
  • results: The pre-trained general-purpose GPT-4 performs comparably to hand-engineered models but worse than pre-trained LLMs that had specialized training.
    Abstract Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses in spite of limited availability of human graders. Over the years, carefully trained models have achieved increasingly higher levels of performance. More recently, pre-trained Large Language Models (LLMs) emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.
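The grading setup essentially reduces to prompting a general-purpose LLM with the question, the student answer, and (optionally) the reference answer. The sketch below shows that setup with an abstract `llm` callable; the prompt wording and the binary label scheme are assumptions, not the paper's exact protocol.

```python
from typing import Callable, Optional

def grade_short_answer(question: str, student_answer: str,
                       reference_answer: Optional[str],
                       llm: Callable[[str], str]) -> str:
    """Ask an LLM to grade a short answer (2-way: correct / incorrect).

    `llm` is any function mapping a prompt string to a completion string,
    e.g., a thin wrapper around a GPT-4 chat call. Passing reference_answer=None
    corresponds to the withheld-reference condition studied in the paper.
    """
    prompt = f"Question: {question}\nStudent answer: {student_answer}\n"
    if reference_answer is not None:
        prompt += f"Reference answer: {reference_answer}\n"
    prompt += "Is the student answer correct? Reply with exactly 'correct' or 'incorrect'."
    return llm(prompt).strip().lower()
```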

A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformers

  • paper_url: http://arxiv.org/abs/2309.09329
  • repo_url: None
  • paper_authors: Paleti Nikhil Chowdary, Vadlapudi Sai Aravind, Gorantla V N S L Vishnu Vardhan, Menta Sai Akshay, Menta Sai Aashish, Jyothish Lal. G
  • for: Detecting dysarthric speech and its intelligibility level, so that treatment plans can be developed and people's ability to communicate and quality of life improved.
  • methods: A few-shot approach with a transformer model (whisper-large-v2) classifies the presence of dysarthria and its intelligibility level from limited data, while also avoiding the data leakage present in previous studies (a sketch of a Whisper-based classifier follows below).
  • results: The whisper-large-v2 model trained on a subset of the UASpeech dataset containing medium-intelligibility patients reaches 85% accuracy, 0.92 precision, 0.8 recall, 0.85 F1-score, and 0.91 specificity. The model trained on the 'words' data performs better than those trained on the 'letters' and 'digits' data, and the multiclass model reaches 67% accuracy.
    Abstract Dysarthria is a speech disorder that hinders communication due to difficulties in articulating words. Detection of dysarthria is important for several reasons as it can be used to develop a treatment plan and help improve a person's quality of life and ability to communicate effectively. Much of the literature focused on improving ASR systems for dysarthric speech. The objective of the current work is to develop models that can accurately classify the presence of dysarthria and also give information about the intelligibility level using limited data by employing a few-shot approach using a transformer model. This work also aims to tackle the data leakage that is present in previous studies. Our whisper-large-v2 transformer model trained on a subset of the UASpeech dataset containing medium intelligibility level patients achieved an accuracy of 85%, precision of 0.92, recall of 0.8 F1-score of 0.85, and specificity of 0.91. Experimental results also demonstrate that the model trained using the 'words' dataset performed better compared to the model trained on the 'letters' and 'digits' dataset. Moreover, the multiclass model achieved an accuracy of 67%.
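As a rough illustration of using whisper-large-v2 for classification rather than ASR, the sketch below attaches a linear head to the Whisper encoder. The mean-pooling, head, and fine-tuning choices are assumptions; the paper does not spell out this exact architecture.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class DysarthriaClassifier(nn.Module):
    """Whisper encoder + linear head for dysarthria / intelligibility classification.
    A sketch under assumptions, not the paper's exact setup."""
    def __init__(self, checkpoint: str = "openai/whisper-large-v2", num_classes: int = 2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.head = nn.Linear(self.encoder.config.d_model, num_classes)

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: log-mel features from WhisperFeatureExtractor, shape (B, n_mels, T)
        hidden = self.encoder(input_features).last_hidden_state   # (B, T', d_model)
        return self.head(hidden.mean(dim=1))                      # mean-pool over time
```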

How People Perceive The Dynamic Zero-COVID Policy: A Retrospective Analysis From The Perspective of Appraisal Theory

  • paper_url: http://arxiv.org/abs/2309.09324
  • repo_url: None
  • paper_authors: Na Yang, Kyrie Zhixuan Zhou, Yunzhe Li
  • for: This paper retrospectively studies how public sentiment toward China's three-year Dynamic Zero-COVID Policy evolved over time and how it related to people's lived experiences.
  • methods: Sentiment analysis of 2,358 collected Weibo posts, followed by an in-depth discourse analysis through the lens of appraisal theory.
  • results: Four representative points are identified: policy initialization, a sharp sentiment change, the lowest sentiment score, and policy termination. The findings suggest implications for effective epidemic prevention and control measures in future crises.
    Abstract The Dynamic Zero-COVID Policy in China spanned three years and diverse emotional responses have been observed at different times. In this paper, we retrospectively analyzed public sentiments and perceptions of the policy, especially regarding how they evolved over time, and how they related to people's lived experiences. Through sentiment analysis of 2,358 collected Weibo posts, we identified four representative points, i.e., policy initialization, sharp sentiment change, lowest sentiment score, and policy termination, for an in-depth discourse analysis through the lens of appraisal theory. In the end, we reflected on the evolving public sentiments toward the Dynamic Zero-COVID Policy and proposed implications for effective epidemic prevention and control measures for future crises.

A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models

  • paper_url: http://arxiv.org/abs/2309.10003
  • repo_url: None
  • paper_authors: Sébastien Ragot
  • for: This work measures the scope of a patent claim. It is grounded in information theory and the assumption that a rare concept is more informative than a common one because it is more surprising.
  • methods: Claim scope is defined as the reciprocal of the claim's self-information, computed from the claim's probability of occurrence under a language model. Five language models are considered, from the simplest (each word or character drawn from a uniform distribution), through intermediate models (average word or character frequencies), to a large language model, GPT-2 (a sketch of the GPT-2 variant follows below).
  • results: On nine series of patent claims with gradually decreasing scope, the more sophisticated the model, the better the results: GPT-2 outperforms the models based on word and character frequencies, which in turn outperform models based on word and character counts.
    Abstract This work proposes to measure the scope of a patent claim as the reciprocal of the self-information contained in this claim. Grounded in information theory, this approach is based on the assumption that a rare concept is more informative than a usual concept, inasmuch as it is more surprising. The self-information is calculated from the probability of occurrence of that claim, where the probability is calculated in accordance with a language model. Five language models are considered, ranging from the simplest models (each word or character is drawn from a uniform distribution) to intermediate models (using average word or character frequencies), to a large language model (GPT2). Interestingly, the simplest language models reduce the scope measure to the reciprocal of the word or character count, a metric already used in previous works. Application is made to nine series of patent claims directed to distinct inventions, where the claims in each series have a gradually decreasing scope. The performance of the language models is then assessed with respect to several ad hoc tests. The more sophisticated the model, the better the results. The GPT2 model outperforms models based on word and character frequencies, which are themselves ahead of models based on word and character counts.
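A minimal sketch of the GPT-2 variant of the measure: scope is the reciprocal of the claim's self-information, i.e. 1 / (-log p(claim)) under the language model. The tokenization details, the log base (nats here), and the handling of the first token are assumptions; only the overall formula comes from the abstract.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def claim_scope(claim: str) -> float:
    """Scope = 1 / self-information, with self-information = -log p(claim) under GPT-2."""
    enc = tokenizer(claim, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token (in nats);
    # multiplying by the number of predicted tokens gives -log p(claim).
    num_predicted = enc["input_ids"].shape[1] - 1
    self_information = out.loss.item() * num_predicted
    return 1.0 / self_information

# Under the paper's hypothesis, a broad claim should score a larger scope than a narrow one.
print(claim_scope("A device comprising a processor."))
print(claim_scope("A device comprising a processor configured to compress genomic data."))
```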

AutoAM: An End-To-End Neural Model for Automatic and Universal Argument Mining

  • paper_url: http://arxiv.org/abs/2309.09300
  • repo_url: None
  • paper_authors: Lang Cao
  • for: This paper proposes AutoAM, an end-to-end neural model for argument mining, addressing the immaturity of existing techniques, their focus on tree-structured arguments, and their inability to accurately capture argument relations and predict relation types.
  • methods: An argument component attention mechanism captures the relevant information between argument components, and a universal end-to-end framework completes the three argument mining subtasks in one model, without constraints such as a tree structure.
  • results: Experiments show that the model outperforms existing work on several metrics across two public datasets.
    Abstract Argument mining is to analyze argument structure and extract important argument information from unstructured text. An argument mining system can help people automatically gain causal and logical information behind the text. As argumentative corpus gradually increases, like more people begin to argue and debate on social media, argument mining from them is becoming increasingly critical. However, argument mining is still a big challenge in natural language tasks due to its difficulty, and relative techniques are not mature. For example, research on non-tree argument mining needs to be done more. Most works just focus on extracting tree structure argument information. Moreover, current methods cannot accurately describe and capture argument relations and do not predict their types. In this paper, we propose a novel neural model called AutoAM to solve these problems. We first introduce the argument component attention mechanism in our model. It can capture the relevant information between argument components, so our model can better perform argument mining. Our model is a universal end-to-end framework, which can analyze argument structure without constraints like tree structure and complete three subtasks of argument mining in one model. The experiment results show that our model outperforms the existing works on several metrics in two public datasets.

OWL: A Large Language Model for IT Operations

  • paper_url: http://arxiv.org/abs/2309.09298
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, Xu Shi, Tieqiao Zheng, Liangfan Zheng, Bo Zhang, Ke Xu, Zhoujun Li
  • for: This paper explores specialized large language models for IT operations.
  • methods: OWL is a large language model trained on the authors' OWL-Instruct dataset, which covers a wide range of IT-related information; a mixture-of-adapter strategy is proposed to improve parameter-efficient tuning across different domains or tasks (a generic sketch of the idea follows below).
  • results: On the authors' OWL-Bench and on open IT-related benchmarks, OWL outperforms existing models on IT tasks by significant margins; the authors hope the findings help advance IT operations techniques with specialized LLMs.
    Abstract With the rapid development of IT operations, it has become increasingly crucial to efficiently manage and analyze large volumes of data for practical applications. The techniques of Natural Language Processing (NLP) have shown remarkable capabilities for various tasks, including named entity recognition, machine translation and dialogue systems. Recently, Large Language Models (LLMs) have achieved significant improvements across various NLP downstream tasks. However, there is a lack of specialized LLMs for IT operations. In this paper, we introduce the OWL, a large language model trained on our collected OWL-Instruct dataset with a wide range of IT-related information, where the mixture-of-adapter strategy is proposed to improve the parameter-efficient tuning across different domains or tasks. Furthermore, we evaluate the performance of our OWL on the OWL-Bench established by us and open IT-related benchmarks. OWL demonstrates superior performance results on IT tasks, which outperforms existing models by significant margins. Moreover, we hope that the findings of our work will provide more insights to revolutionize the techniques of IT operations with specialized LLMs.
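The abstract only names a "mixture-of-adapter strategy" without details, so the sketch below shows a generic mixture-of-adapters layer: several bottleneck adapters whose outputs are combined by a learned gate and added back residually. The gating scheme, bottleneck size, and number of adapters are assumptions; OWL's actual design may differ.

```python
import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    """Generic mixture-of-adapters layer (illustrative, not OWL's exact module)."""
    def __init__(self, hidden_size: int, bottleneck: int = 64, num_adapters: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, hidden_size))
            for _ in range(num_adapters)
        )
        self.gate = nn.Linear(hidden_size, num_adapters)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size), e.g., the output of a frozen transformer layer
        weights = torch.softmax(self.gate(x), dim=-1)                 # (B, T, A)
        outs = torch.stack([a(x) for a in self.adapters], dim=-1)     # (B, T, H, A)
        mixed = (outs * weights.unsqueeze(2)).sum(dim=-1)             # (B, T, H)
        return x + mixed                                              # residual connection
```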

Model-based Subsampling for Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2309.09296
  • repo_url: https://github.com/xincanfeng/ms_kge
  • paper_authors: Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
  • for: Improving Knowledge Graph Embedding (KGE) models by reducing the overfitting caused by the sparsity of Knowledge Graph (KG) datasets.
  • methods: Model-based Subsampling (MBS) and Mixed Subsampling (MIX) estimate the appearance probabilities of infrequent queries from the predictions of a KGE model, instead of relying on raw frequencies alone (a sketch of the idea follows below).
  • results: Evaluation on FB15k-237, WN18RR, and YAGO3-10 shows that the proposed subsampling methods improve KG completion performance for popular KGE models (RotatE, TransE, HAKE, ComplEx, and DistMult).
    Abstract Subsampling is effective in Knowledge Graph Embedding (KGE) for reducing overfitting caused by the sparsity in Knowledge Graph (KG) datasets. However, current subsampling approaches consider only frequencies of queries that consist of entities and their relations. Thus, the existing subsampling potentially underestimates the appearance probabilities of infrequent queries even if the frequencies of their entities or relations are high. To address this problem, we propose Model-based Subsampling (MBS) and Mixed Subsampling (MIX) to estimate their appearance probabilities through predictions of KGE models. Evaluation results on datasets FB15k-237, WN18RR, and YAGO3-10 showed that our proposed subsampling methods actually improved the KG completion performances for popular KGE models, RotatE, TransE, HAKE, ComplEx, and DistMult.
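A rough sketch of the core idea: replace count-based appearance probabilities with probabilities estimated from a trained KGE model's scores, then derive subsampling weights from them. The softmax over queries, the temperature, and the exponent are assumptions made for illustration; the paper's exact MBS and MIX formulations are not reproduced here.

```python
import torch

def model_based_subsampling_weights(query_scores: torch.Tensor,
                                    temperature: float = 1.0,
                                    alpha: float = 0.5) -> torch.Tensor:
    """Turn KGE-model scores for a set of queries into subsampling weights.

    query_scores: plausibility scores assigned by a trained KGE model (one per query),
    e.g., RotatE or ComplEx scores; higher means more plausible. The softmax turns
    them into appearance-probability estimates, which then play the role that raw
    counts play in frequency-based subsampling (rarer queries get larger weights).
    """
    probs = torch.softmax(query_scores / temperature, dim=0)
    weights = probs.pow(-alpha)
    return weights / weights.sum()

# Example: of three queries, the one scored least plausible gets the largest weight.
print(model_based_subsampling_weights(torch.tensor([2.0, 1.0, -1.0])))
```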

Leveraging Social Discourse to Measure Check-worthiness of Claims for Fact-checking

  • paper_url: http://arxiv.org/abs/2309.09274
  • repo_url: None
  • paper_authors: Megha Sundriyal, Md Shad Akhtar, Tanmoy Chakraborty
  • for: This work proposes a fine-grained claim check-worthiness task to help prioritize which claims fact-checkers should verify first.
  • methods: The authors curate CheckIt, a large manually annotated Twitter dataset, and benchmark it with CheckMate, a unified approach that jointly determines whether a claim is check-worthy and the factors that led to that conclusion (a generic joint-model sketch follows below).
  • results: The suggested system is compared with several baselines; a thorough analysis and human assessment validate the benefit of integrating check-worthiness factors when detecting claims worth fact-checking.
    Abstract The expansion of online social media platforms has led to a surge in online content consumption. However, this has also paved the way for disseminating false claims and misinformation. As a result, there is an escalating demand for a substantial workforce to sift through and validate such unverified claims. Currently, these claims are manually verified by fact-checkers. Still, the volume of online content often outweighs their potency, making it difficult for them to validate every single claim in a timely manner. Thus, it is critical to determine which assertions are worth fact-checking and prioritize claims that require immediate attention. Multiple factors contribute to determining whether a claim necessitates fact-checking, encompassing factors such as its factual correctness, potential impact on the public, the probability of inciting hatred, and more. Despite several efforts to address claim check-worthiness, a systematic approach to identify these factors remains an open challenge. To this end, we introduce a new task of fine-grained claim check-worthiness, which underpins all of these factors and provides probable human grounds for identifying a claim as check-worthy. We present CheckIt, a manually annotated large Twitter dataset for fine-grained claim check-worthiness. We benchmark our dataset against a unified approach, CheckMate, that jointly determines whether a claim is check-worthy and the factors that led to that conclusion. We compare our suggested system with several baseline systems. Finally, we report a thorough analysis of results and human assessment, validating the efficacy of integrating check-worthiness factors in detecting claims worth fact-checking.
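The abstract describes jointly predicting check-worthiness and the contributing factors. Below is a generic sketch of such a joint architecture: a shared encoder with a binary check-worthiness head and a multi-label factor head. The encoder checkpoint, the number of factors, and the heads are assumptions; CheckMate's actual architecture may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class JointCheckWorthiness(nn.Module):
    """Shared encoder with a check-worthiness head and a factor head (illustrative)."""
    def __init__(self, checkpoint: str = "bert-base-uncased", num_factors: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.worthiness_head = nn.Linear(hidden, 2)        # check-worthy vs. not
        self.factor_head = nn.Linear(hidden, num_factors)  # e.g., harm, public impact, ...

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        worthiness_logits = self.worthiness_head(cls)
        factor_probs = torch.sigmoid(self.factor_head(cls))   # multi-label factors
        return worthiness_logits, factor_probs
```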

Code quality assessment using transformers

  • paper_url: http://arxiv.org/abs/2309.09264
  • repo_url: None
  • paper_authors: Mosleh Mahamud, Isak Samsten
  • for: Automatically grading the correctness of programming assignments is straightforward with unit and integration tests, but programming tasks can be solved in many ways, and many correct solutions are still inelegant (excessive branching, poor naming, repetitiveness); such subjective qualities of code are hard to assess automatically with current techniques.
  • methods: CodeBERT is used to automatically assign quality scores to Java code; different models and training paradigms are compared on a novel code quality assessment dataset, and the predictions are inspected with saliency maps (a sketch of a CodeBERT scorer follows below).
  • results: Code quality is predictable to some extent, and transformer-based models with task-adapted pre-training solve the task more efficiently than other techniques.
    Abstract Automatically evaluate the correctness of programming assignments is rather straightforward using unit and integration tests. However, programming tasks can be solved in multiple ways, many of which, although correct, are inelegant. For instance, excessive branching, poor naming or repetitiveness make the code hard to understand and maintain. These subjective qualities of code are hard to automatically assess using current techniques. In this work we investigate the use of CodeBERT to automatically assign quality score to Java code. We experiment with different models and training paradigms. We explore the accuracy of the models on a novel dataset for code quality assessment. Finally, we assess the quality of the predictions using saliency maps. We find that code quality to some extent is predictable and that transformer based models using task adapted pre-training can solve the task more efficiently than other techniques.
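A minimal sketch of regressing a quality score from CodeBERT representations. The pooling, the single linear head, and the fine-tuning recipe are assumptions; the paper's task-adapted pre-training and exact head are not reproduced.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CodeQualityScorer(nn.Module):
    """CodeBERT encoder with a regression head producing one quality score per snippet."""
    def __init__(self, checkpoint: str = "microsoft/codebert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # representation of the first token
        return self.head(cls).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = CodeQualityScorer()
batch = tokenizer(["int add(int a, int b) { return a + b; }"],
                  return_tensors="pt", truncation=True)
print(model(batch["input_ids"], batch["attention_mask"]))  # untrained score, for shape only
```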

A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

  • paper_url: http://arxiv.org/abs/2309.09198
  • repo_url: None
  • paper_authors: Yi Chen, Haiyun Jiang, Wei Bi, Rui Wang, Longyue Wang, Shuming Shi, Ruifeng Xu
  • for: This paper presents a new task called Text Expansion (TE), which aims to insert fine-grained modifiers into plain text to make it more concrete and vivid.
  • methods: The authors use four complementary approaches to construct a dataset of 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. They also design various metrics to evaluate the effectiveness of the expansions.
  • results: The proposed Locate&Infill models demonstrate superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research.
    Abstract This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations of the plain text to concretize or vivify human writings. Different from existing insertion-based writing assistance tasks, TE requires the model to be more flexible in both locating and generation, and also more cautious in keeping basic semantics. We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. To facilitate automatic evaluation, we design various metrics from multiple perspectives. In particular, we propose Info-Gain to effectively measure the informativeness of expansions, which is an important quality dimension in TE. On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate the superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research toward better automatic text expansion.

Detecting covariate drift in text data using document embeddings and dimensionality reduction

  • paper_url: http://arxiv.org/abs/2309.10000
  • repo_url: https://github.com/vinayaksodar/nlp_drift_paper_code
  • paper_authors: Vinayak Sodar, Ankit Sekseria
  • for: Detecting covariate drift in text data in order to maintain the reliability and performance of text analysis models.
  • methods: Three document embeddings are explored: TF-IDF with LSA for dimensionality reduction, Doc2Vec, and BERT embeddings, with and without PCA for dimensionality reduction; the divergence between training and test distributions is quantified with the Kolmogorov-Smirnov (KS) statistic and the Maximum Mean Discrepancy (MMD) test (a sketch of the two tests follows below).
  • results: Certain combinations of embeddings, dimensionality reduction techniques, and drift detection methods outperform others at detecting covariate drift, providing insights toward more reliable text analysis models.
    Abstract Detecting covariate drift in text data is essential for maintaining the reliability and performance of text analysis models. In this research, we investigate the effectiveness of different document embeddings, dimensionality reduction techniques, and drift detection methods for identifying covariate drift in text data. We explore three popular document embeddings: term frequency-inverse document frequency (TF-IDF) using Latent semantic analysis(LSA) for dimentionality reduction and Doc2Vec, and BERT embeddings, with and without using principal component analysis (PCA) for dimensionality reduction. To quantify the divergence between training and test data distributions, we employ the Kolmogorov-Smirnov (KS) statistic and the Maximum Mean Discrepancy (MMD) test as drift detection methods. Experimental results demonstrate that certain combinations of embeddings, dimensionality reduction techniques, and drift detection methods outperform others in detecting covariate drift. Our findings contribute to the advancement of reliable text analysis models by providing insights into effective approaches for addressing covariate drift in text data.
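A sketch of the two drift tests applied to document embeddings (e.g., TF-IDF+LSA, Doc2Vec, or BERT vectors, optionally PCA-reduced). The dimension-wise KS aggregation with a Bonferroni correction and the RBF-kernel MMD estimator below are standard choices assumed for illustration; the paper's exact test configuration may differ.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(train_emb: np.ndarray, test_emb: np.ndarray, alpha: float = 0.05) -> bool:
    """Two-sample KS test per embedding dimension; flag drift if any dimension
    rejects the null after a Bonferroni correction."""
    d = train_emb.shape[1]
    p_values = [ks_2samp(train_emb[:, i], test_emb[:, i]).pvalue for i in range(d)]
    return min(p_values) < alpha / d

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    def kernel(a, b):
        sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                    + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T)
        return np.exp(-gamma * sq_dists)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Example on synthetic 2-D "embeddings": a mean shift should register as drift.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))
test = rng.normal(0.5, 1.0, size=(500, 2))
print(ks_drift(train, test), mmd_rbf(train, test))
```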