cs.CL - 2023-11-05

Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context

  • paper_url: http://arxiv.org/abs/2311.02777
  • repo_url: None
  • paper_authors: Michael Ginn, Alexis Palmer
  • for: This paper investigates the ability of morpheme labeling models to generalize, especially in resource-constrained settings.
  • methods: The paper uses weight decay optimization, output denoising, and iterative pseudo-labeling to close the gap between in-distribution and out-of-distribution performance (a sketch of the pseudo-labeling loop follows the abstract).
  • results: Experiments show that these strategies yield a 2% improvement on a test set containing texts from unseen genres.
    Abstract Generalization is of particular importance in resource-constrained settings, where the available training data may represent only a small fraction of the distribution of possible texts. We investigate the ability of morpheme labeling models to generalize by evaluating their performance on unseen genres of text, and we experiment with strategies for closing the gap between performance on in-distribution and out-of-distribution data. Specifically, we use weight decay optimization, output denoising, and iterative pseudo-labeling, and achieve a 2% improvement on a test set containing texts from unseen genres. All experiments are performed using texts written in the Mayan language Uspanteko.
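
As a rough illustration of the training strategies named above, the following minimal sketch combines weight-decay regularization (via AdamW) with an iterative pseudo-labeling loop. The model, dummy data, and confidence threshold are placeholder assumptions, and the paper's output-denoising step is not shown.

```python
# Minimal sketch (not the authors' code): iterative pseudo-labeling with
# weight-decay regularization for a token-labeling model.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_labels, feat_dim = 5, 32
model = nn.Linear(feat_dim, num_labels)              # stand-in glossing model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

labeled_x = torch.randn(64, feat_dim)                # dummy labeled features
labeled_y = torch.randint(0, num_labels, (64,))
unlabeled_x = torch.randn(128, feat_dim)             # dummy unlabeled data (new genre)

def train_epoch(x, y):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

for round_ in range(3):                              # iterative pseudo-labeling
    for _ in range(10):
        train_epoch(labeled_x, labeled_y)
    with torch.no_grad():
        probs = model(unlabeled_x).softmax(-1)
        conf, pseudo_y = probs.max(-1)
    keep = conf > 0.9                                # keep only confident pseudo-labels
    labeled_x = torch.cat([labeled_x, unlabeled_x[keep]])
    labeled_y = torch.cat([labeled_y, pseudo_y[keep]])
    unlabeled_x = unlabeled_x[~keep]
```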

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

  • paper_url: http://arxiv.org/abs/2311.02772
  • repo_url: None
  • paper_authors: Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel
  • for: This paper shows that a simple self-supervised pre-trained audio model can achieve inference efficiency comparable to more complicated pre-trained models.
  • methods: The paper studies speech transformer encoders that mix convolutional modules with self-attention modules, which achieve state-of-the-art ASR performance with top efficiency.
  • results: Employing these speech transformer encoders significantly improves the efficiency of pre-trained audio models, but comparable efficiency can be achieved with advanced self-attention alone; low-bit weight quantization improves efficiency further (see the sketch after the abstract).
    Abstract In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing convolutional modules with self-attention modules. They achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that we can achieve comparable efficiency with advanced self-attention solely. We demonstrate that this simpler approach is particularly beneficial with a low-bit weight quantization technique of a neural network to improve efficiency. We hypothesize that it prevents propagating the errors between different quantized modules compared to recent speech transformers mixing quantized convolution and the quantized self-attention modules.
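
For readers unfamiliar with low-bit weight quantization, here is an illustrative sketch (a generic, assumed example, not the paper's setup) of symmetric per-tensor fake quantization of a linear layer's weights.

```python
# Illustrative sketch: snap weights onto a signed low-bit integer grid.
import torch
import torch.nn as nn

def quantize_weights(layer: nn.Linear, bits: int = 4) -> None:
    """Fake-quantize weights in place with a symmetric per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    w = layer.weight.data
    scale = w.abs().max() / qmax                     # per-tensor scale
    layer.weight.data = torch.round(w / scale).clamp(-qmax, qmax) * scale

layer = nn.Linear(256, 256)
quantize_weights(layer, bits=4)                      # 4-bit weight grid
```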

Pyclipse, a library for deidentification of free-text clinical notes

  • paper_url: http://arxiv.org/abs/2311.02748
  • repo_url: None
  • paper_authors: Callandra Moore, Jonathan Ranisau, Walter Nelson, Jeremy Petch, Alistair Johnson
  • for: Automated deidentification of clinical text data is crucial due to the high cost of manual deidentification, which has been a barrier to sharing clinical text and the advancement of clinical natural language processing.
  • methods: The pyclipse framework is proposed to address the challenges of creating effective automated deidentification tools, including issues in reproducibility due to differences in text processing, evaluation methods, and a lack of consistency across clinical domains and institutions.
  • results: The pyclipse framework is demonstrated to be a unified and configurable evaluation procedure that can streamline the comparison of deidentification algorithms, and it is found that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset.
    Abstract Automated deidentification of clinical text data is crucial due to the high cost of manual deidentification, which has been a barrier to sharing clinical text and the advancement of clinical natural language processing. However, creating effective automated deidentification tools faces several challenges, including issues in reproducibility due to differences in text processing, evaluation methods, and a lack of consistency across clinical domains and institutions. To address these challenges, we propose the pyclipse framework, a unified and configurable evaluation procedure to streamline the comparison of deidentification algorithms. Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data, allowing for context-specific evaluation. To demonstrate the utility of pyclipse, we compare six deidentification algorithms across four public and two private clinical text datasets. We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset. These discrepancies highlight the complexity of accurately assessing and comparing deidentification algorithms, emphasizing the need for a reproducible, adjustable, and extensible framework like pyclipse. Our framework lays the foundation for a unified approach to evaluate and improve deidentification tools, ultimately enhancing patient protection in clinical natural language processing.
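
To make the idea of a unified, configurable evaluation procedure concrete, here is a hypothetical sketch of such a harness. The class and function names are illustrative assumptions and are not pyclipse's actual API.

```python
# Hypothetical harness: run one deidentification algorithm on local notes
# and score predicted PHI spans against gold annotations.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # character offsets of PHI in a note

@dataclass
class EvalResult:
    precision: float
    recall: float

def evaluate(deidentify: Callable[[str], List[Span]],
             notes: List[str],
             gold: List[List[Span]]) -> EvalResult:
    tp = fp = fn = 0
    for note, gold_spans in zip(notes, gold):
        pred = set(deidentify(note))
        gold_set = set(gold_spans)
        tp += len(pred & gold_set)
        fp += len(pred - gold_set)
        fn += len(gold_set - pred)
    return EvalResult(precision=tp / max(tp + fp, 1),
                      recall=tp / max(tp + fn, 1))
```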

Nepali Video Captioning using CNN-RNN Architecture

  • paper_url: http://arxiv.org/abs/2311.02699
  • repo_url: None
  • paper_authors: Bipesh Subedi, Saugat Singh, Bal Krishna Bal
  • for: This study develops a deep neural network based Nepali video captioning system that generates precise and contextually relevant captions for Nepali videos.
  • methods: The study pairs pre-trained CNNs with RNN decoders, covering dataset collection, preprocessing, model implementation, and evaluation; the MSVD dataset is enriched with Nepali captions via Google Translate, and various CNN-RNN architectures are then trained (a simplified model sketch follows the abstract).
  • results: The best model, EfficientNetB0 + BiLSTM with 1024 hidden dimensions, achieves a BLEU-4 score of 17 and a METEOR score of 46; the study also outlines challenges and future directions for Nepali video captioning.
    Abstract This article presents a study on Nepali video captioning using deep neural networks. Through the integration of pre-trained CNNs and RNNs, the research focuses on generating precise and contextually relevant captions for Nepali videos. The approach involves dataset collection, data preprocessing, model implementation, and evaluation. By enriching the MSVD dataset with Nepali captions via Google Translate, the study trains various CNN-RNN architectures. The research explores the effectiveness of CNNs (e.g., EfficientNetB0, ResNet101, VGG16) paired with different RNN decoders like LSTM, GRU, and BiLSTM. Evaluation involves BLEU and METEOR metrics, with the best model being EfficientNetB0 + BiLSTM with 1024 hidden dimensions, achieving a BLEU-4 score of 17 and METEOR score of 46. The article also outlines challenges and future directions for advancing Nepali video captioning, offering a crucial resource for further research in this area.
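
Below is a simplified, assumed sketch of a CNN-RNN captioner in the spirit of the architectures compared in the paper: EfficientNetB0 frame features are mean-pooled and condition an LSTM decoder over caption tokens. The vocabulary size, dimensions, and unidirectional decoder are placeholder choices, not the authors' exact configuration.

```python
# Sketch of a CNN encoder + RNN decoder for video captioning.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size=8000, hidden=1024):
        super().__init__()
        cnn = efficientnet_b0(weights=None)           # use pre-trained weights in practice
        self.cnn = nn.Sequential(cnn.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(1280, hidden)         # 1280 = EfficientNetB0 feature dim
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, 224, 224); captions: (B, L) token ids
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1).mean(1)
        h0 = self.init_h(feats).unsqueeze(0)
        out, _ = self.rnn(self.embed(captions), (h0, torch.zeros_like(h0)))
        return self.out(out)                          # next-token logits

model = VideoCaptioner()
logits = model(torch.randn(2, 4, 3, 224, 224), torch.randint(0, 8000, (2, 10)))
```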

LLM-enhanced Self-training for Cross-domain Constituency Parsing

  • paper_url: http://arxiv.org/abs/2311.02660
  • repo_url: None
  • paper_authors: Jianling Li, Meishan Zhang, Peiming Guo, Min Zhang, Yue Zhang
  • for: This study explores the application of self-training to cross-domain tasks, specifically cross-domain constituency parsing.
  • methods: The study proposes enhancing self-training with a large language model (LLM) that iteratively generates domain-specific raw corpora, guided by grammar rules, together with criteria for selecting pseudo instances (see the sketch after the abstract).
  • results: Experiments show that LLM-enhanced self-training for constituency parsing outperforms traditional methods regardless of the LLM's performance, and that combining grammar rules with confidence criteria for pseudo-instance selection yields the highest cross-domain parsing performance.
    Abstract Self-training has proven to be an effective approach for cross-domain tasks, and in this study, we explore its application to cross-domain constituency parsing. Traditional self-training methods rely on limited and potentially low-quality raw corpora. To overcome this limitation, we propose enhancing self-training with the large language model (LLM) to generate domain-specific raw corpora iteratively. For the constituency parsing, we introduce grammar rules that guide the LLM in generating raw corpora and establish criteria for selecting pseudo instances. Our experimental results demonstrate that self-training for constituency parsing, equipped with an LLM, outperforms traditional methods regardless of the LLM's performance. Moreover, the combination of grammar rules and confidence criteria for pseudo-data selection yields the highest performance in the cross-domain constituency parsing.
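
A schematic sketch of the overall loop is given below; the callables (parser trainer, LLM generator) and the confidence threshold are hypothetical stand-ins, and the grammar-rule guidance is abstracted into the generation function.

```python
# Schematic LLM-enhanced self-training loop: each round, an LLM generates
# domain-specific raw sentences, the current parser labels them, and only
# confident parses are kept as pseudo training data.
from typing import Callable, List, Tuple

def llm_self_training(train_parser: Callable[[List[Tuple[str, str]]], Callable],
                      generate_with_llm: Callable[[int], List[str]],
                      seed_data: List[Tuple[str, str]],
                      rounds: int = 3,
                      threshold: float = 0.9):
    data = list(seed_data)
    parser = train_parser(data)
    for _ in range(rounds):
        raw = generate_with_llm(1000)                # domain-specific raw corpus
        pseudo = []
        for sent in raw:
            tree, conf = parser(sent)                # parse + confidence score
            if conf >= threshold:                    # pseudo-instance selection
                pseudo.append((sent, tree))
        data += pseudo
        parser = train_parser(data)                  # retrain on expanded data
    return parser
```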

Divide & Conquer for Entailment-aware Multi-hop Evidence Retrieval

  • paper_url: http://arxiv.org/abs/2311.02616
  • repo_url: None
  • paper_authors: Fan Luo, Mihai Surdeanu
  • for: Answering multi-hop questions by retrieving evidences that are semantically equivalent or entailed by the question.
  • methods: Divide the task into two sub-tasks: semantic textual similarity retrieval and inference similarity retrieval, and use two ensemble models (EAR and EARnest) to jointly re-rank sentences with consideration of diverse relevance signals.
  • results: Significantly outperform all single retrieval models and two ensemble baseline models on HotpotQA, and more effective in retrieving relevant evidences for multi-hop questions.
    Abstract Lexical and semantic matches are commonly used as relevance measurements for information retrieval. Together they estimate the semantic equivalence between the query and the candidates. However, semantic equivalence is not the only relevance signal that needs to be considered when retrieving evidences for multi-hop questions. In this work, we demonstrate that textual entailment relation is another important relevance dimension that should be considered. To retrieve evidences that are either semantically equivalent to or entailed by the question simultaneously, we divide the task of evidence retrieval for multi-hop question answering (QA) into two sub-tasks, i.e., semantic textual similarity and inference similarity retrieval. We propose two ensemble models, EAR and EARnest, which tackle each of the sub-tasks separately and then jointly re-rank sentences with the consideration of the diverse relevance signals. Experimental results on HotpotQA verify that our models not only significantly outperform all the single retrieval models they are based on, but are also more effective than two intuitive ensemble baseline models.
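
As a rough illustration of combining the two relevance signals, the sketch below re-ranks candidate sentences with a weighted sum of a semantic-similarity score and an entailment score. The scoring callables and the mixing weight are placeholders you would back with an embedding model and an NLI model; this is not the EAR/EARnest architecture itself.

```python
# Re-rank candidates by mixing semantic-similarity and entailment scores.
from typing import Callable, List

def rerank(question: str,
           candidates: List[str],
           sim_score: Callable[[str, str], float],     # semantic textual similarity
           entail_score: Callable[[str, str], float],  # entailment/inference similarity
           alpha: float = 0.5) -> List[str]:
    scored = [(alpha * sim_score(question, c) + (1 - alpha) * entail_score(question, c), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]
```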

mahaNLP: A Marathi Natural Language Processing Library

  • paper_url: http://arxiv.org/abs/2311.02579
  • repo_url: https://github.com/l3cube-pune/MarathiNLP
  • paper_authors: Vidula Magdum, Omkar Dhekane, Sharayu Hiwarkhedkar, Saloni Mittal, Raviraj Joshi
  • for: This work presents an open-source natural language processing (NLP) library built specifically to support the low-resource Indian language Marathi.
  • methods: The toolkit is built on state-of-the-art MahaBERT-based transformer models and offers an easy-to-use, extensible, and modular suite for Marathi text analysis (see the sketch after the abstract).
  • results: It covers a comprehensive set of NLP tasks, from basic preprocessing to advanced tasks such as sentiment analysis, named entity recognition, hate speech detection, and sentence completion.
    Abstract We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language. It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP. It is an easy-to-use, extensible, and modular toolkit for Marathi text analysis built on state-of-the-art MahaBERT-based transformer models. Our work holds significant importance as other existing Indic NLP libraries provide basic Marathi processing support and rely on older models with restricted performance. Our toolkit stands out by offering a comprehensive array of NLP tasks, encompassing both fundamental preprocessing tasks and advanced NLP tasks like sentiment analysis, NER, hate speech detection, and sentence completion. This paper focuses on an overview of the mahaNLP framework, its features, and its usage. This work is a part of the L3Cube MahaNLP initiative, more information about it can be found at https://github.com/l3cube-pune/MarathiNLP .
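
mahaNLP wraps MahaBERT-based models, so a comparable bare-bones setup with Hugging Face transformers looks like the hedged sketch below. The checkpoint id is an assumption; verify the exact model name on the L3Cube hub / repo before use, and see the GitHub link in the abstract for the library's own interface.

```python
# Hedged sketch: masked-word completion with an assumed MahaBERT checkpoint.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

checkpoint = "l3cube-pune/marathi-bert-v2"           # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("पुणे हे एक [MASK] शहर आहे.")[:3])        # top completions for the masked token
```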

Temporal Sequencing of Documents

  • paper_url: http://arxiv.org/abs/2311.02578
  • repo_url: None
  • paper_authors: Michael Gervers, Gelila Tilahun
  • for: This paper presents an unsupervised method for placing sets of historical documents in temporal order.
  • methods: The method captures gradual change in word usage via a bandwidth estimate for non-parametric Generalized Linear Models (Fan, Heckman, and Wand, 1995) and uses the Simulated Annealing algorithm to search the large space of possible orderings (a sketch follows the abstract).
  • results: The method significantly improves the temporal sequencing of both corpora, the medieval English property transfer documents (DEEDS) and the American State of the Union Addresses, relative to a randomly sequenced baseline.
    Abstract We outline an unsupervised method for temporal rank ordering of sets of historical documents, namely American State of the Union Addresses and DEEDS, a corpus of medieval English property transfer documents. Our method relies upon effectively capturing the gradual change in word usage via a bandwidth estimate for the non-parametric Generalized Linear Models (Fan, Heckman, and Wand, 1995). The number of possible rank orders needed to search through possible cost functions related to the bandwidth can be quite large, even for a small set of documents. We tackle this problem of combinatorial optimization using the Simulated Annealing algorithm, which allows us to obtain the optimal document temporal orders. Our rank ordering method significantly improved the temporal sequencing of both corpora compared to a randomly sequenced baseline. This unsupervised approach should enable the temporal ordering of undated document sets.
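
The combinatorial search can be illustrated with a minimal simulated-annealing sketch over document permutations. The cost function here is a placeholder, whereas the paper's cost is derived from the bandwidth estimate for the non-parametric GLM.

```python
# Simulated annealing over document orderings with a swap proposal.
import math
import random

def anneal(docs, cost, steps=10000, t0=1.0, cooling=0.999):
    order = list(range(len(docs)))
    best, best_cost = order[:], cost(order, docs)
    cur_cost, t = best_cost, t0
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)   # propose swapping two positions
        order[i], order[j] = order[j], order[i]
        new_cost = cost(order, docs)
        if new_cost < cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost                      # accept (possibly uphill) move
            if cur_cost < best_cost:
                best, best_cost = order[:], cur_cost
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
        t *= cooling                                 # cool the temperature
    return best
```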

BanMANI: A Dataset to Identify Manipulated Social Media News in Bangla

  • paper_url: http://arxiv.org/abs/2311.02570
  • repo_url: https://github.com/kamruzzaman15/banmani
  • paper_authors: Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim
  • for: This work targets the identification of specific claims in social media news that falsely manipulate a related news article, in Bangla.
  • methods: The paper describes a dataset collection method that works around the limitations of currently available NLP tools for Bangla.
  • results: Existing LLMs are found to struggle with this task under both zero-shot and fine-tuned settings.
    Abstract Initial work has been done to address fake news detection and misrepresentation of news in the Bengali language. However, no work in Bengali yet addresses the identification of specific claims in social media news that falsely manipulates a related news article. At this point, this problem has been tackled in English and a few other languages, but not in the Bengali language. In this paper, we curate a dataset of social media content labeled with information manipulation relative to reference articles, called BanMANI. The dataset collection method we describe works around the limitations of the available NLP tools in Bangla. We expect these techniques will carry over to building similar datasets in other low-resource languages. BanMANI forms the basis both for evaluating the capabilities of existing NLP systems and for training or fine-tuning new models specifically on this task. In our analysis, we find that this task challenges current LLMs both under zero-shot and fine-tuned settings.

Topic model based on co-occurrence word networks for unbalanced short text datasets

  • paper_url: http://arxiv.org/abs/2311.02566
  • repo_url: None
  • paper_authors: Chengjie Ma, Junping Du, Meiyu Liang, Zeli Guan
  • for: Detecting scarce (low-frequency) topics in unbalanced short-text datasets.
  • methods: A topic model based on co-occurrence word networks (CWUTM) that captures each word's topic distribution, redefines the calculation of node activity, and uses Gibbs sampling (a sketch of the co-occurrence network follows the abstract).
  • results: Outperforms baseline approaches at discovering scarce topics in unbalanced short-text datasets and supports early, accurate detection of emerging topics on social platforms.
    Abstract We propose a straightforward solution for detecting scarce topics in unbalanced short-text datasets. Our approach, named CWUTM (a topic model based on co-occurrence word networks for unbalanced short-text datasets), addresses the challenge of sparse and unbalanced short-text topics by mitigating the effects of incidental word co-occurrence, which allows the model to prioritize the identification of scarce (low-frequency) topics. Unlike previous methods, CWUTM leverages co-occurrence word networks to capture the topic distribution of each word, and we enhance the sensitivity in identifying scarce topics by redefining the calculation of node activity and normalizing the representation of both scarce and abundant topics to some extent. Moreover, CWUTM adopts Gibbs sampling, similar to LDA, making it easily adaptable to various application scenarios. Our extensive experimental validation on unbalanced short-text datasets demonstrates the superiority of CWUTM compared to baseline approaches in discovering scarce topics. According to the experimental results, the proposed model is effective in early and accurate detection of emerging topics or unexpected events on social platforms.
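
The sketch below illustrates the co-occurrence-word-network idea on toy data, using a normalized weighted degree as a simple stand-in for node activity; CWUTM's actual node-activity definition and Gibbs-sampling inference are not reproduced here.

```python
# Build a word co-occurrence network from short texts and score node activity.
from collections import Counter
from itertools import combinations

import networkx as nx

docs = [["festival", "flood", "rescue"], ["flood", "rescue", "volunteer"],
        ["festival", "music"], ["music", "concert", "festival"]]

edges = Counter()
for tokens in docs:
    for u, v in combinations(sorted(set(tokens)), 2):  # co-occurrence within a short text
        edges[(u, v)] += 1

g = nx.Graph()
for (u, v), w in edges.items():
    g.add_edge(u, v, weight=w)

degree = dict(g.degree(weight="weight"))
max_deg = max(degree.values())
activity = {w: d / max_deg for w, d in degree.items()}  # normalized node activity
print(sorted(activity.items(), key=lambda kv: -kv[1]))
```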

Relation Extraction Model Based on Semantic Enhancement Mechanism

  • paper_url: http://arxiv.org/abs/2311.02564
  • repo_url: None
  • paper_authors: Peiyu Liu, Junping Du, Yingxia Shao, Zeli Guan
  • for: Improving relation extraction for information extraction and addressing the triple-overlap problem.
  • methods: Building on the CasRel framework, the proposed CasAug model adds a semantic enhancement mechanism: possible subjects are pre-classified from their semantic encodings, a subject lexicon is used to compute semantic similarity and retrieve similar vocabulary, and an attention mechanism weights each word's contribution per relation before the enhanced semantics are passed to the object and relation extraction module (a toy sketch follows the abstract).
  • results: Compared with the baseline model, CasAug improves relation extraction, handles overlapping triples better, and extracts multiple relations more effectively.
    Abstract Relation extraction is one of the basic tasks of information extraction in natural language processing, and a core task in information extraction, natural language understanding, and information retrieval. None of the existing relation extraction methods can effectively solve the problem of triple overlap. The CasAug model proposed in this paper, based on the CasRel framework combined with a semantic enhancement mechanism, can solve this problem to a certain extent. CasAug enhances the semantics of the identified possible subjects: first, based on the semantic coding of possible subjects, it pre-classifies them; it then uses the subject lexicon to calculate semantic similarity and obtain vocabulary similar to the possible subjects. Based on this similar vocabulary, the contribution of each word under different relations is computed through an attention mechanism. Finally, the relation pre-classification results are used to weight the enhanced semantics of each relation, yielding the enhanced semantics of the possible subject, which are sent together with the possible subject to the object and relation extraction module to complete the final relation-triplet extraction. Experimental results show that, compared with the baseline model, the proposed CasAug model improves relation extraction, and its ability to deal with overlapping problems and extract multiple relations is also better than the baseline model, indicating that the proposed semantic enhancement mechanism can further reduce judgments of redundant relations and alleviate the problem of triple overlap.
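
As a toy illustration of the attention-based semantic enhancement (assumptions throughout: random embeddings, a fixed lexicon, and scaled dot-product attention; this is not the CasAug architecture), the sketch below scores lexicon words by cosine similarity to a candidate subject and attention-weights the most similar ones with respect to one relation.

```python
# Toy semantic-enhancement sketch: similar lexicon words are attended to
# per relation and added to the subject representation.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
subject_vec = rng.normal(size=dim)                   # encoding of one candidate subject
relation_vec = rng.normal(size=dim)                  # encoding of one relation type
lexicon = {w: rng.normal(size=dim) for w in ["company", "founder", "city", "river"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Keep the k lexicon words most similar to the subject, then attend to them
# with respect to the relation to weight their contributions.
k = 3
similar = sorted(lexicon, key=lambda w: cosine(subject_vec, lexicon[w]), reverse=True)[:k]
vecs = np.stack([lexicon[w] for w in similar])
scores = vecs @ relation_vec / np.sqrt(dim)          # scaled dot-product attention scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()
enhanced_subject = subject_vec + weights @ vecs      # enhanced subject semantics
print(dict(zip(similar, np.round(weights, 3))))
```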