cs.CL - 2023-08-18

ChatHaruhi: Reviving Anime Character in Reality via Large Language Model

  • paper_url: http://arxiv.org/abs/2308.09597
  • repo_url: https://github.com/LC1332/Chat-Haruhi-Suzumiya
  • paper_authors: Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi MI, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, Linkang Zhan, Yaokai Jia, Pingyu Wu, Haozhen Sun
  • for: This paper proposes an algorithm to control language models to mimic specific fictional characters, with the goal of improving role-playing ability.
  • methods: The algorithm uses an improved prompt and memories of the character extracted from scripts to control the language models.
  • results: Both automatic and human evaluations show that the proposed approach performs better than baselines in role-playing.
    Abstract Role-playing chatbots built on large language models have drawn interest, but better techniques are needed to enable mimicking specific fictional characters. We propose an algorithm that controls language models via an improved prompt and memories of the character extracted from scripts. We construct ChatHaruhi, a dataset covering 32 Chinese / English TV / anime characters with over 54k simulated dialogues. Both automatic and human evaluations show our approach improves role-playing ability over baselines. Code and data are available at https://github.com/LC1332/Chat-Haruhi-Suzumiya .

PUMGPT: A Large Vision-Language Model for Product Understanding

  • paper_url: http://arxiv.org/abs/2308.09568
  • repo_url: None
  • paper_authors: Shuhui Wu, Zengming Tang, Zongyi Guo, Weiwei Zhang, Baoliang Cui, Haihong Tang, Weiming Lu
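  • for: This study addresses the product understanding task, which aims to improve the online shopping experience. The task comprises a variety of sub-tasks that require a model to answer diverse queries based on multi-modal product information.
  • methods: The authors present PUMGPT, a large vision-language model that unifies all product understanding tasks under a single model structure. To bridge the gap between vision and text representations, they propose Layer-wise Adapters (LA), which provide better alignment with fewer visual tokens and enable parameter-efficient fine-tuning.
  • results: PUMGPT performs strongly across multiple product understanding tasks, including product captioning, category question-answering, attribute extraction, attribute question-answering, and free-form question-answering about products.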
    Abstract Recent developments in multi-modal large language models have demonstrated their strong ability in solving vision-language tasks. In this paper, we focus on the product understanding task, which plays an essential role in enhancing the online shopping experience. The product understanding task includes a variety of sub-tasks, which require models to respond to diverse queries based on multi-modal product information. Traditional methods design distinct model architectures for each sub-task. In contrast, we present PUMGPT, a large vision-language model that aims to unify all product understanding tasks under a single model structure. To bridge the gap between vision and text representations, we propose Layer-wise Adapters (LA), an approach that provides enhanced alignment with fewer visual tokens and enables parameter-efficient fine-tuning. Moreover, the inherent parameter-efficient fine-tuning ability allows PUMGPT to be readily adapted to new product understanding tasks and emerging products. We design instruction templates to generate diverse product instruction datasets. Simultaneously, we utilize open-domain datasets during training to improve the performance of PUMGPT and its generalization ability. Through extensive evaluations, PUMGPT demonstrates its superior performance across multiple product understanding tasks, including product captioning, category question-answering, attribute extraction, attribute question-answering, and even free-form question-answering about products.

Exploring Sampling Techniques for Generating Melodies with a Transformer Language Model

  • paper_url: http://arxiv.org/abs/2308.09454
  • repo_url: None
  • paper_authors: Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer
  • for: investigate the impact of different sampling techniques on musical qualities such as diversity and structure
  • methods: train a high-capacity transformer model on a vast collection of highly-structured Irish folk melodies, and analyze the musical qualities of the samples generated using distribution truncation sampling techniques
  • results: discover that probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
    Abstract Research in natural language processing has demonstrated that the quality of generations from trained autoregressive language models is significantly influenced by the used sampling strategy. In this study, we investigate the impact of different sampling techniques on musical qualities such as diversity and structure. To accomplish this, we train a high-capacity transformer model on a vast collection of highly-structured Irish folk melodies and analyze the musical qualities of the samples generated using distribution truncation sampling techniques. Specifically, we use nucleus sampling, the recently proposed "typical sampling", and conventional ancestral sampling. We evaluate the effect of these sampling strategies in two scenarios: optimal circumstances with a well-calibrated model and suboptimal circumstances where we systematically degrade the model's performance. We assess the generated samples using objective and subjective evaluations. We discover that probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
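    The difference between ancestral sampling and nucleus sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `nucleus_filter` and `sample`, the toy distribution, and the threshold `top_p=0.75` are assumptions chosen for illustration, and typical sampling is not covered here.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total mass
    reaches top_p, zero out the rest, and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(probs, rng):
    """Draw one token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token distribution over four tokens.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)

# Ancestral sampling draws from the unmodified distribution.
ancestral_token = sample(next_token_probs, rng)

# Nucleus sampling draws from the truncated, renormalized distribution.
truncated = nucleus_filter(next_token_probs, top_p=0.75)
nucleus_token = sample(truncated, rng)
```

    With this toy distribution, truncation removes the two least likely tokens and rescales the remaining two to roughly 0.625 and 0.375. Tokens in the tail can never be drawn after truncation, which is one way to read the reported trade-off: less diversity, but fewer low-probability tokens when the model is poorly calibrated.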

Scope is all you need: Transforming LLMs for HPC Code

  • paper_url: http://arxiv.org/abs/2308.09440
  • repo_url: https://github.com/scientific-computing-lab-nrcn/tokompiler
  • paper_authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren
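  • for: This paper questions whether large LLMs trained on natural languages and on programming languages unrelated to high-performance computing (HPC) are needed for HPC-specific tasks, and develops smaller, domain-specific LLMs instead.
  • methods: The paper proposes Tokompiler, a novel tokenizer designed for preprocessing code in HPC and compilation-centric tasks. Tokompiler uses knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while completely avoiding human semantics attributed to code structures. It is used to pre-train SPT-Code and Polycoder on a Fortran code corpus mined from GitHub.
  • results: Pre-training with Tokompiler significantly improves code completion accuracy and semantic understanding over traditional tokenizers in normalized-perplexity tests, reaching a perplexity score of about 1. The results open avenues for domain-specific LLMs that cater to the demands of particular domains.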
    Abstract With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while avoiding human semantics attributed to code structures completely. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, for a Fortran code corpus mined from GitHub. We evaluate the performance of these models against the conventional LLMs. Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers in normalized-perplexity tests, down to ~1 perplexity score. This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.
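    To make the "~1 perplexity score" concrete, the sketch below computes standard perplexity from the probabilities a model assigns to the correct tokens. It is an illustration only: the function name `perplexity` and the toy probabilities are assumptions, and the normalization the authors apply in their normalized-perplexity tests is not described in the abstract and is not reproduced here.

```python
import math


def perplexity(token_probs):
    """Perplexity is the exponential of the mean negative log-probability
    that the model assigns to each correct token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is certain of every correct token reaches the minimum of 1.
certain = perplexity([1.0, 1.0, 1.0])

# A model that assigns 0.5 to every correct token has perplexity 2,
# as if it were choosing uniformly between two options at each step.
coin_flip = perplexity([0.5, 0.5, 0.5])
```

    Perplexity cannot fall below 1, so a score close to 1 means the model predicts almost every token of the evaluation text with near certainty.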

A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages

  • paper_url: http://arxiv.org/abs/2308.09435
  • repo_url: None
  • paper_authors: Nikita Martynov, Mark Baushenko, Anastasia Kozlova, Katerina Kolomeytseva, Aleksandr Abramov, Alena Fenogenova
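  • for: This paper presents a methodology for generative spelling correction (SC), aimed at text editing tasks in which spelling errors and mistypings must be corrected.
  • methods: The authors study natural spelling errors and mistypings in texts and how emulating those errors in correct sentences can enrich the pre-training of generative models. They also investigate the models' abilities across different text domains.
  • results: Experiments cover various corruption strategies, model architectures, and model sizes, with evaluation on single-domain and multi-domain test sets. The authors also introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation), a library for automatic generative SC that includes a family of pre-trained generative models and built-in augmentation algorithms.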
    Abstract Modern large language models demonstrate impressive capabilities in text generation and generalization. However, they often struggle with solving text editing tasks, particularly when it comes to correcting spelling errors and mistypings. In this paper, we present a methodology for generative spelling correction (SC), which was tested on English and Russian and can potentially be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistypings in texts and studying the ways those errors can be emulated in correct sentences to effectively enrich generative models' pre-training procedure. We investigate the impact of such emulations and the models' abilities across different text domains. In this work, we investigate two spelling corruption techniques: 1) the first mimics human behavior when making a mistake by leveraging error statistics from a particular dataset, and 2) the second adds the most common spelling errors, keyboard misclicks, and some heuristics to the texts. We conducted experiments employing various corruption strategies, model architectures, and model sizes at the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation), a library for automatic generative SC that includes a family of pre-trained generative models and built-in augmentation algorithms.

Leveraging Large Language Models for DRL-Based Anti-Jamming Strategies in Zero Touch Networks

  • paper_url: http://arxiv.org/abs/2308.09376
  • repo_url: None
  • paper_authors: Abubakar S. Ali, Dimitrios Michael Manias, Abdallah Shami, Sami Muhaidat
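  • for: This paper explores Zero Touch Networks (ZTNs) in automated networking and the potential to improve network transparency and user interaction in automated processes.
  • methods: The paper uses large language models (LLMs) to bridge automated network processes and human-centric interfaces, with the aim of improving network transparency and user interaction.
  • results: Through a case study on a deep reinforcement learning (DRL)-based anti-jamming technique, the paper demonstrates that LLMs can distill intricate network operations into human-readable reports.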
    Abstract As the dawn of sixth-generation (6G) networking approaches, it promises unprecedented advancements in communication and automation. Among the leading innovations of 6G is the concept of Zero Touch Networks (ZTNs), aiming to achieve fully automated, self-optimizing networks with minimal human intervention. Despite the advantages ZTNs offer in terms of efficiency and scalability, challenges surrounding transparency, adaptability, and human trust remain prevalent. Concurrently, the advent of Large Language Models (LLMs) presents an opportunity to elevate the ZTN framework by bridging the gap between automated processes and human-centric interfaces. This paper explores the integration of LLMs into ZTNs, highlighting their potential to enhance network transparency and improve user interactions. Through a comprehensive case study on deep reinforcement learning (DRL)-based anti-jamming technique, we demonstrate how LLMs can distill intricate network operations into intuitive, human-readable reports. Additionally, we address the technical and ethical intricacies of melding LLMs with ZTNs, with an emphasis on data privacy, transparency, and bias reduction. Looking ahead, we identify emerging research avenues at the nexus of LLMs and ZTNs, advocating for sustained innovation and interdisciplinary synergy in the domain of automated networks.

TrOMR:Transformer-Based Polyphonic Optical Music Recognition

  • paper_url: http://arxiv.org/abs/2308.09370
  • repo_url: https://github.com/netease/polyphonic-tromr
  • paper_authors: Yixuan Li, Huaping Liu, Qiang Jin, Miaomiao Cai, Peng Li
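  • for: This paper studies Optical Music Recognition (OMR) and proposes a transformer-based approach for end-to-end polyphonic OMR, with the aim of improving recognition accuracy.
  • methods: The approach uses a transformer for end-to-end polyphonic recognition and introduces a novel consistency loss function and a reasonable data annotation approach to improve recognition accuracy on complex music scores.
  • results: Experiments show that TrOMR outperforms current OMR methods, especially in real-world scenarios. The authors also develop a TrOMR system and build a camera scene dataset of full-page music scores.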
    Abstract Optical Music Recognition (OMR) is an important technology in music and has been researched for a long time. Previous approaches for OMR are usually based on CNN for image understanding and RNN for music symbol classification. In this paper, we propose a transformer-based approach with excellent global perceptual capability for end-to-end polyphonic OMR, called TrOMR. We also introduce a novel consistency loss function and a reasonable approach for data annotation to improve recognition accuracy for complex music scores. Extensive experiments demonstrate that TrOMR outperforms current OMR methods, especially in real-world scenarios. We also develop a TrOMR system and build a camera scene dataset of full-page music scores captured in real-world settings. The code and datasets will be made available for reproducibility.

A tailored Handwritten-Text-Recognition System for Medieval Latin

  • paper_url: http://arxiv.org/abs/2308.09368
  • repo_url: None
  • paper_authors: Philipp Koch, Gilary Vera Nuñez, Esteban Garces Arias, Christian Heumann, Matthias Schöffel, Alexander Häberlin, Matthias Aßenmacher
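  • for: This work targets Handwritten Text Recognition (HTR) for the digitization of the Medieval Latin Dictionary.
  • methods: Two state-of-the-art (SOTA) image segmentation models prepare the data set, and a set of experiments explores combinations of transformer-based vision encoders with a GPT-2 decoder.
  • results: The result is a highly competitive model. The best-performing setup achieves a Character Error Rate (CER) of 0.015, surpassing the commercial Google Cloud Vision model, and shows more stable performance.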
    Abstract The Bavarian Academy of Sciences and Humanities aims to digitize its Medieval Latin Dictionary. This dictionary entails record cards referring to lemmas in medieval Latin, a low-resource language. A crucial step of the digitization process is the Handwritten Text Recognition (HTR) of the handwritten lemmas found on these record cards. In our work, we introduce an end-to-end pipeline, tailored to the medieval Latin dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art (SOTA) image segmentation models to prepare the initial data set for the HTR task. Furthermore, we experiment with different transformer-based models and conduct a set of experiments to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. Additionally, we also apply extensive data augmentation resulting in a highly competitive model. The best-performing setup achieved a Character Error Rate (CER) of 0.015, which is even superior to the commercial Google Cloud Vision model, and shows more stable performance.
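    The Character Error Rate reported above is conventionally the character-level edit distance between a reference transcription and the model output, divided by the reference length. The sketch below shows that standard definition. The function names and the example strings are assumptions for illustration and are not taken from the paper's pipeline.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: the minimum number of character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical lemma with one misrecognized character out of seven.
example = cer("dominus", "dominvs")
```

    Under this definition, a CER of 0.015 corresponds to about 1.5 character errors per 100 reference characters.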

Accelerated materials language processing enabled by GPT

  • paper_url: http://arxiv.org/abs/2308.09354
  • repo_url: None
  • paper_authors: Jaewoong Choi, Byungju Lee
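  • for: This study aims to make information extraction from materials science literature more efficient by using generative pretrained transformer (GPT)-enabled pipelines in place of complex model architectures.
  • methods: The study develops GPT-enabled methods for document classification, named entity recognition (NER), and extractive question answering (QA), in which strategic prompt engineering replaces the complex architectures of prior models.
  • results: The GPT-enabled methods reach accuracy and reliability comparable to prior models while needing only a small dataset. The approach is applicable to any materials science domain to accelerate information extraction from the literature.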
    Abstract Materials language processing (MLP) is one of the key facilitators of materials science research, as it enables the extraction of structured information from massive materials science literature. Prior works suggested high-performance MLP models for text classification, named entity recognition (NER), and extractive question answering (QA), which require complex model architecture, exhaustive fine-tuning and a large number of human-labelled datasets. In this study, we develop generative pretrained transformer (GPT)-enabled pipelines where the complex architectures of prior MLP models are replaced with strategic designs of prompt engineering. First, we develop a GPT-enabled document classification method for screening relevant documents, achieving comparable accuracy and reliability compared to prior models, with only a small dataset. Second, for the NER task, we design entity-centric prompts, and few-shot learning with them improved the performance on most entities in three open datasets. Finally, we develop a GPT-enabled extractive QA model, which provides improved performance and shows the possibility of automatically correcting annotations. While our findings confirm the potential of GPT-enabled MLP models as well as their value in terms of reliability and practicability, our scientific methods and systematic approach are applicable to any materials science domain to accelerate the information extraction of scientific literature.

Document Automation Architectures: Updated Survey in Light of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09341
  • repo_url: None
  • paper_authors: Mohammad Ahmadi Achachlouei, Omkar Patil, Tarun Joshi, Vijayan N. Nair
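  • for: This paper surveys the current state of the art in document automation (DA). Earlier reviews covered commercial DA solutions, particularly in the legal domain, but not the academic research on DA architectures and technologies.
  • methods: By reviewing the academic literature, the survey gives a clearer definition and characterization of DA and its features and identifies state-of-the-art DA architectures and technologies in academic research.
  • results: The survey offers ideas for new research opportunities in DA in light of recent advances in generative AI and large language models.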
    Abstract This paper surveys the current state of the art in document automation (DA). The objective of DA is to reduce the manual effort during the generation of documents by automatically creating and integrating input from different sources and assembling documents conforming to defined templates. There have been reviews of commercial solutions of DA, particularly in the legal domain, but to date there has been no comprehensive review of the academic research on DA architectures and technologies. The current survey of DA reviews the academic literature and provides a clearer definition and characterization of DA and its features, identifies state-of-the-art DA architectures and technologies in academic research, and provides ideas that can lead to new research opportunities within the DA field in light of recent advances in generative AI and large language models.

KESDT: knowledge enhanced shallow and deep Transformer for detecting adverse drug reactions

  • paper_url: http://arxiv.org/abs/2308.09329
  • repo_url: None
  • paper_authors: Yunzhi Qiu, Xiaokun Zhang, Weiwei Wang, Tongxuan Zhang, Bo Xu, Hongfei Lin
  • for: The paper aims to improve the detection of adverse drug reactions (ADRs) on social media platforms by proposing a novel model called Knowledge Enhanced Shallow and Deep Transformer (KESDT).
  • methods: The KESDT model incorporates domain keywords into the Transformer model through a shallow fusion manner and integrates synonym sets through a deep fusion manner to address the challenges of low annotated data and sample imbalance.
  • results: The proposed KESDT model outperforms state-of-the-art baselines on three public datasets (TwiMed, Twitter, and CADEC) in terms of F1 values, with relative improvements of 4.87%, 47.83%, and 5.73%, respectively.
    Abstract Adverse drug reaction (ADR) detection is an essential task in the medical field, as ADRs have a gravely detrimental impact on patients' health and the healthcare system. Due to a large number of people sharing information on social media platforms, an increasing number of efforts focus on social media data to carry out effective ADR detection. Despite having achieved impressive performance, the existing methods of ADR detection still suffer from three main challenges. Firstly, researchers have consistently ignored the interaction between domain keywords and other words in the sentence. Secondly, social media datasets suffer from the challenges of low annotated data. Thirdly, the issue of sample imbalance is commonly observed in social media datasets. To solve these challenges, we propose the Knowledge Enhanced Shallow and Deep Transformer(KESDT) model for ADR detection. Specifically, to cope with the first issue, we incorporate the domain keywords into the Transformer model through a shallow fusion manner, which enables the model to fully exploit the interactive relationships between domain keywords and other words in the sentence. To overcome the low annotated data, we integrate the synonym sets into the Transformer model through a deep fusion manner, which expands the size of the samples. To mitigate the impact of sample imbalance, we replace the standard cross entropy loss function with the focal loss function for effective model training. We conduct extensive experiments on three public datasets including TwiMed, Twitter, and CADEC. The proposed KESDT outperforms state-of-the-art baselines on F1 values, with relative improvements of 4.87%, 47.83%, and 5.73% respectively, which demonstrates the effectiveness of our proposed KESDT.
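    The abstract replaces cross entropy with focal loss to counter sample imbalance. The sketch below shows the standard focal loss for a single example and how it down-weights easy, well-classified samples. The function names and the focusing parameter `gamma=2.0` are assumptions for illustration: the abstract does not state the authors' settings, and the optional class-weighting factor is omitted.

```python
import math


def cross_entropy(p_true):
    """Cross entropy for one example, given the probability the model
    assigns to the correct class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross entropy scaled by (1 - p_true) ** gamma, so that
    confident, correct predictions contribute little to the total."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# An easy example (p_true = 0.9) keeps about 1% of its cross-entropy loss.
easy = focal_loss(0.9)

# A hard example (p_true = 0.1) keeps about 81% of its cross-entropy loss.
hard = focal_loss(0.1)
```

    With `gamma=0` the expression reduces to plain cross entropy. Larger values shift the training signal toward hard examples, which on an imbalanced dataset tend to come from the minority class.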

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

  • paper_url: http://arxiv.org/abs/2308.09311
  • repo_url: None
  • paper_authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
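  • for: This paper proposes a novel lip reading framework, especially for low-resource languages, which previous literature has not addressed well. Because low-resource languages lack enough video-text paired data for training, developing lip reading models for them is challenging.
  • methods: The authors learn general speech knowledge, the ability to model lip movements, from a high-resource language by predicting speech units. Because different languages partially share phonemes, this knowledge extends to other languages. They then propose the Language-specific Memory-augmented Decoder (LMDecoder) to learn language-specific knowledge. LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data. Input speech units are thus transformed into language-specific audio features and translated into text using the learned language knowledge.
  • results: Extensive experiments on five languages (English, Spanish, French, Italian, and Portuguese) evaluate the effectiveness of the proposed method.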
    Abstract This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated.

Differentiable Retrieval Augmentation via Generative Language Modeling for E-commerce Query Intent Classification

  • paper_url: http://arxiv.org/abs/2308.09308
  • repo_url: None
  • paper_authors: Chenyu Zhao, Yunjiang Jiang, Yiming Qiu, Han Zhang, Wen-Yun Yang
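  • for: This paper aims to improve downstream model performance on natural language processing (NLP) tasks, specifically query intent classification in e-commerce search.
  • methods: A knowledge retriever and an external corpus enhance the downstream model instead of an increase in model parameters. The proposed method, Dragan, uses a differentiable reformulation so that the two parts no longer have to be trained separately.
  • results: Experiments and an ablation study show that the proposed method significantly improves on state-of-the-art baselines for query intent classification, in both offline evaluation and an online A/B test.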
    Abstract Retrieval augmentation, which enhances downstream models by a knowledge retriever and an external corpus instead of by merely increasing the number of model parameters, has been successfully applied to many natural language processing (NLP) tasks such as text classification, question answering and so on. However, existing methods that train the retriever and downstream model separately or asynchronously, mainly because of the non-differentiability between the two parts, usually lead to degraded performance compared to end-to-end joint training. In this paper, we propose Differentiable Retrieval Augmentation via Generative lANguage modeling (Dragan), to address this problem by a novel differentiable reformulation. We demonstrate the effectiveness of our proposed method on a challenging NLP task in e-commerce search, namely query intent classification. Both the experimental results and ablation study show that the proposed method significantly and reasonably improves the state-of-the-art baselines on both offline evaluation and online A/B test.

Conversational Ontology Alignment with ChatGPT

  • paper_url: http://arxiv.org/abs/2308.09217
  • repo_url: None
  • paper_authors: Sanaz Saki Norouzi, Mohammad Saeid Mahdavinejad, Pascal Hitzler
  • for: This study evaluates the feasibility and efficiency of ChatGPT for ontology alignment using a naive approach.
  • methods: ChatGPT's output is compared with the results of the Ontology Alignment Evaluation Initiative 2022 campaign on conference track ontologies, to gain insight into the ontology matching capabilities of a conversational large language model used in a naive way.
  • results: The study finds that ChatGPT's output is similar in some respects to the Ontology Alignment Evaluation Initiative 2022 results but also differs from them. This suggests that a conversational large language model used naively has limited ontology matching ability but some potential advantages.
    Abstract This study evaluates the applicability and efficiency of ChatGPT for ontology alignment using a naive approach. ChatGPT's output is compared to the results of the Ontology Alignment Evaluation Initiative 2022 campaign using conference track ontologies. This comparison is intended to provide insights into the capabilities of a conversational large language model when used in a naive way for ontology matching, and to investigate the potential advantages and disadvantages of this approach.

A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports

  • paper_url: http://arxiv.org/abs/2308.09193
  • repo_url: https://github.com/av9ash/duplicatebugdetection
  • paper_authors: Avinash Patil, Kihwan Han, Sabyasachi Mukhopadhyay
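  • for: This study compares text similarity methods for bug report retrieval, with the goal of making the retrieval of similar bug reports more effective.
  • methods: The study compares the TF-IDF (baseline), FastText, Gensim, BERT, and ADA embedding models.
  • results: BERT performs best on recall, followed by ADA, then Gensim, FastText, and TF-IDF.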
    Abstract Bug reports are an essential aspect of software development, and it is crucial to identify and resolve them quickly to ensure the consistent functioning of software systems. Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs. In this paper, we compared the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score. We explored several embedding models such as TF-IDF (Baseline), FastText, Gensim, BERT, and ADA. We used the Software Defects Data containing bug reports for various software projects to evaluate the performance of these models. Our experimental results showed that BERT generally outperformed the rest of the models regarding recall, followed by ADA, Gensim, FastText, and TFIDF. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task. Our code is available on GitHub.
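    Summary Bug reports are an essential part of software development, and identifying and resolving them quickly keeps software systems functioning consistently. Retrieving similar bug reports from an existing database can reduce the time and effort needed to resolve bugs. This paper compares the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score. The embedding models examined are TF-IDF (baseline), FastText, Gensim, BERT, and ADA, evaluated on bug reports from the Software Defects Data. In the experiments BERT generally achieves the best recall, followed by ADA, Gensim, FastText, and TF-IDF. The study offers a view of how different embedding methods perform at retrieving similar bug reports and highlights the importance of choosing an appropriate one for this task. The code is available on GitHub.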
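
The retrieval setup described in this entry can be illustrated with the TF-IDF baseline alone. The sketch below is a minimal, standard-library illustration and not the authors' code: the tokenizer, the smoothed IDF formula, the sample reports, and the "hit if any known duplicate appears in the top k" definition of recall are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_idx, vectors, k):
    """Indices of the k reports most similar to the query report."""
    scores = [(cosine(vectors[query_idx], vec), i)
              for i, vec in enumerate(vectors) if i != query_idx]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores[:k]]


def recall_at_k(duplicates, vectors, k):
    """Fraction of queries with at least one known duplicate in the top k."""
    hits = sum(1 for q, dups in duplicates.items() if dups & set(top_k(q, vectors, k)))
    return hits / len(duplicates)


# Hypothetical reports; indices 0/1 and 2/3 are treated as duplicate pairs.
reports = [
    "app crashes when saving file to disk",
    "crash on saving a file to disk",
    "login button does not respond on mobile",
    "mobile login button unresponsive",
    "dark mode colors are wrong in settings",
]
vectors = tfidf_vectors(reports)
duplicate_pairs = {0: {1}, 2: {3}}
print(recall_at_k(duplicate_pairs, vectors, k=1))
```

Swapping `tfidf_vectors` for a function that returns dense embeddings (with a matching `cosine`) would leave the ranking and recall logic unchanged, which is what makes this kind of side-by-side model comparison straightforward.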

Is Argument Structure of Learner Chinese Understandable: A Corpus-Based Analysis

  • paper_url: http://arxiv.org/abs/2308.09186
  • repo_url: None
  • paper_authors: Yuguang Duan, Zi Lin, Weiwei Sun
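  • for: This paper analyzes argument structure errors in learner Chinese.
  • methods: The paper uses sentences produced by language learners together with their corrections by native speakers, coupled with semantic role labeling annotations.
  • results: The paper identifies argument structure errors in learner Chinese, including errors of word order, word selection, lack of proposition, and argument-adjunct confounding.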
    Abstract This paper presents a corpus-based analysis of argument structure errors in learner Chinese. The data for analysis includes sentences produced by language learners as well as their corrections by native speakers. We couple the data with semantic role labeling annotations that are manually created by two senior students whose majors are both Applied Linguistics. The annotation procedure is guided by the Chinese PropBank specification, which is originally developed to cover first language phenomena. Nevertheless, we find that it is quite comprehensive for handling second language phenomena. The inter-annotator agreement is rather high, suggesting the understandability of learner texts to native speakers. Based on our annotations, we present a preliminary analysis of competence errors related to argument structure. In particular, speech errors related to word order, word selection, lack of proposition, and argument-adjunct confounding are discussed.
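    Summary This paper presents a corpus-based analysis of argument structure errors in learner Chinese. The data comprise sentences produced by learners and their corrections by native speakers, coupled with manually created semantic role labeling annotations that follow the Chinese PropBank specification. That specification was originally designed for first language phenomena, yet it proves quite comprehensive for second language phenomena as well. Inter-annotator agreement is fairly high, suggesting that learner texts are understandable to native speakers. Based on the annotations, the paper gives a preliminary analysis of errors involving word order, word selection, lack of proposition, and argument-adjunct confounding.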

ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse

  • paper_url: http://arxiv.org/abs/2308.09158
  • repo_url: https://github.com/zhangyikaii/lamda-zhijian
  • paper_authors: Yi-Kai Zhang, Lu Ren, Chao Yi, Qi-Wei Wang, De-Chuan Zhan, Han-Jia Ye
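  • for: This work presents ZhiJian, a model reuse toolbox intended for practical use that unifies a range of model reuse methods on an easy-to-use PyTorch backend.
  • methods: Built on PyTorch, ZhiJian unifies three views of reusing a pre-trained model (PTM): target architecture construction with PTM, tuning the target model with PTM, and PTM-based inference.
  • results: ZhiJian lets deep learning practitioners explore downstream tasks and identify the complementary advantages of different reuse methods; the toolbox is publicly available on GitHub.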
    Abstract The rapid expansion of foundation pre-trained models and their fine-tuned counterparts has significantly contributed to the advancement of machine learning. Leveraging pre-trained models to extract knowledge and expedite learning in real-world tasks, known as "Model Reuse", has become crucial in various applications. Previous research focuses on reusing models within a certain aspect, including reusing model weights, structures, and hypothesis spaces. This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend. ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with PTM, tuning target model with PTM, and PTM-based inference. This empowers deep learning practitioners to explore downstream tasks and identify the complementary advantages among different methods. ZhiJian is readily accessible at https://github.com/zhangyikaii/lamda-zhijian facilitating seamless utilization of pre-trained models and streamlining the model reuse process for researchers and developers.
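    Summary The rapid expansion of foundation pre-trained models and their fine-tuned counterparts has contributed significantly to progress in machine learning. Using pre-trained models to extract knowledge and speed up learning in real-world tasks, known as "Model Reuse", has become important in many applications. Earlier research focuses on particular aspects of model reuse, including the reuse of model weights, structures, and hypothesis spaces. This paper introduces ZhiJian, a comprehensive and user-friendly model reuse toolbox built on the PyTorch backend. ZhiJian unifies target architecture construction with PTM, tuning the target model with PTM, and PTM-based inference, so that deep learning practitioners can explore downstream tasks and identify complementary advantages among methods. ZhiJian is available at https://github.com/zhangyikaii/lamda-zhijian , which simplifies the use of pre-trained models and streamlines the model reuse process for researchers and developers.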

  • paper_url: http://arxiv.org/abs/2308.09156
  • repo_url: None
  • paper_authors: Omar Sharif, Madhusudan Basak, Tanzia Parvin, Ava Scharfstein, Alphonso Bradham, Jacob T. Borodovsky, Sarah E. Lord, Sarah Masud Preum
  • for: This paper analyzes health-related information-seeking on social media, specifically on Reddit, to understand treatment options and misconceptions related to Opioid Use Disorder (OUD).
  • methods: The authors use event-driven analysis to categorize health-related information-seeking on social media into events, such as treatment options, misconceptions, and knowledge gaps. They also develop TREAT-ISE, a dataset of Reddit posts annotated with the types of events related to recovery from OUD.
  • results: The authors establish a performance benchmark of 77.4% F1 score for classifying OUD-related information-seeking events on Reddit using machine learning and deep learning classifiers. They also investigate the performance and errors of ChatGPT on this task, providing insights into the capabilities and limitations of large language models.
    Abstract Social media sites have become a popular platform for individuals to seek and share health information. Despite the progress in natural language processing for social media mining, a gap remains in analyzing health-related texts on social discourse in the context of events. Event-driven analysis can offer insights into different facets of healthcare at an individual and collective level, including treatment options, misconceptions, knowledge gaps, etc. This paper presents a paradigm to characterize health-related information-seeking in social discourse through the lens of events. Events here are broad categories defined with domain experts that capture the trajectory of the treatment/medication. To illustrate the value of this approach, we analyze Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical global health concern. To the best of our knowledge, this is the first attempt to define event categories for characterizing information-seeking in OUD social discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel treatment information-seeking event dataset to analyze online discourse on an event-based framework. This dataset contains Reddit posts on information-seeking events related to recovery from OUD, where each post is annotated based on the type of events. We also establish a strong performance benchmark (77.4% F1 score) for the task by employing several machine learning and deep learning classifiers. Finally, we thoroughly investigate the performance and errors of ChatGPT on this task, providing valuable insights into the LLM's capabilities and ongoing characterization efforts.
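
Because TREAT-ISE is a multilabel dataset, each post can carry several event labels, and an F1 score has to aggregate over label sets. The sketch below shows one common choice, micro-averaged F1. The abstract does not say which averaging the reported 77.4% uses, so the averaging scheme is an assumption, and the label names are hypothetical placeholders, not the dataset's actual categories.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-post label sets (lists of sets, same length)."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, predicted))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, predicted))  # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical event labels for three posts.
gold = [{"event_a", "event_b"}, {"event_c"}, {"event_a"}]
predicted = [{"event_a"}, {"event_c", "event_b"}, {"event_a"}]
score = micro_f1(gold, predicted)
print(round(score, 3))
```

Micro averaging pools counts across all labels, so frequent event types dominate the score; a macro average would instead weight every event type equally.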

Linearity of Relation Decoding in Transformer Language Models

  • paper_url: http://arxiv.org/abs/2308.09124
  • repo_url: None
  • paper_authors: Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
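  • for: This study examines how the relational knowledge encoded in transformer language models (LMs) is represented, that is, how relations are computed from subject representations.
  • methods: The authors construct a first-order approximation to the LM from a single prompt and show that, for a subset of relations, the computation is well approximated by a single linear transformation on the subject representation.
  • results: Linear relation representations exist for a variety of factual, commonsense, and linguistic relations, but in many cases the LM predicts relational knowledge accurately without encoding it linearly. The results point to a simple, interpretable, but heterogeneously deployed knowledge representation strategy.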
    Abstract Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.
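    Summary Much of the knowledge in language models can be expressed as relations: between words and their synonyms, between entities and their attributes, and so on. For a subset of relations, this computation is well approximated by a single linear transformation on the subject representation. Such linear relation representations can be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. In many other cases, however, the LM's predictions capture relational knowledge accurately while that knowledge is not linearly encoded in its representations. The results therefore reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.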
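
A first-order approximation of a function around a point is its tangent (affine) map, so the idea in this entry can be illustrated on a toy function. In the sketch below, `toy_decoder` is a hypothetical stand-in for the LM computation that maps a subject representation to an output representation; the finite-difference Jacobian and the two-dimensional vectors are assumptions for illustration only and do not reproduce the paper's procedure on a real model.

```python
import math


def toy_decoder(s):
    """Hypothetical nonlinear map from a subject vector to an output vector."""
    x, y = s
    return [math.tanh(x + 0.5 * y), math.tanh(0.3 * x - y)]


def jacobian(f, s0, eps=1e-5):
    """Central-difference Jacobian of f at s0, returned as a list of rows."""
    out_dim = len(f(s0))
    rows = [[0.0] * len(s0) for _ in range(out_dim)]
    for j in range(len(s0)):
        plus = list(s0)
        plus[j] += eps
        minus = list(s0)
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(out_dim):
            rows[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return rows


def matvec(W, s):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]


def linear_relation(f, s0):
    """First-order approximation f(s) ~ W s + b, built at the single point s0."""
    W = jacobian(f, s0)
    b = [fi - wi for fi, wi in zip(f(s0), matvec(W, s0))]
    return W, b


def apply_linear(W, b, s):
    """Evaluate the affine map W s + b."""
    return [wi + bi for wi, bi in zip(matvec(W, s), b)]


s0 = [0.2, -0.1]                      # expansion point (the "single prompt")
W, b = linear_relation(toy_decoder, s0)
s_near = [0.25, -0.05]                # a nearby subject vector
approx = apply_linear(W, b, s_near)
exact = toy_decoder(s_near)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
print(max_error)
```

The approximation is exact at `s0` and degrades as the input moves away from it, which mirrors the entry's point that a linear map fits some relations well and others poorly.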

MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09115
  • repo_url: None
  • paper_authors: Mohd Zaki, Jayadeva, Mausam, N. M. Anoop Krishnan
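  • for: This work curates a benchmark dataset for evaluating how well language models understand key concepts in materials science, in support of knowledge bases that accelerate materials discovery.
  • methods: The authors assemble 650 challenging materials-domain questions, classify them by structure and by materials science subcategory, and evaluate GPT-3.5 and GPT-4 with zero-shot and chain-of-thought prompting.
  • results: GPT-4 performs best (about 62% accuracy), and chain-of-thought prompting gives no significant improvement. Error analysis finds that conceptual errors (about 64%) contribute more to the reduced performance of LLMs than computational errors (about 36%).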
    Abstract Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials domain that can evaluate the understanding of the key concepts by these language models. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials student who has cleared their undergraduate degree. We classify these questions based on their structure and the materials science domain-based subcategories. Further, we evaluate the performance of GPT-3.5 and GPT-4 models on solving these questions via zero-shot and chain of thought prompting. It is observed that GPT-4 gives the best performance (~62% accuracy) as compared to GPT-3.5. Interestingly, in contrast to the general observation, no significant improvement in accuracy is observed with the chain of thought prompting. To evaluate the limitations, we performed an error analysis, which revealed conceptual errors (~64%) as the major contributor compared to computational errors (~36%) towards the reduced performance of LLMs. We hope that the dataset and analysis performed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.
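    Summary Information extraction and textual comprehension from the materials literature are key to building knowledge bases that accelerate materials discovery. Language models have shown that they can answer domain-specific questions and retrieve information from knowledge bases, but the materials domain has no benchmark dataset for evaluating their understanding of key concepts. This work curates 650 challenging materials-domain questions that require the knowledge and skills of a student who has completed an undergraduate degree in materials, and classifies them by structure and by materials science subcategory. GPT-3.5 and GPT-4 are then evaluated with zero-shot and chain-of-thought prompting. GPT-4 performs best (about 62% accuracy) compared with GPT-3.5, and, contrary to the usual observation, chain-of-thought prompting brings no significant improvement. An error analysis finds that conceptual errors (about 64%) are the main cause of the reduced performance of LLMs, ahead of computational errors (about 36%). The authors hope the dataset and analysis will promote research on better materials-science-specific LLMs and information extraction strategies.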

mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.09073
  • repo_url: None
  • paper_authors: Ying Mo, Jian Yang, Jiahao Liu, Qifan Wang, Ruoyu Chen, Jingang Wang, Zhoujun Li
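  • for: Improve Cross-Lingual Named Entity Recognition (CrossNER), whose performance is uneven because multilingual corpora are scarce, especially for non-English data.
  • methods: The paper proposes Multi-view Contrastive Learning for Cross-Lingual Named Entity Recognition (mCL-NER), which reframes the task as recognizing relations between token pairs and aligns both semantic and token-level representations across languages.
  • results: Experiments on the XTREME benchmark across 40 languages show that mCL-NER outperforms prior data-driven and model-based methods, improving $F_1$ by nearly +2.0.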
    Abstract Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches. It achieves a substantial increase of nearly +2.0 $F_1$ scores across a broad spectrum and establishes itself as the new state-of-the-art performer.
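    Summary Cross-lingual named entity recognition (CrossNER) suffers from uneven performance because multilingual corpora are scarce, especially for non-English data. Prior work concentrates on data-driven transfer methods, while aligning both semantic and token-level representations across languages remains underexplored. This paper proposes Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER). The CrossNER task is reframed as recognizing relations between pairs of tokens, which exploits the contextual cues in token-to-token connections within entities and allows representations to be aligned across languages. A multi-view contrastive learning framework covers semantic contrasts among source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. Enforcing agreement in both the semantic and relational spaces narrows the gap between source sentences and their codeswitched and target counterparts, and the alignment extends to relations between tokens, improving the projection of entities across languages. CrossNER is further strengthened by combining self-training with labeled source data and unlabeled target data. Experiments on the XTREME benchmark, spanning 40 languages, show that mCL-NER improves on prior data-driven and model-based approaches by nearly +2.0 $F_1$ and sets a new state of the art.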
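
The semantic contrast between a source sentence and its codeswitched or target counterpart can be illustrated with a generic contrastive objective. The sketch below uses an InfoNCE-style loss, which is an assumption: the abstract does not state the exact loss, and the embeddings, the temperature, and the function names are hypothetical values chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]


# Hypothetical sentence embeddings: a source sentence, its codeswitched
# counterpart, and two unrelated sentences from the same batch.
source = [1.0, 0.0]
codeswitched = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.0]]

aligned_loss = info_nce(source, codeswitched, unrelated)
misaligned_loss = info_nce(source, unrelated[0], [codeswitched, unrelated[1]])
print(aligned_loss, misaligned_loss)
```

The loss is small when the anchor sits closest to its counterpart and large otherwise, so minimizing it pulls the paired views together; the same form could be applied to token-pair relation representations for the relational view.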