cs.CL - 2023-09-23

Hierarchical attention interpretation: an interpretable speech-level transformer for bi-modal depression detection

  • paper_url: http://arxiv.org/abs/2309.13476
  • repo_url: None
  • paper_authors: Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia
  • for: The paper aims to improve the accuracy and interpretability of automatic depression detection tools using speech, which can support early screening for depression.
  • methods: The proposed bi-modal speech-level transformer avoids segment-level labelling and provides both speech-level and sentence-level interpretations via gradient-weighted attention maps.
  • results: The proposed model outperforms a model that learns at the segment level, with improved accuracy and interpretability, and can identify the sentences and text tokens within a given speech that are most indicative of depression.
    Abstract Depression is a common mental disorder. Automatic depression detection tools using speech, enabled by machine learning, help early screening of depression. This paper addresses two limitations that may hinder the clinical implementations of such tools: noise resulting from segment-level labelling and a lack of model interpretability. We propose a bi-modal speech-level transformer to avoid segment-level labelling and introduce a hierarchical interpretation approach to provide both speech-level and sentence-level interpretations, based on gradient-weighted attention maps derived from all attention layers to track interactions between input features. We show that the proposed model outperforms a model that learns at a segment level ($p$=0.854, $r$=0.947, $F1$=0.897 compared to $p$=0.732, $r$=0.808, $F1$=0.768). For model interpretation, using one true positive sample, we show which sentences within a given speech are most relevant to depression detection; and which text tokens and Mel-spectrogram regions within these sentences are most relevant to depression detection. These interpretations allow clinicians to verify the validity of predictions made by depression detection tools, promoting their clinical implementations.
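The interpretation approach rests on gradient-weighted attention maps aggregated across all attention layers. A minimal sketch of that general technique follows, in the style of attention-rollout relevance propagation; the paper's exact aggregation rule may differ, and the hook-based capture of attentions and gradients is assumed rather than taken from the paper.

```python
import torch

def gradient_weighted_attention(attn_maps, attn_grads):
    """Aggregate token-level relevance across layers by weighting each
    attention map with its gradient w.r.t. the prediction logit.

    attn_maps, attn_grads: one tensor per layer, each of shape
    (heads, seq_len, seq_len), captured with forward/backward hooks
    after calling backward() on the depression logit.
    """
    seq_len = attn_maps[0].shape[-1]
    relevance = torch.eye(seq_len)  # each token starts relevant only to itself
    for attn, grad in zip(attn_maps, attn_grads):
        # keep positively contributing attention, averaged over heads
        weighted = (grad * attn).clamp(min=0).mean(dim=0)
        relevance = relevance + weighted @ relevance  # propagate layer by layer
    return relevance
```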

Grounding Description-Driven Dialogue State Trackers with Knowledge-Seeking Turns

  • paper_url: http://arxiv.org/abs/2309.13448
  • repo_url: None
  • paper_authors: Alexandru Coca, Bo-Hsiang Tseng, Jinghong Chen, Weizhe Lin, Weixuan Zhang, Tisha Anders, Bill Byrne
  • for: Improve the robustness and generalization of dialogue state tracking models
  • methods: Ground the state tracker in knowledge-seeking turns collected from the dialogue corpus as well as the schema, and include these turns in prompts during fine-tuning and inference
  • results: Compared to the baseline, the method yields large improvements in average joint goal accuracy and schema sensitivity on SGD and SGD-X
    Abstract Schema-guided dialogue state trackers can generalise to new domains without further training, yet they are sensitive to the writing style of the schemata. Augmenting the training set with human or synthetic schema paraphrases improves the model robustness to these variations but can be either costly or difficult to control. We propose to circumvent these issues by grounding the state tracking model in knowledge-seeking turns collected from the dialogue corpus as well as the schema. Including these turns in prompts during finetuning and inference leads to marked improvements in model robustness, as demonstrated by large average joint goal accuracy and schema sensitivity improvements on SGD and SGD-X.
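To make the grounding idea concrete, here is a hypothetical sketch of how knowledge-seeking turns might be folded into a DST prompt; the slot names, turns, and prompt format are invented for illustration and are not the paper's actual templates.

```python
def build_dst_prompt(schema_description, knowledge_seeking_turns, dialogue_history):
    """Assemble a schema-guided DST prompt grounded with knowledge-seeking
    turns, i.e. utterances from the corpus or schema that ask about a slot,
    anchoring slot names in natural dialogue language."""
    grounding = "\n".join(f"- {slot}: {turn}" for slot, turn in knowledge_seeking_turns)
    history = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue_history)
    return (
        f"Schema:\n{schema_description}\n\n"
        f"Questions that ask about each slot:\n{grounding}\n\n"
        f"Dialogue:\n{history}\n\nDialogue state:"
    )

prompt = build_dst_prompt(
    "restaurant: find a place to eat (slots: food, area, pricerange)",
    [("food", "What kind of food would you like to eat?"),
     ("area", "Which part of town should the restaurant be in?")],
    [("user", "I need a cheap Italian place in the centre.")],
)
```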

My Science Tutor (MyST) – A Large Corpus of Children’s Conversational Speech

  • paper_url: http://arxiv.org/abs/2309.13347
  • repo_url: None
  • paper_authors: Sameer S. Pradhan, Ronald A. Cole, Wayne H. Ward
  • for: This paper describes the development of the MyST corpus, a large collection of children’s conversational speech, which can be used to improve automatic speech recognition algorithms, build and evaluate conversational AI agents for education, and develop multimodal applications to improve children’s learning.
  • methods: The MyST corpus was developed as part of the My Science Tutor project and comprises approximately 400 hours of speech (some 230K utterances from about 10.5K virtual tutor sessions with around 1.3K third, fourth, and fifth grade students), of which 100K utterances have been transcribed so far. The corpus is available under a Creative Commons license for non-commercial use and separately for commercial use.
  • results: To date, ten organizations have licensed the corpus for commercial use, and approximately 40 university and other not-for-profit research groups have downloaded it. The corpus has the potential to improve children's learning of and excitement about science, and to help them learn remotely.
    Abstract This article describes the MyST corpus developed as part of the My Science Tutor project -- one of the largest collections of children's conversational speech comprising approximately 400 hours, spanning some 230K utterances across about 10.5K virtual tutor sessions by around 1.3K third, fourth and fifth grade students. 100K of all utterances have been transcribed thus far. The corpus is freely available (https://myst.cemantix.org) for non-commercial use using a creative commons license. It is also available for commercial use (https://boulderlearning.com/resources/myst-corpus/). To date, ten organizations have licensed the corpus for commercial use, and approximately 40 university and other not-for-profit research groups have downloaded the corpus. It is our hope that the corpus can be used to improve automatic speech recognition algorithms, build and evaluate conversational AI agents for education, and together help accelerate development of multimodal applications to improve children's excitement and learning about science, and help them learn remotely.

BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.13345
  • repo_url: https://github.com/rucaibox/bamboo
  • paper_authors: Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen
  • for: To comprehensively evaluate the long-text understanding capacities of large language models (LLMs), this work proposes BAMBOO, a multi-task long-context benchmark.
  • methods: Five long-context models are evaluated on BAMBOO, which covers 10 datasets from 5 long-text understanding tasks, and their performance is compared across tasks.
  • results: Existing long-context models excel on some tasks but perform poorly on others; the paper also points out future directions for enhancing long-text modeling capacities.
    Abstract Large language models (LLMs) have achieved dramatic proficiency over NLP tasks with normal length. Recently, multiple studies have committed to extending the context length and enhancing the long text modeling capabilities of LLMs. To comprehensively evaluate the long context ability of LLMs, we propose BAMBOO, a multi-task long context benchmark. BAMBOO has been designed with four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 different long text understanding tasks, i.e. question answering, hallucination detection, text sorting, language modeling, and code completion, to cover core capacities and various domains of LLMs. We conduct experiments with five long context models on BAMBOO and further discuss four key research questions of long text. We also qualitatively analyze current long context models and point out future directions for enhancing long text modeling capacities. We release our data, prompts, and code at https://github.com/RUCAIBox/BAMBOO.
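As a sketch of how a BAMBOO-style evaluation loop might look, the snippet below assumes JSONL records with `content`, `question`, and `answer` fields and exact-match scoring; the benchmark's real formats and metrics are defined in the repository.

```python
import json

def evaluate_long_context(model_fn, dataset_path):
    """Run a text-in/text-out model over one long-context task and report
    exact-match accuracy. Field names are hypothetical placeholders."""
    correct = total = 0
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            pred = model_fn(ex["content"], ex["question"])  # long text + query
            correct += int(pred.strip() == ex["answer"].strip())
            total += 1
    return correct / total
```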

From Text to Source: Results in Detecting Large Language Model-Generated Content

  • paper_url: http://arxiv.org/abs/2309.13322
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Wissam Antoun, Benoît Sagot, Djamé Seddah
  • for: This work investigates Cross-Model Detection: whether a classifier trained to distinguish source-LLM-generated text from human-written text can also detect text generated by a target LLM without further training.
  • methods: The study covers various LLM sizes and families and assesses the impact of conversational fine-tuning techniques on classifier generalization.
  • results: Classifier effectiveness is inversely related to model size: larger LLMs are harder to detect, especially when the classifier is trained on data from smaller models. Training on data from similarly sized LLMs improves detection of larger models but may hurt performance on smaller ones. Model attribution experiments further show that LLM-generated text carries detectable signatures.
    Abstract The widespread use of Large Language Models (LLMs), celebrated for their ability to generate human-like text, has raised concerns about misinformation and ethical implications. Addressing these concerns necessitates the development of robust methods to detect and attribute text generated by LLMs. This paper investigates "Cross-Model Detection," evaluating whether a classifier trained to distinguish between source LLM-generated and human-written text can also detect text from a target LLM without further training. The study comprehensively explores various LLM sizes and families, and assesses the impact of conversational fine-tuning techniques on classifier generalization. The research also delves into Model Attribution, encompassing source model identification, model family classification, and model size classification. Our results reveal several key findings: a clear inverse relationship between classifier effectiveness and model size, with larger LLMs being more challenging to detect, especially when the classifier is trained on data from smaller models. Training on data from similarly sized LLMs can improve detection performance from larger models but may lead to decreased performance when dealing with smaller models. Additionally, model attribution experiments show promising results in identifying source models and model families, highlighting detectable signatures in LLM-generated text. Overall, our study contributes valuable insights into the interplay of model size, family, and training data in LLM detection and attribution.
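A minimal baseline for the cross-model detection setup, assuming a simple TF-IDF classifier (the paper's detector architecture is not reproduced here): train on human vs. source-LLM text, then measure how often target-LLM text is flagged without retraining.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def cross_model_detection_rate(human_texts, source_lm_texts, target_lm_texts):
    """Train a human-vs-machine classifier on source-LLM text and report
    the fraction of target-LLM texts it flags as machine-generated."""
    detector = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
        LogisticRegression(max_iter=1000),
    )
    texts = list(human_texts) + list(source_lm_texts)
    labels = [0] * len(human_texts) + [1] * len(source_lm_texts)
    detector.fit(texts, labels)
    preds = detector.predict(list(target_lm_texts))
    return preds.mean()  # detection rate on the unseen target LLM
```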

GlotScript: A Resource and Tool for Low Resource Writing System Identification

  • paper_url: http://arxiv.org/abs/2309.13320
  • repo_url: https://github.com/cisnlp/GlotScript
  • paper_authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze
  • for: Identifying the writing systems of low-resource languages
  • methods: Aggregating existing writing-system resources (GlotScript-R) and covering all 161 Unicode 15.0 scripts for identification (GlotScript-T)
  • results: Supports cleaning multilingual corpora and analyzing the tokenization of language models
    Abstract We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript supports cleaning multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
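The underlying mechanics of script identification can be illustrated with Unicode script properties; this toy version uses the third-party `regex` module and a handful of scripts, whereas GlotScript-T covers all 161 Unicode 15.0 scripts and returns ISO 15924 codes (its actual API may differ).

```python
import regex  # pip install regex; supports \p{Script=...} properties

# Illustrative subset only: ISO 15924 code -> Unicode script name.
SCRIPTS = {"Latn": "Latin", "Cyrl": "Cyrillic", "Arab": "Arabic", "Hani": "Han"}

def script_distribution(text):
    """Return the fraction of matched characters per script, mirroring the
    kind of distribution a writing-system identifier reports."""
    counts = {code: len(regex.findall(rf"\p{{Script={name}}}", text))
              for code, name in SCRIPTS.items()}
    total = sum(counts.values()) or 1
    return {code: n / total for code, n in counts.items() if n}

print(script_distribution("Hello Привет 你好"))  # fractions for Latn, Cyrl, Hani
```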

Spanish Resource Grammar version 2023

  • paper_url: http://arxiv.org/abs/2309.13318
  • repo_url: None
  • paper_authors: Olga Zamaraeva, Carlos Gómez-Rodríguez
  • for: This grammar is intended for linguistic research and for the development of natural language processing applications.
  • methods: The new version uses the latest Freeling morphological analyzer and tagger and is accompanied by a manually verified treebank and a list of documented issues.
  • results: The paper opens a new research line by testing the grammar's coverage and overgeneration on a small portion of a learner corpus.
    Abstract We present the latest version of the Spanish Resource Grammar (SRG). The new SRG uses the recent version of Freeling morphological analyzer and tagger and is accompanied by a manually verified treebank and a list of documented issues. We also present the grammar's coverage and overgeneration on a small portion of a learner corpus, an entirely new research line with respect to the SRG. The grammar can be used for linguistic research, such as for empirically driven development of syntactic theory, and in natural language processing applications such as computer-assisted language learning. Finally, as the treebanks grow, they can be used for training high-quality semantic parsers and other systems which may benefit from precise and detailed semantics.

Calibrating LLM-Based Evaluator

  • paper_url: http://arxiv.org/abs/2309.13308
  • repo_url: None
  • paper_authors: Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
  • for: The paper proposes a method to automatically calibrate and align an LLM-based evaluator of natural language generation quality toward human preference.
  • methods: A multi-stage, gradient-free approach: the language model first drafts scoring criteria itself via in-context learning on different few-shot examples, then the best performers are selected and re-drafted through self-refinement.
  • results: Experiments on multiple text quality evaluation datasets show a significant improvement in correlation with expert evaluation; a comprehensive qualitative analysis offers insights into what makes scoring criteria effective.
    Abstract Recent advancements in large language models (LLMs) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. However, hindered by the closed-source or high computational demand to host and tune, there is a lack of practice to further calibrate an off-the-shelf LLM-based evaluator towards better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.
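A compact sketch of the multi-stage, gradient-free calibration loop described in the abstract: draft criteria by in-context learning, keep the drafts that agree best with human labels, and self-refine. `llm` is a hypothetical text-in/text-out callable, and the agreement measure here is simple accuracy standing in for whatever correlation metric the paper uses.

```python
def agreement(llm, criteria, labeled_samples):
    """Score each sample under the candidate criteria and measure how often
    the predicted score matches the human label (a stand-in metric)."""
    hits = 0
    for text, human_score in labeled_samples:
        pred = llm(f"Criteria:\n{criteria}\n\nScore this text from 1-5:\n{text}")
        hits += int(pred.strip() == str(human_score))
    return hits / len(labeled_samples)

def autocalibrate(llm, few_shot_pools, labeled_samples, n_rounds=2, keep_top=3):
    """Multi-stage, gradient-free calibration: draft, select, self-refine."""
    criteria = [llm("Draft scoring criteria for text quality given these "
                    f"scored examples:\n{pool}") for pool in few_shot_pools]
    for _ in range(n_rounds):
        best = sorted(criteria, key=lambda c: agreement(llm, c, labeled_samples),
                      reverse=True)[:keep_top]
        criteria = [llm("Revise these scoring criteria so they better match "
                        f"expert judgments:\n{c}") for c in best]
    return max(criteria, key=lambda c: agreement(llm, c, labeled_samples))
```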

OATS: Opinion Aspect Target Sentiment Quadruple Extraction Dataset for Aspect-Based Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2309.13297
  • repo_url: None
  • paper_authors: Siva Uday Sampreeth Chebolu, Franck Dernoncourt, Nedim Lipka, Thamar Solorio
  • for: This work targets aspect-based sentiment analysis of user-generated reviews, i.e. understanding sentiments toward specific elements within textual content.
  • methods: The paper introduces the OATS dataset, which covers three new domains and contains 20,000 sentence-level quadruples and 13,000 review-level tuples; in-domain and cross-domain experiments explore various ABSA subtasks on OATS.
  • results: The experiments establish initial baselines on OATS and show how it addresses gaps in existing ABSA resources, such as the recurrent focus on familiar domains like restaurant and laptop reviews.
    Abstract Aspect-based sentiment Analysis (ABSA) delves into understanding sentiments specific to distinct elements within textual content. It aims to analyze user-generated reviews to determine a) the target entity being reviewed, b) the high-level aspect to which it belongs, c) the sentiment words used to express the opinion, and d) the sentiment expressed toward the targets and the aspects. While various benchmark datasets have fostered advancements in ABSA, they often come with domain limitations and data granularity challenges. Addressing these, we introduce the OATS dataset, which encompasses three fresh domains and consists of 20,000 sentence-level quadruples and 13,000 review-level tuples. Our initiative seeks to bridge specific observed gaps: the recurrent focus on familiar domains like restaurants and laptops, limited data for intricate quadruple extraction tasks, and an occasional oversight of the synergy between sentence and review-level sentiments. Moreover, to elucidate OATS's potential and shed light on various ABSA subtasks that OATS can solve, we conducted in-domain and cross-domain experiments, establishing initial baselines. We hope the OATS dataset augments current resources, paving the way for an encompassing exploration of ABSA.
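The quadruples OATS annotates can be represented with a small data structure; the example values below are invented for illustration, not drawn from the dataset.

```python
from dataclasses import dataclass

@dataclass
class SentimentQuadruple:
    """One ABSA quadruple: the reviewed target, its high-level aspect
    category, the opinion expression, and the sentiment polarity."""
    target: str     # entity under review, e.g. "battery"
    aspect: str     # aspect category, e.g. "laptop#power"
    opinion: str    # sentiment expression, e.g. "dies quickly"
    sentiment: str  # "positive" | "negative" | "neutral"

example = SentimentQuadruple("battery", "laptop#power", "dies quickly", "negative")
```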

Natural Language Processing for Requirements Formalization: How to Derive New Approaches?

  • paper_url: http://arxiv.org/abs/2309.13272
  • repo_url: https://github.com/ifak-prototypes/nlp_reform
  • paper_authors: Viju Sudhi, Libin Kutty, Robin Gröpler
  • for: This study aims to provide a semi-automated requirements formalization approach, helping industry and researchers automate the software development and testing process as far as possible.
  • methods: Natural language processing (NLP) techniques are applied, including the iterative development of rule sets, to semi-automate the formalization of requirements; the resulting requirements models are represented as human- and machine-readable pseudocode.
  • results: Using current pre-trained NLP models reduces the effort of creating rule sets and adapts easily to specific use cases and domains, as demonstrated on two industrial use cases from the automotive and railway domains.
    Abstract It is a long-standing desire of industry and research to automate the software development and testing process as much as possible. In this process, requirements engineering (RE) plays a fundamental role for all other steps that build on it. Model-based design and testing methods have been developed to handle the growing complexity and variability of software systems. However, major effort is still required to create specification models from a large set of functional requirements provided in natural language. Numerous approaches based on natural language processing (NLP) have been proposed in the literature to generate requirements models using mainly syntactic properties. Recent advances in NLP show that semantic quantities can also be identified and used to provide better assistance in the requirements formalization process. In this work, we present and discuss principal ideas and state-of-the-art methodologies from the field of NLP in order to guide the readers on how to create a set of rules and methods for the semi-automated formalization of requirements according to their specific use case and needs. We discuss two different approaches in detail and highlight the iterative development of rule sets. The requirements models are represented in a human- and machine-readable format in the form of pseudocode. The presented methods are demonstrated on two industrial use cases from the automotive and railway domains. It shows that using current pre-trained NLP models requires less effort to create a set of rules and can be easily adapted to specific use cases and domains. In addition, findings and shortcomings of this research area are highlighted and an outlook on possible future developments is given.
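A toy rule in the spirit of the paper's rule-based formalization, mapping a "shall" requirement onto if-then pseudocode via a spaCy dependency parse; the real rule sets in the nlp_reform repository are far more elaborate, and this pattern is only illustrative.

```python
import spacy  # python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def requirement_to_pseudocode(sentence):
    """Map 'If <condition>, the <actor> shall <verb> <object>' onto an
    if-then pseudocode template using the dependency parse."""
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")               # main verb
    actor = next((t.text for t in root.lefts if t.dep_ == "nsubj"), "system")
    obj = next((" ".join(w.text for w in t.subtree)
                for t in root.rights if t.dep_ == "dobj"), "")
    condition = next((" ".join(w.text for w in t.subtree)
                      for t in doc if t.dep_ == "advcl"), None)
    action = f"{actor}.{root.lemma_}({obj!r})"
    return f"IF {condition}:\n    {action}" if condition else action

print(requirement_to_pseudocode(
    "If the speed exceeds 120, the controller shall activate the brake."))
```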

A Survey of Document-Level Information Extraction

  • paper_url: http://arxiv.org/abs/2309.13249
  • repo_url: https://github.com/Don-No7/Hack-SQL
  • paper_authors: Hanwen Zheng, Sijia Wang, Lifu Huang
  • for: This paper is a systematic review of recent document-level information extraction (IE) literature, intended to give NLP researchers insights for further improving document-level IE performance.
  • methods: The authors conduct a thorough error analysis of current state-of-the-art algorithms and identify their limitations as well as the remaining challenges of the task.
  • results: Labeling noise, entity coreference resolution, and a lack of reasoning are found to severely affect document-level IE performance.
    Abstract Document-level information extraction (IE) is a crucial task in natural language processing (NLP). This paper conducts a systematic review of recent document-level IE literature. In addition, we conduct a thorough error analysis with current state-of-the-art algorithms and identify their limitations as well as the remaining challenges for the task of document-level IE. According to our findings, labeling noises, entity coreference resolution, and lack of reasoning, severely affect the performance of document-level IE. The objective of this survey paper is to provide more insights and help NLP researchers to further enhance document-level IE performance.

ChEDDAR: Student-ChatGPT Dialogue in EFL Writing Education

  • paper_url: http://arxiv.org/abs/2309.13243
  • repo_url: None
  • paper_authors: Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Tak Yeon Lee, So-Yeon Ahn, Alice Oh
  • for: This study examines large-scale, real-world interactions between students and AI systems, to advance the integration of generative AI in education.
  • methods: A semester-long experiment with 212 college students enrolled in English as a Foreign Language (EFL) writing courses, who revised their essays through dialogues with ChatGPT. The collected data include conversation logs, utterance-level essay edit histories, self-rated satisfaction, student intents, and session-level pre- and post-surveys documenting objectives and overall experience.
  • results: The analysis relates students' usage patterns and perceptions of generative AI to their intent and satisfaction, and establishes baseline results for two pivotal tasks in educational task-oriented dialogue: intent detection and satisfaction estimation. The paper suggests further research on integrating generative AI into education and outlines potential scenarios for ChEDDAR, which is publicly available at https://github.com/zeunie/ChEDDAR.
    Abstract The integration of generative AI in education is expanding, yet empirical analyses of large-scale, real-world interactions between students and AI systems still remain limited. In this study, we present ChEDDAR, ChatGPT & EFL Learner's Dialogue Dataset As Revising an essay, which is collected from a semester-long longitudinal experiment involving 212 college students enrolled in English as Foreign Langauge (EFL) writing courses. The students were asked to revise their essays through dialogues with ChatGPT. ChEDDAR includes a conversation log, utterance-level essay edit history, self-rated satisfaction, and students' intent, in addition to session-level pre-and-post surveys documenting their objectives and overall experiences. We analyze students' usage patterns and perceptions regarding generative AI with respect to their intent and satisfaction. As a foundational step, we establish baseline results for two pivotal tasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. We finally suggest further research to refine the integration of generative AI into education settings, outlining potential scenarios utilizing ChEDDAR. ChEDDAR is publicly available at https://github.com/zeunie/ChEDDAR.
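For the intent-detection baseline task, a hypothetical zero-shot starting point is sketched below; the candidate intent labels are invented examples, not ChEDDAR's actual intent taxonomy.

```python
from transformers import pipeline

# Zero-shot intent detection over a student utterance; labels are invented.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
intents = ["request feedback", "ask for grammar correction",
           "brainstorm ideas", "off-task chat"]
utterance = "Can you check whether my thesis statement is clear?"
result = classifier(utterance, candidate_labels=intents)
print(result["labels"][0])  # top-ranked intent
```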

User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue

  • paper_url: http://arxiv.org/abs/2309.13233
  • repo_url: None
  • paper_authors: Sam Davidson, Salvatore Romeo, Raphael Shu, James Gung, Arshit Gupta, Saab Mansour, Yi Zhang
  • for: To move toward automated evaluation of task-oriented dialogue (TOD) systems, removing the need for human evaluation at multiple stages and iterations of the development process.
  • methods: A novel user simulator built with recently developed large pretrained language models (LLMs); rather than fine-tuning the LLMs on existing TOD datasets, in-context learning prompts them to generate robust, linguistically diverse output that simulates the behavior of human interlocutors.
  • results: The simulator interacts effectively with several TOD systems, especially on single-intent conversational goals, while generating more lexically and syntactically diverse output than previous simulators that rely on fine-tuned models.
    Abstract One of the major impediments to the development of new task-oriented dialogue (TOD) systems is the need for human evaluation at multiple stages and iterations of the development process. In an effort to move toward automated evaluation of TOD, we propose a novel user simulator built using recently developed large pretrained language models (LLMs). In order to increase the linguistic diversity of our system relative to the related previous work, we do not fine-tune the LLMs used by our system on existing TOD datasets; rather we use in-context learning to prompt the LLMs to generate robust and linguistically diverse output with the goal of simulating the behavior of human interlocutors. Unlike previous work, which sought to maximize goal success rate (GSR) as the primary metric of simulator performance, our goal is a system which achieves a GSR similar to that observed in human interactions with TOD systems. Using this approach, our current simulator is effectively able to interact with several TOD systems, especially on single-intent conversational goals, while generating lexically and syntactically diverse output relative to previous simulators that rely upon fine-tuned models. Finally, we collect a Human2Bot dataset of humans interacting with the same TOD systems with which we experimented in order to better quantify these achievements.
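The in-context simulation recipe can be sketched as a single prompt assembly step; `llm` is a hypothetical text-completion callable, and the prompt wording below is an invented illustration rather than the paper's template.

```python
def simulate_user_turn(llm, goal, example_dialogues, history):
    """Prompt a pretrained LLM to play the user of a TOD system: a goal,
    a few example dialogues for in-context learning, and the current
    conversation so far."""
    prompt = (
        "You are a customer talking to a booking assistant.\n"
        f"Your goal: {goal}\n\n"
        "Example conversations:\n" + "\n---\n".join(example_dialogues) +
        "\n\nCurrent conversation:\n" +
        "\n".join(f"{speaker}: {utt}" for speaker, utt in history) +
        "\nuser:"
    )
    return llm(prompt).strip()
```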

Unify word-level and span-level tasks: NJUNLP’s Participation for the WMT2023 Quality Estimation Shared Task

  • paper_url: http://arxiv.org/abs/2309.13230
  • repo_url: https://github.com/njunlp/njuqe
  • paper_authors: Xiang Geng, Zhejian Lai, Yu Zhang, Shimin Tao, Hao Yang, Jiajun Chen, Shujian Huang
  • for: This work presents the NJUNLP team's submissions to the WMT 2023 Quality Estimation (QE) shared task, exploring pseudo-data methods based on the NJUQE framework to improve QE for machine translation.
  • methods: Pseudo MQM data is generated from parallel data of the WMT translation task; an XLM-R large model is pre-trained on this pseudo QE data and then fine-tuned on real QE data, jointly learning sentence-level scores and word-level tags at both stages.
  • results: The models achieved the best results for the English-German language pair on both the word-level quality prediction and fine-grained error span detection sub-tasks, by a considerable margin.
    Abstract We introduce the submissions of the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task. Our team submitted predictions for the English-German language pair on all two sub-tasks: (i) sentence- and word-level quality prediction; and (ii) fine-grained error span detection. This year, we further explore pseudo data methods for QE based on NJUQE framework (https://github.com/NJUNLP/njuqe). We generate pseudo MQM data using parallel data from the WMT translation task. We pre-train the XLMR large model on pseudo QE data, then fine-tune it on real QE data. At both stages, we jointly learn sentence-level scores and word-level tags. Empirically, we conduct experiments to find the key hyper-parameters that improve the performance. Technically, we propose a simple method that covert the word-level outputs to fine-grained error span results. Overall, our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks by a considerable margin.
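The joint sentence-score and word-tag setup can be sketched as a two-head model over XLM-R; this is a generic sketch of the idea, not the NJUQE implementation (see the repository for the real code).

```python
import torch.nn as nn
from transformers import AutoModel

class JointQEModel(nn.Module):
    """Two QE heads over one encoder: a regression head on the <s> token
    for the sentence-level score and a per-token head for OK/BAD tags.
    Training would sum an MSE loss (scores) and a cross-entropy loss (tags),
    first on pseudo MQM data and then on real QE data."""
    def __init__(self, name="xlm-roberta-large", n_tags=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.score_head = nn.Linear(hidden, 1)     # sentence-level quality
        self.tag_head = nn.Linear(hidden, n_tags)  # word-level OK/BAD

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.score_head(h[:, 0]).squeeze(-1), self.tag_head(h)
```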

COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs

  • paper_url: http://arxiv.org/abs/2309.14356
  • repo_url: None
  • paper_authors: Tiep Le, Vasudev Lal, Phillip Howard
  • for: The paper proposes a scalable framework for automatically generating multimodal counterfactual examples, which are valuable for evaluating and improving the robustness of models to spurious correlations in datasets.
  • methods: Text-to-image diffusion models are used to generate counterfactual image-text pairs, yielding COCO-Counterfactuals, a multimodal counterfactual dataset based on MS-COCO.
  • results: Human evaluations validate the quality of COCO-Counterfactuals; existing multimodal models are challenged by the counterfactual image-text pairs, and using them for training data augmentation improves the out-of-domain generalization of multimodal vision-language models.
    Abstract Counterfactual examples have proven to be valuable in the field of natural language processing (NLP) for both evaluating and improving the robustness of language models to spurious correlations in datasets. Despite their demonstrated utility for NLP, multimodal counterfactual examples have been relatively unexplored due to the difficulty of creating paired image-text data with minimal counterfactual changes. To address this challenge, we introduce a scalable framework for automatic generation of counterfactual examples using text-to-image diffusion models. We use our framework to create COCO-Counterfactuals, a multimodal counterfactual dataset of paired image and text captions based on the MS-COCO dataset. We validate the quality of COCO-Counterfactuals through human evaluations and show that existing multimodal models are challenged by our counterfactual image-text pairs. Additionally, we demonstrate the usefulness of COCO-Counterfactuals for improving out-of-domain generalization of multimodal vision-language models via training data augmentation.
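The generation recipe can be illustrated with the diffusers library: render a caption and its minimally edited counterfactual from the same seed so that everything except the edited concept stays as similar as possible. This is a sketch of the general idea under assumed model and prompts; the actual COCO-Counterfactuals pipeline, including any filtering, may differ.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

caption = "a cat sitting on a wooden bench"
counterfactual = "a dog sitting on a wooden bench"  # single-concept edit

generator = torch.Generator("cuda").manual_seed(0)
original = pipe(caption, generator=generator).images[0]
generator = torch.Generator("cuda").manual_seed(0)  # same seed for the pair
edited = pipe(counterfactual, generator=generator).images[0]
```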