cs.CL - 2023-10-27

Evaluating Cross-Domain Text-to-SQL Models and Benchmarks

  • paper_url: http://arxiv.org/abs/2310.18538
  • repo_url: None
  • paper_authors: Mohammadreza Pourreza, Davood Rafiei
  • for: This work evaluates the performance of text-to-SQL models on several prominent cross-domain benchmarks and re-evaluates top-performing models to assess their true performance.
  • methods: The study evaluates the SQL queries and models through manual evaluation and by rewriting queries into equivalent expressions.
  • results: Because the provided samples admit multiple interpretations, attaining perfect performance on these benchmarks is infeasible. The true performance of the models was underestimated, and their relative ranking changed after re-evaluation. Most notably, a recent GPT4-based model surpassed the gold-standard reference queries in the Spider benchmark in human evaluation, highlighting the importance of interpreting benchmark evaluations cautiously.
    Abstract Text-to-SQL benchmarks play a crucial role in evaluating the progress made in the field and the ranking of different models. However, accurately matching a model-generated SQL query to a reference SQL query in a benchmark fails for various reasons, such as underspecified natural language queries, inherent assumptions in both model-generated and reference queries, and the non-deterministic nature of SQL output under certain conditions. In this paper, we conduct an extensive study of several prominent cross-domain text-to-SQL benchmarks and re-evaluate some of the top-performing models within these benchmarks, by both manually evaluating the SQL queries and rewriting them in equivalent expressions. Our evaluation reveals that attaining a perfect performance on these benchmarks is unfeasible due to the multiple interpretations that can be derived from the provided samples. Furthermore, we find that the true performance of the models is underestimated and their relative performance changes after a re-evaluation. Most notably, our evaluation reveals a surprising discovery: a recent GPT4-based model surpasses the gold standard reference queries in the Spider benchmark in our human evaluation. This finding highlights the importance of interpreting benchmark evaluations cautiously, while also acknowledging the critical role of additional independent evaluations in driving advancements in the field.
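
The re-evaluation described above hinges on deciding when two SQL queries are equivalent. Below is a minimal sketch of one common proxy, execution-based comparison on a concrete database; the schema, data, and queries are hypothetical examples, and matching on a single database instance does not prove equivalence in general.

```python
import sqlite3

def results_match(conn, predicted_sql: str, reference_sql: str) -> bool:
    """Treat two queries as equivalent if they return the same multiset of rows."""
    pred = conn.execute(predicted_sql).fetchall()
    ref = conn.execute(reference_sql).fetchall()
    # Sort rows so queries without ORDER BY still match despite output order.
    return sorted(pred) == sorted(ref)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ada", 34), ("Ben", 28), ("Cy", 41)])

print(results_match(conn,
                    "SELECT name FROM singer WHERE age > 30",
                    "SELECT name FROM singer WHERE age >= 31"))  # True
```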

On the Automatic Generation and Simplification of Children’s Stories

  • paper_url: http://arxiv.org/abs/2310.18502
  • repo_url: None
  • paper_authors: Maria Valentini, Jennifer Weber, Jesus Salcido, Téa Wright, Eliana Colunga, Katharina Kann
  • for: This work aims toward automatically generating children's educational materials with age-appropriate simplicity.
  • methods: The authors examine the ability of several popular large language models (LLMs) to generate stories with properly adjusted lexical and readability levels.
  • results: Despite the growing capabilities of LLMs, they do not yet possess the ability to limit their vocabulary to levels appropriate for younger age groups. In a second experiment, the authors explore the ability of state-of-the-art lexical simplification models to generalize to the domain of children's stories, creating an efficient pipeline for their automatic generation.
    Abstract With recent advances in large language models (LLMs), the concept of automatically generating children's educational materials has become increasingly realistic. Working toward the goal of age-appropriate simplicity in generated educational texts, we first examine the ability of several popular LLMs to generate stories with properly adjusted lexical and readability levels. We find that, in spite of the growing capabilities of LLMs, they do not yet possess the ability to limit their vocabulary to levels appropriate for younger age groups. As a second experiment, we explore the ability of state-of-the-art lexical simplification models to generalize to the domain of children's stories and, thus, create an efficient pipeline for their automatic generation. In order to test these models, we develop a dataset of child-directed lexical simplification instances, with examples taken from the LLM-generated stories in our first experiment. We find that, while the strongest-performing current lexical simplification models do not perform as well on material designed for children due to their reliance on large language models behind the scenes, some models that still achieve fairly strong results on general data can mimic or even improve their performance on children-directed data with proper fine-tuning, which we conduct using our newly created child-directed simplification dataset.
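
The first experiment implies a readability screen over generated stories; a small sketch of that kind of check is below, using the `textstat` package and a sample story as stand-ins for the paper's actual tooling and data.

```python
import textstat

story = ("The little fox saw a red ball. "
         "He rolled it down the hill and laughed.")

grade = textstat.flesch_kincaid_grade(story)   # approximate US grade level
ease = textstat.flesch_reading_ease(story)     # higher = easier

target_grade = 2  # e.g., material intended for 7-8 year-olds (illustrative)
print(f"grade={grade:.1f}, ease={ease:.1f}, ok={grade <= target_grade}")
```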

Publicly Detectable Watermarking for Language Models

  • paper_url: http://arxiv.org/abs/2310.18491
  • repo_url: None
  • paper_authors: Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang
  • for: This work constructs the first provable watermarking scheme for language models with public detectability (verifiability).
  • methods: A private key is used for watermarking and a public key for watermark detection. The protocol is the first language-model watermarking scheme that embeds no statistical signal in generated text; instead, it directly embeds a publicly verifiable cryptographic signature using a form of rejection sampling. The construction is shown to meet strong formal security guarantees and to preserve many desirable properties of private-key watermarking schemes.
  • results: The scheme is implemented and measured empirically on open models in the 7B-parameter range; the experiments suggest that it meets the formal claims while preserving text quality.
    Abstract We construct the first provable watermarking scheme for language models with public detectability or verifiability: we use a private key for watermarking and a public key for watermark detection. Our protocol is the first watermarking scheme that does not embed a statistical signal in generated text. Rather, we directly embed a publicly-verifiable cryptographic signature using a form of rejection sampling. We show that our construction meets strong formal security guarantees and preserves many desirable properties found in schemes in the private-key watermarking setting. In particular, our watermarking scheme retains distortion-freeness and model agnosticity. We implement our scheme and make empirical measurements over open models in the 7B parameter range. Our experiments suggest that our watermarking scheme meets our formal claims while preserving text quality.
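
The core mechanism is easiest to see in miniature. Below is a toy sketch, not the paper's actual protocol: a signature produced with an Ed25519 private key is embedded bit by bit by rejection-sampling tokens whose pseudorandom hash bit matches the next signature bit, so anyone holding the public key can recover the bits and verify. The tiny fixed vocabulary, the hash-based bit function, and embedding only a 32-bit prefix of the signature are all simplifying assumptions.

```python
import hashlib
import random
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def token_bit(context: str, token: str) -> int:
    """Pseudorandom bit assigned to a token in its context."""
    return hashlib.sha256(f"{context}|{token}".encode()).digest()[0] & 1

def embed(bits, vocab):
    tokens = []
    for b in bits:
        context = " ".join(tokens)
        # Rejection sampling: only tokens whose bit matches are acceptable.
        candidates = [t for t in vocab if token_bit(context, t) == b]
        tokens.append(random.choice(candidates))  # stand-in for LM sampling
    return tokens

vocab = ["the", "a", "cat", "dog", "fox", "bird", "runs", "sleeps",
         "jumps", "sits", "fast", "slow", "red", "blue", "big", "small"]

priv = Ed25519PrivateKey.generate()
pub = priv.public_key()
message = b"generation-session-id"
signature = priv.sign(message)
bits = [(byte >> i) & 1 for byte in signature[:4] for i in range(8)]  # 32 bits

text = embed(bits, vocab)
# Public detection: recover the bits from the text alone.
recovered = [token_bit(" ".join(text[:i]), tok) for i, tok in enumerate(text)]
assert recovered == bits
pub.verify(signature, message)  # raises InvalidSignature if tampered
print(" ".join(text))
```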

PeTailor: Improving Large Language Model by Tailored Chunk Scorer in Biomedical Triple Extraction

  • paper_url: http://arxiv.org/abs/2310.18463
  • repo_url: None
  • paper_authors: Mingchen Li, M. Chen, Huixue Zhou, Rui Zhang
  • for: This work aims to improve the automatic extraction of biomedical entities and their interactions, a task limited by the scarce availability of expert-labeled standard datasets.
  • methods: The paper proposes PETAI-LOR, a retrieval-based language framework augmented with a tailored chunk scorer that adapts retrieval to the specific requirements of the language model (LM). It also introduces GM-CIHT, an expert-annotated biomedical triple extraction dataset covering the non-drug treatment and general biomedical domains.
  • results: Experiments show that PETAI-LOR achieves state-of-the-art performance on GM-CIHT.
    Abstract The automatic extraction of biomedical entities and their interaction from unstructured data remains a challenging task due to the limited availability of expert-labeled standard datasets. In this paper, we introduce PETAI-LOR, a retrieval-based language framework that is augmented by tailored chunk scorer. Unlike previous retrieval-augmented language models (LM) that retrieve relevant documents by calculating the similarity between the input sentence and the candidate document set, PETAILOR segments the sentence into chunks and retrieves the relevant chunk from our pre-computed chunk-based relational key-value memory. Moreover, in order to comprehend the specific requirements of the LM, PETAI-LOR adapt the tailored chunk scorer to the LM. We also introduce GM-CIHT, an expert annotated biomedical triple extraction dataset with more relation types. This dataset is centered on the non-drug treatment and general biomedical domain. Additionally, we investigate the efficacy of triple extraction models trained on general domains when applied to the biomedical domain. Our experiments reveal that PETAI-LOR achieves state-of-the-art performance on GM-CIHT
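
A minimal sketch of the chunk-based retrieval step described above is shown below: segment the input sentence into chunks and look up the most similar key in a precomputed relational key-value memory. Token-overlap similarity stands in for the learned tailored chunk scorer, and the memory contents are hypothetical.

```python
def similarity(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)  # Jaccard overlap

# Precomputed memory: chunk key -> relational value (a biomedical triple).
memory = {
    "aspirin reduces inflammation": ("aspirin", "treats", "inflammation"),
    "physical therapy improves mobility": ("physical therapy", "improves", "mobility"),
}

def chunks(sentence: str, size: int = 3):
    words = sentence.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

sentence = "Daily physical therapy improves mobility in stroke patients"
best_chunk = max(chunks(sentence),
                 key=lambda c: max(similarity(c, k) for k in memory))
best_key = max(memory, key=lambda k: similarity(best_chunk, k))
print(best_chunk, "->", memory[best_key])
```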

Do Not Harm Protected Groups in Debiasing Language Representation Models

  • paper_url: http://arxiv.org/abs/2310.18458
  • repo_url: None
  • paper_authors: Chloe Qinyu Zhu, Rickard Stureborg, Brandon Fain
  • for: This paper examines bias and unfair treatment in language representation models and the side effects of debiasing interventions.
  • methods: The authors propose xGAP-DEBIAS, a set of evaluations for assessing the fairness of debiasing, and examine four debiasing techniques on a real-world text classification task.
  • results: Debiasing techniques can reduce bias, but at the cost of degraded performance for all demographic groups, including the protected groups (e.g., by gender, race, or age) that the techniques aim to protect.
    Abstract Language Representation Models (LRMs) trained with real-world data may capture and exacerbate undesired bias and cause unfair treatment of people in various demographic groups. Several techniques have been investigated for applying interventions to LRMs to remove bias in benchmark evaluations on, for example, word embeddings. However, the negative side effects of debiasing interventions are usually not revealed in the downstream tasks. We propose xGAP-DEBIAS, a set of evaluations on assessing the fairness of debiasing. In this work, We examine four debiasing techniques on a real-world text classification task and show that reducing biasing is at the cost of degrading performance for all demographic groups, including those the debiasing techniques aim to protect. We advocate that a debiasing technique should have good downstream performance with the constraint of ensuring no harm to the protected group.
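
The "no harm to the protected group" criterion advocated above boils down to a per-group comparison before and after an intervention; a small sketch follows, with hypothetical group labels and predictions.

```python
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

y_true = [1, 0, 1, 1, 0, 1]
groups = ["A", "A", "A", "B", "B", "B"]
before = per_group_accuracy(y_true, [1, 0, 0, 1, 0, 1], groups)
after = per_group_accuracy(y_true, [1, 1, 0, 1, 0, 0], groups)  # post-debiasing

for g in sorted(before):
    flag = "  (harmed)" if after[g] < before[g] else ""
    print(f"group {g}: {before[g]:.2f} -> {after[g]:.2f}{flag}")
```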

T5 meets Tybalt: Author Attribution in Early Modern English Drama Using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.18454
  • repo_url: None
  • paper_authors: Rebecca M. M. Hicke, David Mimno
  • for: This paper explores the application of large language models to the literary domain, specifically authorship attribution in Early Modern English drama.
  • methods: The authors fine-tune a t5-large model and compare it with several baselines, including logistic regression, an SVM with a linear kernel, and cosine delta.
  • results: The fine-tuned t5-large model excels at attributing surprisingly short passages and outperforms all tested baselines. However, the presence of certain authors in the model's pre-training data appears to affect predictions in ways that are difficult to assess.
    Abstract Large language models have shown breakthrough potential in many NLP domains. Here we consider their use for stylometry, specifically authorship identification in Early Modern English drama. We find both promising and concerning results; LLMs are able to accurately predict the author of surprisingly short passages but are also prone to confidently misattribute texts to specific authors. A fine-tuned t5-large model outperforms all tested baselines, including logistic regression, SVM with a linear kernel, and cosine delta, at attributing small passages. However, we see indications that the presence of certain authors in the model's pre-training data affects predictive results in ways that are difficult to assess.
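
For context, here is a compact authorship-attribution baseline in the spirit of those the paper tests: character n-gram features with logistic regression. The two-author training snippets are placeholders, not the paper's corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

passages = [
    "But soft, what light through yonder window breaks?",
    "Shall I compare thee to a summer's day?",
    "Come, my Tamburlaine, we will to the wars.",
    "Was this the face that launched a thousand ships?",
]
authors = ["Shakespeare", "Shakespeare", "Marlowe", "Marlowe"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # character 2-4 grams
    LogisticRegression(max_iter=1000),
)
clf.fit(passages, authors)
print(clf.predict(["O, thou art fairer than the evening air."]))
```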

Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement

  • paper_url: http://arxiv.org/abs/2310.18440
  • repo_url: None
  • paper_authors: Rosamond Thalken, Edward H. Stiglitz, David Mimno, Matthew Wilkens
  • for: Classifying legal reasoning according to jurisprudential philosophy, a highly complex task that is challenging even for humans.
  • methods: The authors use generative language models (LMs) for this document classification task, systematically testing a variety of LMs on a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts.
  • results: Generative models perform poorly when given instructions (i.e., prompts) equal to those presented to human annotators; the best results come from fine-tuning on the annotated dataset, with the in-domain LEGAL-BERT model performing best. Applying the fine-tuned model's predictions to study historical trends in jurisprudence both aligns with prominent qualitative historical accounts and points to areas where those accounts may need refinement.
    Abstract Generative language models (LMs) are increasingly used for document class-prediction tasks and promise enormous improvements in cost and efficiency. Existing research often examines simple classification tasks, but the capability of LMs to classify on complex or specialized tasks is less well understood. We consider a highly complex task that is challenging even for humans: the classification of legal reasoning according to jurisprudential philosophy. Using a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts, we systematically test the performance of a variety of LMs. We find that generative models perform poorly when given instructions (i.e. prompts) equal to the instructions presented to human annotators through our codebook. Our strongest results derive from fine-tuning models on the annotated dataset; the best performing model is an in-domain model, LEGAL-BERT. We apply predictions from this fine-tuned model to study historical trends in jurisprudence, an exercise that both aligns with prominent qualitative historical accounts and points to areas of possible refinement in those accounts. Our findings generally sound a note of caution in the use of generative LMs on complex tasks without fine-tuning and point to the continued relevance of human annotation-intensive classification methods.
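
A minimal fine-tuning sketch for the in-domain setup described above is shown below, using the publicly available LEGAL-BERT checkpoint. The label set and training examples are placeholders; the paper's codebook-based annotation scheme is not reproduced here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

texts = ["The statute's plain text controls our interpretation.",
         "Sound policy requires weighing the practical consequences."]
labels = torch.tensor([0, 1])  # hypothetical jurisprudence classes

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)  # one gradient step, for illustration
out.loss.backward()
optimizer.step()
print(float(out.loss))
```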

Expanding the Set of Pragmatic Considerations in Conversational AI

  • paper_url: http://arxiv.org/abs/2310.18435
  • repo_url: None
  • paper_authors: S. M. Seals, Valerie L. Shalin
  • for: This paper examines why current conversational AI systems, despite considerable performance improvements, still fail to meet user expectations.
  • methods: The paper identifies several pragmatic limitations of current systems and illustrates them with examples that are syntactically appropriate but pragmatically deficient, labeling these failures "Turing Test Triggers" (TTTs).
  • results: The paper develops a taxonomy of pragmatic considerations intended to identify the pragmatic competencies a conversational AI system requires, and discusses implications for the design and evaluation of such systems.
    Abstract Despite considerable performance improvements, current conversational AI systems often fail to meet user expectations. We discuss several pragmatic limitations of current conversational AI systems. We illustrate pragmatic limitations with examples that are syntactically appropriate, but have clear pragmatic deficiencies. We label our complaints as "Turing Test Triggers" (TTTs) as they indicate where current conversational AI systems fall short compared to human behavior. We develop a taxonomy of pragmatic considerations intended to identify what pragmatic competencies a conversational AI system requires and discuss implications for the design and evaluation of conversational AI systems.

SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes

  • paper_url: http://arxiv.org/abs/2310.18431
  • repo_url: None
  • paper_authors: Adam D. Lelkes, Eric Loreaux, Tal Schuster, Ming-Jun Chen, Alvin Rajkomar
  • for: This paper provides a new dataset that frames the extraction of social and behavioral determinants of health (SDOH) from clinical notes as a natural language inference (NLI) task.
  • methods: The dataset is based on publicly available clinical notes; SDOH extraction is formulated as an NLI task, with binary textual entailment labels obtained from human raters for a cross product of social-history snippets (premises) and SDOH factors (hypotheses).
  • results: The authors evaluate both “off-the-shelf” entailment models and models fine-tuned on their data, and find that their dataset appears more challenging than commonly used NLI datasets.
    Abstract Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes, and extracting these determinants from clinical notes is a first step to help healthcare providers systematically identify opportunities to provide appropriate care and address disparities. Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data, largely due to the privacy and regulatory constraints on the use of real patients' information. This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and which we release publicly. We formulate SDOH extraction as a natural language inference (NLI) task, and provide binary textual entailment labels obtained from human raters for a cross product of a set of social history snippets as premises and SDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in that our premises and hypotheses are obtained independently. We evaluate both "off-the-shelf" entailment models as well as models fine-tuned on our data, and highlight the ways in which our dataset appears more challenging than commonly used NLI datasets.
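
A short sketch of the "off-the-shelf entailment model" evaluation described above: score a (social-history premise, SDOH hypothesis) pair with a public MNLI-trained model. The example pair is hypothetical, and the label order shown follows the roberta-large-mnli configuration (contradiction / neutral / entailment), which is an assumption worth checking against the model card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "Patient lives alone and reports difficulty affording groceries."
hypothesis = "The patient experiences food insecurity."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1).squeeze()

labels = ["contradiction", "neutral", "entailment"]
print({lab: round(float(p), 3) for lab, p in zip(labels, probs)})
```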

Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning

  • paper_url: http://arxiv.org/abs/2310.18417
  • repo_url: None
  • paper_authors: Aditi Chaudhary, Arun Sampath, Ashwin Sheshadri, Antonios Anastasopoulos, Graham Neubig
  • for: This work aims to facilitate the creation of language-teaching curricula, especially for teachers who lack access to comprehensive resources and expertise.
  • methods: The approach automatically discovers and visualizes grammar descriptions from a natural text corpus, extracting answers to questions about morphosyntax (word order, agreement, case marking, and word formation) and semantics (vocabulary). It is applied to teaching two Indian languages, Kannada and Marathi.
  • results: Language educators from schools in North America manually evaluated the extracted materials and found they have potential for lesson preparation and learner evaluation.
    Abstract One of the challenges in language teaching is how best to organize rules regarding syntax, semantics, or phonology in a meaningful manner. This not only requires content creators to have pedagogical skills, but also have that language's deep understanding. While comprehensive materials to develop such curricula are available in English and some broadly spoken languages, for many other languages, teachers need to manually create them in response to their students' needs. This is challenging because i) it requires that such experts be accessible and have the necessary resources, and ii) describing all the intricacies of a language is time-consuming and prone to omission. In this work, we aim to facilitate this process by automatically discovering and visualizing grammar descriptions. We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary). We apply this method for teaching two Indian languages, Kannada and Marathi, which, unlike English, do not have well-developed resources for second language learning. To assess the perceived utility of the extracted material, we enlist the help of language educators from schools in North America to perform a manual evaluation, who find the materials have potential to be used for their lesson preparation and learner evaluation.
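
A tiny illustration of corpus-based morphosyntax discovery of the kind described above: estimating dominant subject-verb order from dependency-annotated text. The inline CoNLL-U snippet is a hypothetical stand-in for a real treebank, and the `conllu` package is one convenient reader.

```python
from conllu import parse

data = (
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

counts = {"SV": 0, "VS": 0}
for sentence in parse(data):
    for token in sentence:
        if token["deprel"] == "nsubj":
            # Compare the subject's position with its verbal head's position.
            counts["SV" if token["id"] < token["head"] else "VS"] += 1
print(counts)  # {'SV': 1, 'VS': 0} -> evidence for subject-before-verb order
```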

FP8-LM: Training FP8 Large Language Models

  • paper_url: http://arxiv.org/abs/2310.18313
  • repo_url: https://github.com/azure/ms-amp
  • paper_authors: Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng
  • for: This paper explores FP8 low-bit data formats for efficient training of large language models (LLMs).
  • methods: The authors propose a new FP8 automatic mixed-precision framework for training LLMs that offers three levels of FP8 utilization, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning.
  • results: During training of the GPT-175B model on the H100 GPU platform, the FP8 mixed-precision framework reduced real memory usage by 42% and ran 64% faster than the widely adopted BF16 framework (Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 17%. The methodology also applies to tasks such as LLM instruction tuning and reinforcement learning with human feedback, reducing fine-tuning costs. The framework is open-sourced at https://github.com/Azure/MS-AMP.
    Abstract In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats without compromising model accuracy and requiring no changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It gradually incorporates 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 42% reduction in real memory usage but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 17%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at {https://github.com/Azure/MS-AMP}{aka.ms/MS.AMP}.
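
The numerics behind 8-bit training are easy to illustrate with per-tensor scaling: map a tensor into the representable range of the e4m3 format, cast down, and cast back. This is only a sketch of the quantization step, not the MS-AMP implementation, and it assumes a PyTorch build (2.1+) that ships the float8_e4m3fn dtype.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def fp8_roundtrip(t: torch.Tensor) -> torch.Tensor:
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (t / scale).to(torch.float8_e4m3fn)   # quantize (e.g., a gradient tensor)
    return q.to(torch.float32) * scale        # dequantize for higher-precision use

grad = torch.randn(4, 4) * 0.01
approx = fp8_roundtrip(grad)
print((grad - approx).abs().max())  # small quantization error
```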

An Approach to Automatically generating Riddles aiding Concept Attainment

  • paper_url: http://arxiv.org/abs/2310.18290
  • repo_url: None
  • paper_authors: Niharika Sri Parasa, Chaitali Diwan, Srinath Srinivasa
  • for: The paper aims to enhance learner engagement in online learning environments by applying the Concept Attainment Model to build conceptual riddles.
  • methods: The paper uses a combination of natural language processing and the Concept Attainment Model to create factual triples from learning resources, classify them based on their uniqueness to a concept, and generate riddles based on the Concept Attainment Model’s format.
  • results: Human evaluation of the generated riddles yielded encouraging results, indicating the effectiveness of the proposed approach in enhancing learner engagement.
    Abstract One of the primary challenges in online learning environments, is to retain learner engagement. Several different instructional strategies are proposed both in online and offline environments to enhance learner engagement. The Concept Attainment Model is one such instructional strategy that focuses on learners acquiring a deeper understanding of a concept rather than just its dictionary definition. This is done by searching and listing the properties used to distinguish examples from non-examples of various concepts. Our work attempts to apply the Concept Attainment Model to build conceptual riddles, to deploy over online learning environments. The approach involves creating factual triples from learning resources, classifying them based on their uniqueness to a concept into `Topic Markers' and `Common', followed by generating riddles based on the Concept Attainment Model's format and capturing all possible solutions to those riddles. The results obtained from the human evaluation of riddles prove encouraging.
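
A small sketch of the triple-to-riddle step described above: separate properties that uniquely identify a concept ("Topic Markers") from shared ones ("Common"), then render them as clues. The triples are toy examples, not the paper's extracted data.

```python
triples = [
    ("photosynthesis", "occurs in", "chloroplasts"),
    ("photosynthesis", "produces", "oxygen"),
    ("photosynthesis", "requires", "sunlight"),
    ("respiration", "produces", "energy"),
    ("respiration", "occurs in", "mitochondria"),
]

def riddle_for(concept):
    props = [(p, o) for s, p, o in triples if s == concept]
    others = {(p, o) for s, p, o in triples if s != concept}
    markers = [f"I {p} {o}" for p, o in props if (p, o) not in others]  # topic markers
    common = [f"I {p} {o}" for p, o in props if (p, o) in others]       # common properties
    clues = common + markers  # shared clues first, distinguishing ones last
    return "; ".join(clues) + ". What am I?"

print(riddle_for("photosynthesis"))
```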

MalFake: A Multimodal Fake News Identification for Malayalam using Recurrent Neural Networks and VGG-16

  • paper_url: http://arxiv.org/abs/2310.18263
  • repo_url: None
  • paper_authors: Adhish S. Sujan, Ajitha. V, Aleena Benny, Amiya M. P., V. S. Anoop
  • for: This work develops a model that effectively identifies fake news, particularly in regional Indian languages such as Malayalam.
  • methods: The approach extracts features from multiple modalities, combining recurrent neural networks for text with VGG-16 for images in a deep learning classification model.
  • results: Extracting features from multiple modalities yields more accurate fake news detection than single-modality models; to the authors' knowledge, this is the first multimodal deep learning work on misinformation in Malayalam.
    Abstract The amount of news being consumed online has substantially expanded in recent years. Fake news has become increasingly common, especially in regional languages like Malayalam, due to the rapid publication and lack of editorial standards on some online sites. Fake news may have a terrible effect on society, causing people to make bad judgments, lose faith in authorities, and even engage in violent behavior. When we take into the context of India, there are many regional languages, and fake news is spreading in every language. Therefore, providing efficient techniques for identifying false information in regional tongues is crucial. Until now, little to no work has been done in Malayalam, extracting features from multiple modalities to classify fake news. Multimodal approaches are more accurate in detecting fake news, as features from multiple modalities are extracted to build the deep learning classification model. As far as we know, this is the first piece of work in Malayalam that uses multimodal deep learning to tackle false information. Models trained with more than one modality typically outperform models taught with only one modality. Our study in the Malayalam language utilizing multimodal deep learning is a significant step toward more effective misinformation detection and mitigation.
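
A compact sketch of the text+image fusion architecture the title describes: an LSTM branch for tokenized text and a frozen VGG-16 branch for images, concatenated into a binary fake/real classifier. The hyperparameters and layer sizes are illustrative choices, not the paper's exact configuration.

```python
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import VGG16

# Text branch: integer token ids -> embedding -> LSTM.
text_in = Input(shape=(100,), name="tokens")
t = layers.Embedding(input_dim=20000, output_dim=128)(text_in)
t = layers.LSTM(64)(t)

# Image branch: pretrained VGG-16 features, frozen.
vgg = VGG16(include_top=False, pooling="avg", input_shape=(224, 224, 3))
vgg.trainable = False
img_in = Input(shape=(224, 224, 3), name="image")
v = vgg(img_in)

fused = layers.concatenate([t, v])
h = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(1, activation="sigmoid")(h)  # fake vs. real

model = Model([text_in, img_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```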

Revising with a Backward Glance: Regressions and Skips during Reading as Cognitive Signals for Revision Policies in Incremental Processing

  • paper_url: http://arxiv.org/abs/2310.18229
  • repo_url: https://github.com/briemadu/revreg
  • paper_authors: Brielen Madureira, Pelin Çelikkol, David Schlangen
  • for: This paper investigates whether human reading eye-tracking data can inform revision policies in incremental processing.
  • methods: The authors analyze regressions and skips in human reading eye-tracking data using generalized mixed-effects models.
  • results: The probability of human regressions and skips can potentially serve as a useful predictor of revisions in BiLSTM and Transformer models, with consistent results across various languages.
    Abstract In NLP, incremental processors produce output in instalments, based on incoming prefixes of the linguistic input. Some tokens trigger revisions, causing edits to the output hypothesis, but little is known about why models revise when they revise. A policy that detects the time steps where revisions should happen can improve efficiency. Still, retrieving a suitable signal to train a revision policy is an open problem, since it is not naturally available in datasets. In this work, we investigate the appropriateness of regressions and skips in human reading eye-tracking data as signals to inform revision policies in incremental sequence labelling. Using generalised mixed-effects models, we find that the probability of regressions and skips by humans can potentially serve as useful predictors for revisions in BiLSTMs and Transformer models, with consistent results for various languages.
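
A small sketch of the modelling direction described above: relate reading measures (per-token regression and skip probabilities) to whether an incremental model revised its output at that token. A plain logit stands in for the paper's generalized mixed-effects models, and the data frame is synthetic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "p_regression": rng.uniform(0, 1, n),
    "p_skip": rng.uniform(0, 1, n),
})
# Synthetic revisions that correlate with regressions, anti-correlate with skips.
logits = 2.0 * df.p_regression - 1.5 * df.p_skip - 0.5
df["revised"] = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logits))).astype(int)

model = smf.logit("revised ~ p_regression + p_skip", data=df).fit(disp=False)
print(model.params)  # positive weight on regressions, negative on skips
```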

ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.18208
  • repo_url: https://github.com/penfever/archetype
  • paper_authors: Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire
  • for: This work addresses key shortcomings of existing deep-learning approaches to semantic column type annotation (CTA): type sets fixed at training time, large per-type training-sample requirements, and high inference costs.
  • methods: The paper uses large language models for CTA and proposes ArcheType, a simple, practical method comprising context sampling, prompt serialization, model querying, and label remapping, enabling fully zero-shot annotation.
  • results: ArcheType establishes new state-of-the-art performance on both zero-shot and fine-tuned CTA, including on three new domain-specific benchmarks that the authors release along with their code and data.
    Abstract Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve column type annotation problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on both zero-shot and fine-tuned CTA, including three new domain-specific benchmarks, which we release, along with the code to reproduce our results at https://github.com/penfever/ArcheType.
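
The four stages the abstract names can be sketched end to end in a few lines; below, `query_llm` is a placeholder for any real LLM API call, and the label set is hypothetical.

```python
import difflib
import random

LABELS = ["country", "person.name", "date", "price"]

def query_llm(prompt: str) -> str:
    return "Country names"  # placeholder for a real model response

def annotate_column(values, k=5):
    sample = random.sample(values, min(k, len(values)))       # context sampling
    prompt = (f"Column values: {', '.join(sample)}.\n"         # prompt serialization
              f"Choose the column type from: {', '.join(LABELS)}.")
    raw = query_llm(prompt)                                    # model querying
    # Label remapping: snap the free-text answer to the closest allowed label.
    match = difflib.get_close_matches(raw.lower(), LABELS, n=1, cutoff=0.0)
    return match[0]

print(annotate_column(["France", "Japan", "Brazil", "Kenya", "Norway", "Chile"]))
```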

INA: An Integrative Approach for Enhancing Negotiation Strategies with Reward-Based Dialogue System

  • paper_url: http://arxiv.org/abs/2310.18207
  • repo_url: https://github.com/zishan-ai/neg
  • paper_authors: Zishan Ahmad, Suman Saurabh, Vaishakh Sreekanth Menon, Asif Ekbal, Roshni Ramnani, Anutosh Maitra
  • for: The paper proposes a novel negotiation dialogue agent for online marketplaces, designed to negotiate on price as well as other factors, such as item inclusion/exclusion in a bundle deal.
  • methods: The agent uses a new semi-automated data creation method that combines defining negotiation intents, actions, and intent-action simulation to generate potential dialogue flows; a set of novel rewards tailored to the negotiation task is used to train the Integrative Negotiation Agent (INA).
  • results: The proposed approach and reward system significantly enhance the agent's negotiation capabilities, allowing it to engage in integrative negotiations and dynamically adjust prices and item inclusions/exclusions in a bundle deal.
    Abstract In this paper, we propose a novel negotiation dialogue agent designed for the online marketplace. Our agent is integrative in nature i.e, it possesses the capability to negotiate on price as well as other factors, such as the addition or removal of items from a deal bundle, thereby offering a more flexible and comprehensive negotiation experience. We create a new dataset called Integrative Negotiation Dataset (IND) to enable this functionality. For this dataset creation, we introduce a new semi-automated data creation method, which combines defining negotiation intents, actions, and intent-action simulation between users and the agent to generate potential dialogue flows. Finally, the prompting of GPT-J, a state-of-the-art language model, is done to generate dialogues for a given intent, with a human-in-the-loop process for post-editing and refining minor errors to ensure high data quality. We employ a set of novel rewards, specifically tailored for the negotiation task to train our Negotiation Agent, termed as the Integrative Negotiation Agent (INA). These rewards incentivize the chatbot to learn effective negotiation strategies that can adapt to various contextual requirements and price proposals. By leveraging the IND, we train our model and conduct experiments to evaluate the effectiveness of our reward-based dialogue system for negotiation. Our results demonstrate that the proposed approach and reward system significantly enhance the agent's negotiation capabilities. The INA successfully engages in integrative negotiations, displaying the ability to dynamically adjust prices and negotiate the inclusion or exclusion of items in a bundle deal
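
A toy sketch of reward shaping for an integrative negotiation agent of the kind described above: reward deals that close near a target price, with a bonus for keeping desired items in the bundle. The weights, fields, and penalty value are illustrative assumptions, not the paper's reward definitions.

```python
def negotiation_reward(deal, target_price, desired_items, w_price=1.0, w_items=0.5):
    if not deal["agreed"]:
        return -1.0  # penalize failed negotiations
    # Price term in [0, 1]: 1 when the final price meets or beats the target.
    price_term = min(deal["price"] / target_price, 1.0)
    # Item term in [0, 1]: fraction of desired items kept in the bundle.
    item_term = len(set(deal["items"]) & set(desired_items)) / len(desired_items)
    return w_price * price_term + w_items * item_term

deal = {"agreed": True, "price": 90.0, "items": ["phone", "case"]}
print(negotiation_reward(deal, target_price=100.0,
                         desired_items=["phone", "case", "charger"]))
```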

Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media

  • paper_url: http://arxiv.org/abs/2310.18205
  • repo_url: https://github.com/mbzuai-nlp/x-claim
  • paper_authors: Shubham Mittal, Megha Sundriyal, Preslav Nakov
  • for: This paper aims to improve the identification of checkworthy claim spans in multilingual social media text.
  • methods: The authors build X-CLAIM, a new dataset of 7K real-world claims collected from numerous social media platforms in five Indian languages and English, and evaluate state-of-the-art encoder-only language models (e.g., XLM-R) as well as generative large language models from the GPT series via prompting.
  • results: Training on multiple languages outperforms alternative cross-lingual transfer methods such as zero-shot transfer or training on data translated from a high-resource language like English; the smaller encoder-only models outperform the GPT-series generative models on low-resource languages.
    Abstract Claim span identification (CSI) is an important step in fact-checking pipelines, aiming to identify text segments that contain a checkworthy claim or assertion in a social media post. Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem, and the scarce research on this topic so far has only focused on English. Here we aim to bridge this gap by creating a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English. We report strong baselines with state-of-the-art encoder-only language models (e.g., XLM-R) and we demonstrate the benefits of training on multiple languages over alternative cross-lingual transfer methods such as zero-shot transfer, or training on translated data, from a high-resource language such as English. We evaluate generative large language models from the GPT series using prompting methods on the X-CLAIM dataset and we find that they underperform the smaller encoder-only language models for low-resource languages.
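
Claim span identification is naturally framed as BIO token tagging with an encoder-only multilingual model, consistent with the XLM-R baselines mentioned above. In this brief sketch the classification head is randomly initialized (no fine-tuning), so its outputs are illustrative only.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-CLAIM", "I-CLAIM"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

text = "Drinking hot water cures the flu, doctors say."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1).squeeze().tolist()

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
for tok, p in zip(tokens, pred):
    print(tok, labels[p])
```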

Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN

  • paper_url: http://arxiv.org/abs/2310.18169
  • repo_url: None
  • paper_authors: Neeraj Kumar, Ankur Narang, Brejesh Lall
  • for: This work proposes a Diffusion GAN based approach (Prosodic Diff-TTS) that generates high-fidelity speech from a style description and content text in only 4 denoising steps.
  • methods: The method uses a novel conditional prosodic layer normalization to incorporate style embeddings into a multi-head attention based phoneme encoder and mel-spectrogram decoder generator architecture. The style embedding is obtained by fine-tuning a pretrained BERT model on auxiliary tasks such as pitch, speaking speed, emotion, and gender classification.
  • results: Efficacy is demonstrated on the multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generation accuracy and MOS.
    Abstract In this paper, we present a Diffusion GAN based approach (Prosodic Diff-TTS) to generate the corresponding high-fidelity speech based on the style description and content text as an input to generate speech samples within only 4 denoising steps. It leverages the novel conditional prosodic layer normalization to incorporate the style embeddings into the multi head attention based phoneme encoder and mel spectrogram decoder based generator architecture to generate the speech. The style embedding is generated by fine tuning the pretrained BERT model on auxiliary tasks such as pitch, speaking speed, emotion,gender classifications. We demonstrate the efficacy of our proposed architecture on multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generated accuracy and MOS.
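
A minimal PyTorch sketch of conditional layer normalization in the spirit described above: the scale and shift of LayerNorm are predicted from a style embedding instead of being fixed learned parameters. The dimensions are illustrative, and this is a generic rendering of the idea rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(style_dim, hidden_dim)  # style -> scale
        self.to_beta = nn.Linear(style_dim, hidden_dim)   # style -> shift

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); style: (batch, style_dim)
        gamma = self.to_gamma(style).unsqueeze(1)
        beta = self.to_beta(style).unsqueeze(1)
        return gamma * self.norm(x) + beta

cln = ConditionalLayerNorm(hidden_dim=256, style_dim=128)
x = torch.randn(2, 50, 256)   # phoneme encoder states
style = torch.randn(2, 128)   # e.g., a BERT-derived style embedding
print(cln(x, style).shape)    # torch.Size([2, 50, 256])
```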

MPrompt: Exploring Multi-level Prompt Tuning for Machine Reading Comprehension

  • paper_url: http://arxiv.org/abs/2310.18167
  • repo_url: None
  • paper_authors: Guoxin Chen, Yiming Qian, Bowen Wang, Liangzhi Li
  • for: This paper proposes a lightweight prompt-tuning method that improves the performance of pre-trained language models (PLMs) on new datasets while keeping their weights frozen.
  • methods: The method, MPrompt, uses prompts at task-specific, domain-specific, and context-specific levels to enhance comprehension of input semantics at different granularities. An independence constraint steers each domain-specific prompt to focus on information within its own domain to avoid redundancy, and a prompt generator incorporates context-related knowledge to improve contextual relevancy.
  • results: Extensive experiments on 12 benchmarks of various QA formats show an average improvement of 1.94% over state-of-the-art methods.
    Abstract The large language models have achieved superior performance on various natural language tasks. One major drawback of such approaches is they are resource-intensive in fine-tuning new datasets. Soft-prompt tuning presents a resource-efficient solution to fine-tune the pre-trained language models (PLMs) while keeping their weight frozen. Existing soft prompt methods mainly focus on designing the input-independent prompts that steer the model to fit the domain of the new dataset. Those methods often ignore the fine-grained information about the task and context of the text. In this paper, we propose a multi-level prompt tuning (MPrompt) method for machine reading comprehension. It utilizes prompts at task-specific, domain-specific, and context-specific levels to enhance the comprehension of input semantics at different granularities. We also propose an independence constraint to steer each domain-specific prompt to focus on information within its domain to avoid redundancy. Moreover, we present a prompt generator that incorporates context-related knowledge in the prompt generation to enhance contextual relevancy. We conducted extensive experiments on 12 benchmarks of various QA formats and achieved an average improvement of 1.94\% over the state-of-the-art methods.
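
A condensed sketch of multi-level soft-prompt tuning: learnable task-, domain-, and context-level prompt vectors are prepended to the frozen PLM's input embeddings. The prompt lengths and the BERT backbone are illustrative choices; the paper's independence constraint and prompt generator are not shown.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
plm = AutoModel.from_pretrained(name)
for p in plm.parameters():
    p.requires_grad = False  # backbone stays frozen; only prompts are tuned

hidden = plm.config.hidden_size
task_prompt = nn.Parameter(torch.randn(4, hidden) * 0.02)     # task-specific
domain_prompt = nn.Parameter(torch.randn(4, hidden) * 0.02)   # domain-specific
context_prompt = nn.Parameter(torch.randn(4, hidden) * 0.02)  # context-specific

enc = tokenizer("Who wrote the report?", return_tensors="pt")
tok_emb = plm.get_input_embeddings()(enc["input_ids"])        # (1, seq, hidden)
prompts = torch.cat([task_prompt, domain_prompt, context_prompt]).unsqueeze(0)
inputs_embeds = torch.cat([prompts, tok_emb], dim=1)
mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

out = plm(inputs_embeds=inputs_embeds, attention_mask=mask)
print(out.last_hidden_state.shape)  # 12 prompt positions + input tokens
```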

Elevating Code-mixed Text Handling through Auditory Information of Words

  • paper_url: http://arxiv.org/abs/2310.18155
  • repo_url: None
  • paper_authors: Mamta, Zishan Ahmad, Asif Ekbal
  • for: This work handles code-mixed textual data by incorporating auditory (phonetic) information about words.
  • methods: The approach adds a pre-training step based on masked language modelling that incorporates SOUNDEX representations (SAMLM), together with a new method of providing input data to the pre-trained model.
  • results: The approach improves robustness against adversarial attacks and achieves better classification results than popular baselines on code-mixed tasks.
    Abstract With the growing popularity of code-mixed data, there is an increasing need for better handling of this type of data, which poses a number of challenges, such as dealing with spelling variations, multiple languages, different scripts, and a lack of resources. Current language models face difficulty in effectively handling code-mixed data as they primarily focus on the semantic representation of words and ignore the auditory phonetic features. This leads to difficulties in handling spelling variations in code-mixed text. In this paper, we propose an effective approach for creating language models for handling code-mixed textual data using auditory information of words from SOUNDEX. Our approach includes a pre-training step based on masked-language-modelling, which includes SOUNDEX representations (SAMLM) and a new method of providing input data to the pre-trained model. Through experimentation on various code-mixed datasets (of different languages) for sentiment, offensive and aggression classification tasks, we establish that our novel language modeling approach (SAMLM) results in improved robustness towards adversarial attacks on code-mixed classification tasks. Additionally, our SAMLM based approach also results in better classification results over the popular baselines for code-mixed tasks. We use the explainability technique, SHAP (SHapley Additive exPlanations) to explain how the auditory features incorporated through SAMLM assist the model to handle the code-mixed text effectively and increase robustness against adversarial attacks \footnote{Source code has been made available on \url{https://github.com/20118/DefenseWithPhonetics}, \url{https://www.iitp.ac.in/~ai-nlp-ml/resources.html\#Phonetics}.
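
A quick illustration of the phonetic signal SAMLM builds on: Soundex collapses spelling variants that sound alike to the same code, which helps with the romanized spelling variation common in code-mixed text. The `jellyfish` library provides one implementation, and the Hinglish spelling variants below are illustrative.

```python
import jellyfish

variants = ["pyaar", "pyar", "piyar"]   # romanized spellings of the same word
print([jellyfish.soundex(w) for w in variants])  # identical codes for all three

def augment_with_soundex(tokens):
    """Pair each token with its Soundex code, as auxiliary model input."""
    return [(tok, jellyfish.soundex(tok)) for tok in tokens]

print(augment_with_soundex("movie was bahut accha".split()))
```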

Disentangled Representation Learning with Large Language Models for Text-Attributed Graphs

  • paper_url: http://arxiv.org/abs/2310.18152
  • repo_url: None
  • paper_authors: Yijian Qin, Xin Wang, Ziwei Zhang, Wenwu Zhu
  • for: This paper addresses the shortcomings of existing large language model (LLM) approaches to text-attributed graphs (TAGs), improving the LLM's ability to understand and reason over the complex structural relationships within such graphs.
  • methods: The proposed Disentangled Graph-Text Learner (DGTL) model incorporates graph structure through tailored disentangled graph neural network (GNN) layers, enabling frozen pre-trained LLMs to capture the intricate relationships hidden in text-attributed graphs from multiple structural factors.
  • results: Experiments show that DGTL achieves superior or comparable performance over state-of-the-art baselines, and can offer natural-language explanations for its predictions, significantly enhancing model interpretability.
    Abstract Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs such as citation networks, e-commerce networks and social networks has attracted considerable attention in the web community. Recently, large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks. However, the existing works focus on harnessing the potential of LLMs solely relying on prompts to convey graph structure information to LLMs, thus suffering from insufficient understanding of the complex structural relationships within TAGs. To address this problem, in this paper we present the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model incorporates graph structure information through tailored disentangled graph neural network (GNN) layers, enabling LLMs to capture the intricate relationships hidden in text-attributed graphs from multiple structural factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing computational costs and allowing much more flexibility in combining with different LLM models. Experimental evaluations demonstrate the effectiveness of the proposed DGTL model on achieving superior or comparable performance over state-of-the-art baselines. Additionally, we also demonstrate that our DGTL model can offer natural language explanations for predictions, thereby significantly enhancing model interpretability.
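
A compact sketch of disentangled neighborhood aggregation: each of K channels ("structural factors") has its own projection, and messages are averaged over neighbors per channel. This illustrates only the disentangling idea, not the full DGTL architecture, and the toy graph is hypothetical.

```python
import torch
import torch.nn as nn

class DisentangledLayer(nn.Module):
    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.channels = nn.ModuleList([nn.Linear(dim, dim // k) for _ in range(k)])

    def forward(self, x, adj):
        # x: (nodes, dim); adj: (nodes, nodes) with self-loops.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        agg = adj @ x / deg  # mean over neighbors
        # Each channel projects the aggregate into its own factor subspace.
        return torch.cat([torch.relu(c(agg)) for c in self.channels], dim=-1)

x = torch.randn(5, 64)                            # node text embeddings
adj = (torch.rand(5, 5) > 0.5).float()
adj = (((adj + adj.T) > 0).float() + torch.eye(5)).clamp(max=1)  # symmetric + self-loops
print(DisentangledLayer(64, k=4)(x, adj).shape)   # torch.Size([5, 64])
```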

DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues

  • paper_url: http://arxiv.org/abs/2310.18130
  • repo_url: https://github.com/zidixiu/delphi
  • paper_authors: David Q. Sun, Artem Abzaliev, Hadas Kotek, Zidi Xiu, Christopher Klein, Jason D. Williams
  • for: This paper aims to systematically examine how large language models (LLMs) respond to questions related to ongoing debates and controversial issues.
  • methods: The authors propose a novel construction of a controversial questions dataset, expanding upon the publicly released Quora Question Pairs Dataset, and evaluate different LLMs on a subset of it to understand how they handle controversial issues and the stances they adopt.
  • results: The research reveals challenges concerning knowledge recency, safety, fairness, and bias in LLMs' interaction with controversial issues, and contributes to our understanding of how these models handle complex societal debates.
    Abstract Controversy is a reflection of our zeitgeist, and an important aspect to any discourse. The rise of large language models (LLMs) as conversational systems has increased public reliance on these systems for answers to their various questions. Consequently, it is crucial to systematically examine how these models respond to questions that pertaining to ongoing debates. However, few such datasets exist in providing human-annotated labels reflecting the contemporary discussions. To foster research in this area, we propose a novel construction of a controversial questions dataset, expanding upon the publicly released Quora Question Pairs Dataset. This dataset presents challenges concerning knowledge recency, safety, fairness, and bias. We evaluate different LLMs using a subset of this dataset, illuminating how they handle controversial issues and the stances they adopt. This research ultimately contributes to our understanding of LLMs' interaction with controversial issues, paving the way for improvements in their comprehension and handling of complex societal debates.

Mind the Gap: Automated Corpus Creation for Enthymeme Detection and Reconstruction in Learner Arguments

  • paper_url: http://arxiv.org/abs/2310.18098
  • repo_url: https://github.com/webis-de/emnlp-23
  • paper_authors: Maja Stahl, Nick Düsterhus, Mei-Hua Chen, Henning Wachsmuth
  • for: This paper aims to help learners improve the quality of their written arguments, specifically by detecting and reconstructing missing argumentative content.
  • methods: Two new tasks are introduced: identifying gaps in arguments (enthymeme detection) and filling such gaps (enthymeme reconstruction). Corpora for both tasks are created automatically by deleting argumentative discourse units (ADUs) that are central to an argument and its quality while maintaining the text's naturalness; based on the ICLEv3 corpus of argumentative learner essays, 40,089 argument instances are created.
  • results: Manual studies provide evidence that the corpus creation process achieves the desired quality reduction and yields arguments similarly natural to those written by learners; first baseline approaches to enthymeme detection and reconstruction demonstrate the corpus' usefulness.
    Abstract Writing strong arguments can be challenging for learners. It requires to select and arrange multiple argumentative discourse units (ADUs) in a logical and coherent way as well as to decide which ADUs to leave implicit, so called enthymemes. However, when important ADUs are missing, readers might not be able to follow the reasoning or understand the argument's main point. This paper introduces two new tasks for learner arguments: to identify gaps in arguments (enthymeme detection) and to fill such gaps (enthymeme reconstruction). Approaches to both tasks may help learners improve their argument quality. We study how corpora for these tasks can be created automatically by deleting ADUs from an argumentative text that are central to the argument and its quality, while maintaining the text's naturalness. Based on the ICLEv3 corpus of argumentative learner essays, we create 40,089 argument instances for enthymeme detection and reconstruction. Through manual studies, we provide evidence that the proposed corpus creation process leads to the desired quality reduction, and results in arguments that are similarly natural to those written by learners. Finally, first baseline approaches to enthymeme detection and reconstruction demonstrate the corpus' usefulness.
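
A simplified sketch of the corpus-creation idea: delete one central ADU from a segmented argument to produce a detection/reconstruction instance. The placeholder centrality scorer and the toy argument stand in for the paper's quality-aware selection.

```python
adus = [
    "School uniforms should be mandatory.",                        # claim
    "Uniforms reduce visible economic differences among pupils.",  # central support
    "Some schools already require them.",                          # minor support
]

def centrality(i: int, adu: str) -> float:
    # Placeholder: prefer the main supporting premise (index 1 in this toy case).
    return 1.0 if i == 1 else 0.1

target = max(range(len(adus)), key=lambda i: centrality(i, adus[i]))
instance = {
    "text_with_gap": " ".join(a for i, a in enumerate(adus) if i != target),
    "gap_position": target,       # label for enthymeme detection
    "missing_adu": adus[target],  # target for enthymeme reconstruction
}
print(instance)
```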

Lost in Translation – Multilingual Misinformation and its Evolution

  • paper_url: http://arxiv.org/abs/2310.18089
  • repo_url: None
  • paper_authors: Dorian Quelle, Calvin Cheng, Alexandre Bovet, Scott A. Hale
  • for: This study investigates the prevalence and dynamics of misinformation spreading across languages, analyzing over 250,000 unique fact-checks spanning 95 languages.
  • methods: Fact-checks serve as a proxy for the spread of misinformation. The fact-checks are represented with multilingual sentence embeddings, semantically similar claims are clustered, and the connected components and shortest paths connecting different versions of a claim are analyzed.
  • results: While the majority of misinformation claims are only fact-checked once, 11.7% (more than 21,000 claims) are checked multiple times. 33% of repeated claims cross linguistic boundaries, suggesting that some misinformation permeates language barriers, but spreading patterns exhibit strong homophily: misinformation is more likely to spread within the same language. Claims gradually drift over time and undergo greater alteration when traversing languages.
    Abstract Misinformation and disinformation are growing threats in the digital age, spreading rapidly across languages and borders. This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of over 250,000 unique fact-checks spanning 95 languages. First, we find that while the majority of misinformation claims are only fact-checked once, 11.7%, corresponding to more than 21,000 claims, are checked multiple times. Using fact-checks as a proxy for the spread of misinformation, we find 33% of repeated claims cross linguistic boundaries, suggesting that some misinformation permeates language barriers. However, spreading patterns exhibit strong homophily, with misinformation more likely to spread within the same language. To study the evolution of claims over time and mutations across languages, we represent fact-checks with multilingual sentence embeddings and cluster semantically similar claims. We analyze the connected components and shortest paths connecting different versions of a claim finding that claims gradually drift over time and undergo greater alteration when traversing languages. Overall, this novel investigation of multilingual misinformation provides key insights. It quantifies redundant fact-checking efforts, establishes that some claims diffuse across languages, measures linguistic homophily, and models the temporal and cross-lingual evolution of claims. The findings advocate for expanded information sharing between fact-checkers globally while underscoring the importance of localized verification.
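
A short sketch of the claim-matching step described above: embed fact-checked claims with a multilingual sentence encoder and cluster semantically similar ones across languages. The model choice and distance threshold are illustrative, and the three claims are toy examples.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

claims = [
    "5G towers spread the coronavirus.",
    "Las antenas 5G propagan el coronavirus.",  # Spanish version of the same claim
    "Drinking bleach cures COVID-19.",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode(claims, normalize_embeddings=True)

clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
)
print(clustering.fit_predict(emb))  # the two 5G claims should share a cluster
```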

A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports

  • paper_url: http://arxiv.org/abs/2310.18073
  • repo_url: None
  • paper_authors: Xinyu Wang, Lin Gui, Yulan He
  • for: This paper studies table of contents (ToC) extraction and proposes a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning 2001 to 2022; these reports pose significant challenges due to their diverse structures and extensive length.
  • methods: The proposed framework extracts a ToC in three steps: (1) constructing an initial tree of text blocks based on reading order and font sizes; (2) modelling each tree node (text block) independently using contextual information captured in its node-centric subtree; (3) modifying the original tree by taking an appropriate action on each node (keep, delete, or move).
  • results: This construction-modelling-modification (CMM) process eliminates the need for pairwise modelling of section headings, handles documents of any structure and length, and outperforms the previous state-of-the-art baseline with a fraction of the running time.
    Abstract Table of contents (ToC) extraction centres on structuring documents in a hierarchical manner. In this paper, we propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to 2022. These reports pose significant challenges due to their diverse structures and extensive length. To address these challenges, we propose a new framework for Toc extraction, consisting of three steps: (1) Constructing an initial tree of text blocks based on reading order and font sizes; (2) Modelling each tree node (or text block) independently by considering its contextual information captured in node-centric subtree; (3) Modifying the original tree by taking appropriate action on each tree node (Keep, Delete, or Move). This construction-modelling-modification (CMM) process offers several benefits. It eliminates the need for pairwise modelling of section headings as in previous approaches, making document segmentation practically feasible. By incorporating structured information, each section heading can leverage both local and long-distance context relevant to itself. Experimental results show that our approach outperforms the previous state-of-the-art baseline with a fraction of running time. Our framework proves its scalability by effectively handling documents of any length.
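
Step (1) of the framework can be sketched with a simple stack: build an initial heading tree from text blocks in reading order, using font size as the level cue (larger font implies a higher level). The block list below is a hypothetical example of parsed PDF output.

```python
blocks = [  # (text, font_size), in reading order
    ("Annual Report 2022", 24),
    ("1 Environment", 18),
    ("1.1 Emissions", 14),
    ("2 Governance", 18),
    ("2.1 Board Structure", 14),
]

root = {"text": "<root>", "size": float("inf"), "children": []}
stack = [root]
for text, size in blocks:
    node = {"text": text, "size": size, "children": []}
    while stack[-1]["size"] <= size:   # pop until a larger-font ancestor is found
        stack.pop()
    stack[-1]["children"].append(node)
    stack.append(node)

def show(node, depth=0):
    for child in node["children"]:
        print("  " * depth + child["text"])
        show(child, depth + 1)

show(root)
```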
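
A minimal sketch of step (1) of the CMM framework described above, building the initial tree from text blocks in reading order with a font-size stack; the block representation is an assumed simplification, not the authors' implementation.

```python
# Step (1) of CMM as a stack-based tree construction: blocks arrive in
# reading order, and a larger font size is taken to mean a higher-level
# heading. The (text, font_size) block format is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    font_size: float
    children: list = field(default_factory=list)

def build_initial_tree(blocks):
    """blocks: (text, font_size) pairs in reading order."""
    root = Node("ROOT", float("inf"))
    stack = [root]
    for text, size in blocks:
        node = Node(text, size)
        # Pop until the stack top has a strictly larger font size,
        # i.e. a plausible parent heading.
        while stack[-1].font_size <= size:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root

tree = build_initial_tree([
    ("Annual Report 2022", 24),
    ("1 Environment", 18),
    ("1.1 Emissions", 14),
    ("2 Governance", 18),
])
```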

Multi-grained Evidence Inference for Multi-choice Reading Comprehension

  • paper_url: http://arxiv.org/abs/2310.18070
  • repo_url: None
  • paper_authors: Yilin Zhao, Hai Zhao, Sufeng Duan
  • for: 多选机器阅读理解(MRC)是一项具有挑战性的任务,需要机器能够根据提供的选项回答问题。
  • methods: 我们提出了一种新的通用模型增强方法,名为多粒度证据推理器(Mugen),用于弥补机器无法直接从给定的冗余、嘈杂文段中提取准确证据的不足。Mugen 在粗、中、细三种粒度上提取证据,并将证据与原始文段集成。
  • results: 我们的方法在四个多选MRC基准上实现了显著且一致的性能提升。
    Abstract Multi-choice Machine Reading Comprehension (MRC) is a major and challenging task that requires machines to answer questions according to provided options. Answers in multi-choice MRC cannot be directly extracted from the given passages; they essentially require machines capable of reasoning from accurately extracted evidence. However, the critical evidence may be as simple as a single word or phrase, while it is hidden in a redundant, noisy passage with multiple linguistic hierarchies, from phrase and fragment to sentence and the entire passage. We thus propose a novel general-purpose model enhancement that integrates multi-grained evidence comprehensively, named Multi-grained evidence inferencer (Mugen), to compensate for this limitation. Mugen extracts three granularities of evidence: coarse-, middle- and fine-grained evidence, and integrates the evidence with the original passages, achieving significant and consistent performance improvements on four multi-choice MRC benchmarks.
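
A toy sketch of what extracting evidence at three granularities can look like, using a simple token-overlap score as a stand-in; Mugen itself uses learned components, so everything below is an illustrative assumption.

```python
# Illustrative extraction of coarse (sentence), middle (fragment), and fine
# (phrase) evidence for a multi-choice question, scored by token overlap
# with question + option. The heuristic scorer is a stand-in assumption.
import re

def overlap(span, query):
    ts, tq = set(span.lower().split()), set(query.lower().split())
    return len(ts & tq) / (len(tq) or 1)

def top_span(spans, query):
    return max(spans, key=lambda s: overlap(s, query))

passage = ("The committee met on Monday. It postponed the vote because "
           "two members were absent. A new date was not announced.")
query = "Why was the vote postponed? because members were absent"

sentences = re.split(r"(?<=[.!?])\s+", passage)           # coarse
fragments = [f for s in sentences for f in s.split(",")]  # middle
phrases = [" ".join(ws) for s in sentences                # fine: 3-gram windows
           for ws in zip(*(s.split()[i:] for i in range(3)))]

evidence = {
    "coarse": top_span(sentences, query),
    "middle": top_span(fragments, query),
    "fine": top_span(phrases, query),
}
print(evidence)
```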

“Honey, Tell Me What’s Wrong”, Global Explanation of Textual Discriminative Models through Cooperative Generation

  • paper_url: http://arxiv.org/abs/2310.18063
  • repo_url: None
  • paper_authors: Antoine Chaffin, Julien Delaunay
  • for: 这篇论文的目的是提出一种全球和模型无关的解释方法,用于文本分类器中。
  • methods: 这种方法基于合作生成的文本,不需要输入数据集,可以在数据缺失时提供解释。
  • results: 实验表明,这种方法能够准确描述分类器在输入空间中的行为,并且当输入数据并非针对所研究的模型时,其表现优于依赖输入数据的方法。
    Abstract The ubiquity of complex machine learning has raised the importance of model-agnostic explanation algorithms. These methods create artificial instances by slightly perturbing real instances, capturing shifts in model decisions. However, such methods rely on initial data and only provide explanations of the decisions for these instances. To tackle these problems, we propose Therapy, the first global and model-agnostic explanation method adapted to text that requires no input dataset. Therapy generates texts following the distribution learned by a classifier through cooperative generation. Because it does not rely on initial samples, it can generate explanations even when data is absent (e.g., for confidentiality reasons). Moreover, unlike existing methods that combine multiple local explanations into a global one, Therapy offers a global overview of the model behavior on the input space. Our experiments show that, despite using no input data to generate samples, Therapy provides insightful information about the features used by the classifier that is competitive with methods relying on input samples, and outperforms them when the input samples are not specific to the studied model.
    摘要 复杂机器学习的普及提高了模型无关解释算法的重要性。这些方法通过轻微扰动真实实例来创造人工实例,捕捉模型决策的变化。然而,这些方法依赖初始数据,且只能解释针对这些数据的决策。为解决这些问题,我们提出了 Therapy:首个适用于文本、无需输入数据集的全局模型无关解释方法。Therapy 通过合作生成,产生遵循分类器所学分布的文本。由于不依赖初始样本,即使在数据缺失时(例如出于保密原因)也能生成解释。此外,不同于将多个局部解释合并为全局解释的现有方法,Therapy 提供了模型在输入空间上行为的全局概览。实验表明,尽管不使用任何输入数据来生成样本,Therapy 所提供的分类器特征信息与依赖输入样本的方法相比具有竞争力,并在输入样本并非针对所研究模型时表现更优。
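
A hedged sketch of cooperative generation in the spirit of Therapy: a language model proposes next tokens and a discriminative classifier re-weights them so that sampled texts follow the distribution the classifier has learned. The model names, top-k size, and greedy selection are illustrative assumptions.

```python
# Cooperative generation sketch: GPT-2 proposes top-k next tokens, a
# sentiment classifier picks the one it scores highest for the target
# class. A simplification of Therapy, not the authors' implementation.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
clf_name = "distilbert-base-uncased-finetuned-sst-2-english"
clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)

text, target = "The movie was", 1  # steer toward the "positive" class

with torch.no_grad():
    for _ in range(10):  # extend the text by ten tokens
        ids = lm_tok(text, return_tensors="pt").input_ids
        proposals = torch.topk(lm(ids).logits[0, -1], k=20).indices  # LM's vote
        scores = []
        for tok_id in proposals:
            cand = text + lm_tok.decode([int(tok_id)])
            probs = clf(**clf_tok(cand, return_tensors="pt")).logits.softmax(-1)
            scores.append(probs[0, target].item())  # classifier's vote
        best = proposals[int(torch.tensor(scores).argmax())]
        text += lm_tok.decode([int(best)])

print(text)  # texts the classifier scores highly expose the features it uses
```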

ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese

  • paper_url: http://arxiv.org/abs/2310.18046
  • repo_url: https://github.com/kvt0012/viclevr
  • paper_authors: Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen
  • for: 本研究旨在提升越南语视觉问答(VQA)系统的性能,并探讨当前视觉推理系统的优势与局限。
  • methods: 本研究使用了一个新的多模态混合方法,称为 PhoVIT,该方法可以基于问题来确定图像中的对象。PhoVIT使用了 transformers 来同时进行文本和视觉数据的推理,并在早期模型阶段将两种模式融合。
  • results: 实验结果显示,我们的提议的模型在四个评价指标中均达到了当前最佳性能。
    Abstract In recent years, Visual Question Answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English. To address this, we introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions. The architecture effectively employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early model stage. The experimental findings demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics. The accompanying code and dataset have been made publicly accessible at \url{https://github.com/kvt0012/ViCLEVR}. This provision seeks to stimulate advancements within the research community, fostering the development of more multimodal fusion algorithms, specifically tailored to address the nuances of low-resource languages, exemplified by Vietnamese.
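
The early-fusion design described for PhoVIT can be sketched as a single transformer encoder attending over concatenated text and image tokens; all dimensions and the answer-classification head below are illustrative assumptions.

```python
# Minimal early multimodal fusion: project text token embeddings and image
# patch features into a shared space, concatenate, and encode jointly.
# Dimensions are illustrative assumptions, not PhoVIT's configuration.
import torch
import torch.nn as nn

class EarlyFusionVQA(nn.Module):
    def __init__(self, vocab=30000, d=256, n_answers=100):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)
        self.img_proj = nn.Linear(2048, d)  # e.g. CNN patch features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(d, n_answers)

    def forward(self, token_ids, patch_feats):
        t = self.text_emb(token_ids)                    # (B, Lt, d)
        v = self.img_proj(patch_feats)                  # (B, Lv, d)
        fused = self.encoder(torch.cat([t, v], dim=1))  # early fusion
        return self.head(fused.mean(dim=1))             # answer logits

model = EarlyFusionVQA()
logits = model(torch.randint(0, 30000, (2, 12)), torch.randn(2, 49, 2048))
```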

On General Language Understanding

  • paper_url: http://arxiv.org/abs/2310.18038
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: David Schlangen
  • for: 这篇论文的目的是为了探讨人工智能语言处理领域内的语言理解问题,以及现有测量模型质量的方法是否具有足够的有效性。
  • methods: 这篇论文勾勒了一个理解模型的轮廓,将语言理解描述为一种多面向的现象,兼含个体过程与社会过程。
  • results: 这篇论文的结论是:不同的语言使用情境类型具有不同的特征;语言理解是一种多面向的现象,需要同时考虑个体过程与社会过程;理解指标的选择标志着基准测试的局限,也开启了对NLP使用的伦理考量。
    Abstract Natural Language Processing prides itself on being an empirically-minded, if not outright empiricist field, and yet lately it seems to get itself into essentialist debates on issues of meaning and measurement ("Do Large Language Models Understand Language, And If So, How Much?"). This is not by accident: Here, as everywhere, the evidence underspecifies the understanding. As a remedy, this paper sketches the outlines of a model of understanding, which can ground questions of the adequacy of current methods of measurement of model quality. The paper makes three claims: A) That different language use situation types have different characteristics, B) That language understanding is a multifaceted phenomenon, bringing together individualistic and social processes, and C) That the choice of Understanding Indicator marks the limits of benchmarking, and the beginnings of considerations of the ethics of NLP use.
    摘要 自然语言处理(NLP)自视为一个经验主义的,甚至是彻底经验论的领域,然而它最近却陷入了关于意义与测量的本质主义争论("大语言模型是否理解语言?若是,理解多少?")。这并非偶然:在这里,正如在任何领域一样,证据都不足以完全确定理解。作为补救,本文勾勒了一个理解模型的轮廓,用以支撑关于现有模型质量测量方法是否充分的讨论。论文提出三个主张:A)不同的语言使用情境类型具有不同的特征;B)语言理解是一种多面向的现象,同时涉及个体过程与社会过程;C)理解指标的选择标志着基准测试的局限,也标志着NLP使用伦理考量的开端。

SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2310.18023
  • repo_url: None
  • paper_authors: Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anstasopoulos, Marcos Zampieri
  • for: 这篇论文旨在提出一个新的三语言语码混合数据集 SentMix-3L,用于训练情感分析计算模型。
  • methods: 该论文在 SentMix-3L 上对多种模型进行了全面评估,包括以零样本提示方式使用的 GPT-3.5。
  • results: 研究发现,使用 GPT-3.5 的零样本提示方法在 SentMix-3L 上超越了所有基于 transformer 的模型。
    Abstract Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing involving multiple languages, most available datasets contain code-mixed data between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data between three languages: Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L and show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.
    摘要 语码混合是一种被广泛研究的语言现象,指在文本或语音中混合两种或更多语言。已有多个数据集以训练语码混合计算模型为目标而构建。尽管多语言混合的现象非常常见,但现有数据集大多只包含两种语言之间的混合。在本文中,我们介绍了一个新的情感分析数据集 SentMix-3L,包含孟加拉语、英语和印地语三种语言的语码混合数据。我们基于 SentMix-3L 进行了全面评估,并表明使用 GPT-3.5 的零样本提示在 SentMix-3L 上的表现超过了所有基于 transformer 的模型。
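
A minimal sketch of the zero-shot prompting setup reported for GPT-3.5; the exact prompt wording and label set are assumptions, not the paper's template.

```python
# Zero-shot sentiment prompting for code-mixed text via the OpenAI API.
# The prompt is an illustrative assumption, not the paper's template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_sentiment(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Classify the sentiment of this Bangla-English-Hindi "
                        "code-mixed sentence as positive, negative, or "
                        f"neutral. Answer with one word.\n\n{text}"),
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(zero_shot_sentiment("Movie ta khub bhalo chilo, really enjoyed it!"))
```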

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

  • paper_url: http://arxiv.org/abs/2310.18018
  • repo_url: None
  • paper_authors: Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre
  • for: 本文认为,使用标注基准对经典NLP任务进行的评估正面临严重问题,尤其是最恶劣的一类数据污染。
  • methods: 本文定义了不同级别的数据污染,并呼吁社区共同努力,包括开发自动和半自动方法来检测基准数据是否曾暴露给模型训练,以及建议对结论受数据污染影响的论文加以标记。
  • results: 本文指出,当一个大语言模型(LLM)在某一基准的测试划分上训练,然后又在同一基准上评估时,会导致模型性能被高估,从而使错误的科学结论得以发表,而正确的结论反遭弃置。
    Abstract In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a target benchmark and associated task with respect to their non-contaminated counterparts. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination.
    摘要 在这篇立场论文中,我们认为使用标注基准对自然语言处理(NLP)任务进行的传统评估正面临严重问题。最恶劣的数据污染发生在大语言模型(LLM)在某一基准的测试划分上训练,然后又在同一基准上评估时。由于难以测量,该问题的严重程度尚不可知。污染会导致被污染模型在目标基准及相关任务上的性能相对其未受污染的对应模型被高估。其后果可能非常有害:错误的科学结论被发表,而正确的结论反遭弃置。本立场论文定义了不同级别的数据污染,并呼吁社区共同努力,包括开发自动与半自动手段来检测基准数据是否曾暴露给模型,以及对结论受数据污染影响的论文加以标记的建议。
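
One semi-automatic contamination check of the kind the paper calls for can be sketched as n-gram overlap between a benchmark's test split and a training corpus; the 8-gram size and 0.5 flagging threshold below are common heuristic assumptions, not values from the paper.

```python
# Heuristic contamination check: flag a test example if a large fraction of
# its 8-grams also appear verbatim in the training corpus.
def ngrams(text: str, n: int = 8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_examples, training_corpus, n=8, thresh=0.5):
    corpus_grams = ngrams(training_corpus, n)
    flagged = 0
    for ex in test_examples:
        grams = ngrams(ex, n)
        if grams and len(grams & corpus_grams) / len(grams) >= thresh:
            flagged += 1  # likely seen during training
    return flagged / len(test_examples)

corpus = "the quick brown fox jumps over the lazy dog near the river bank today"
tests = ["the quick brown fox jumps over the lazy dog near the river",
         "completely novel sentence about large language model evaluation pitfalls here"]
print(contamination_rate(tests, corpus))  # 0.5: first example is flagged
```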

Does Role-Playing Chatbots Capture the Character Personalities? Assessing Personality Traits for Role-Playing Chatbots

  • paper_url: http://arxiv.org/abs/2310.17976
  • repo_url: https://github.com/LC1332/Chat-Haruhi-Suzumiya
  • paper_authors: Xintao Wang, Quan Tu, Yaying Fei, Ziang Leng, Cheng Li
  • for: 这篇论文旨在探讨如何使用大规模预训练语言模型来评估角色扮演聊天机器人的人格特质。
  • methods: 该论文提出了一种开放式访谈方法,用于评估角色扮演聊天机器人的人格特质,并对32个使用ChatHaruhi库创建的角色扮演聊天机器人在大五人格与MBTI两个维度上进行了评估。
  • results: 研究结果显示,基于大语言模型的角色扮演聊天机器人能够有效表现出对应角色的人格特质,与人类感知人格的一致率为82.8%。此外,论文还提出了塑造聊天机器人人格的潜在策略。因此,这篇论文为角色扮演聊天机器人研究提供了一项基础性工作。
    Abstract The emergence of large-scale pretrained language models has revolutionized the capabilities of new AI applications, especially in the realm of crafting chatbots with distinct personas. Given the "stimulus-response" nature of chatbots, this paper unveils an innovative open-ended interview-style approach for personality assessment of role-playing chatbots, which offers a richer comprehension of their intrinsic personalities. We conduct personality assessments on 32 role-playing chatbots created by the ChatHaruhi library, across both the Big Five and MBTI dimensions, and measure their alignment with human perception. Evaluation results underscore that modern role-playing chatbots based on LLMs can effectively portray the personality traits of corresponding characters, with an alignment rate of 82.8% compared with human-perceived personalities. In addition, we suggest potential strategies for shaping chatbots' personalities. Hence, this paper serves as a cornerstone study for role-playing chatbots at the intersection of computational linguistics and psychology. Our resources are available at https://github.com/LC1332/Chat-Haruhi-Suzumiya
    摘要 大规模预训练语言模型的出现为新的人工智能应用带来了革命性变化,特别是在打造具有鲜明人设的聊天机器人领域。鉴于聊天机器人的"刺激-应答"性质,本文提出了一种创新的开放式访谈人格测评方法,可以更深入地了解角色扮演聊天机器人的内在人格特质。我们对使用ChatHaruhi库创建的32个角色扮演聊天机器人进行了人格测评,涵盖大五人格和MBTI两个维度,并与人类感知进行比较。结果显示,现代基于LLM的角色扮演聊天机器人能够有效表达对应角色的人格特质,与人类感知人格的一致率为82.8%。此外,我们还提出了塑造聊天机器人人格的潜在策略。因此,本文是计算语言学与心理学交叉领域中关于角色扮演聊天机器人的一项基础性研究。我们的资源可在 https://github.com/LC1332/Chat-Haruhi-Suzumiya 获取。
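
A sketch of the open-ended, interview-style assessment: pose interview questions to a role-playing chatbot, then have a judge model rate the transcript on Big Five dimensions. The question wording, judging prompt, and use of the OpenAI API are illustrative assumptions.

```python
# Interview-style personality probing sketch; prompts are assumptions,
# not the paper's protocol.
from openai import OpenAI

client = OpenAI()
QUESTIONS = ["How do you usually spend a free afternoon?",
             "Tell me about a time you disagreed with a friend."]
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def chat(messages):
    r = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return r.choices[0].message.content

persona = "You are Haruhi Suzumiya. Stay in character."
transcript = []
for q in QUESTIONS:
    a = chat([{"role": "system", "content": persona},
              {"role": "user", "content": q}])
    transcript.append(f"Q: {q}\nA: {a}")

judge_prompt = ("Rate the interviewee 1-5 on each Big Five trait "
                f"({', '.join(TRAITS)}), one 'trait: score' per line.\n\n"
                + "\n\n".join(transcript))
print(chat([{"role": "user", "content": judge_prompt}]))
```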

Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed Languages

  • paper_url: http://arxiv.org/abs/2310.17953
  • repo_url: None
  • paper_authors: Peng Xie, XingYuan Liu, ZiWei Chen, Kani Chen, Yang Wang
  • for: 这项研究旨在提升自动语音识别(ASR)在小语种和混合语言场景下的稳健性和准确率。
  • methods: 研究使用自行收集的粤英混合音频数据集(MCE)对 Whisper 模型进行微调,得到 Whisper-MCE,并提出了一种新的评价机制来衡量模型在小语种和混合语言上的效果。
  • results: 研究表明,与基线 whisper-large-v2 模型相比,Whisper-MCE 能更准确地捕捉原始音频的内容,取得更高的识别精度和更快的识别速度,在混合语言任务中表现尤为出色。
    Abstract Recently, Whisper has approached human-level robustness and accuracy in English automatic speech recognition (ASR), while in minor-language and mixed-language speech recognition there remains a compelling need for further improvement. In this work, we present the impressive results of Whisper-MCE, our finetuned Whisper model, which was trained on our self-collected Mixed Cantonese and English audio dataset (MCE). Meanwhile, considering that word error rate (WER) poses challenges when evaluating effectiveness in minor-language and mixed-language contexts, we present a novel rating mechanism. By comparing our model to the baseline whisper-large-v2 model, we demonstrate its superior ability to accurately capture the content of the original audio, achieve higher recognition accuracy, and exhibit faster recognition speed. Notably, our model outperforms other existing models in the specific task of recognizing mixed language.
    摘要 最近,Whisper 在英语自动语音识别(ASR)中已接近人类水平的稳健性和准确率,而在小语种和混合语言语音识别方面仍有很大的改进空间。在这项工作中,我们展示了 Whisper-MCE 的出色结果:这是我们使用自行收集的粤英混合音频数据集(MCE)微调得到的 Whisper 模型。同时,考虑到词错误率(WER)在评估小语种和混合语言场景下的有效性时存在挑战,我们提出了一种新的评价机制。通过与基线 whisper-large-v2 模型比较,我们表明该模型能更准确地捕捉原始音频的内容,取得更高的识别准确率,并具有更快的识别速度。值得一提的是,我们的模型在混合语言识别这一特定任务中胜过其他现有模型。
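
A minimal fine-tuning skeleton for adapting a pretrained Whisper checkpoint to mixed Cantonese-English audio with Hugging Face Transformers; the dataset loading, collator, and hyperparameters are placeholder assumptions, since MCE itself is not publicly specified here.

```python
# Whisper fine-tuning skeleton; dataset and collator are placeholders.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

def preprocess(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"],
        return_tensors="pt").input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcript"]).input_ids
    return batch

# train_ds = your_mixed_cantonese_english_dataset.map(preprocess)  # hypothetical
args = Seq2SeqTrainingArguments(
    output_dir="whisper-mce", per_device_train_batch_size=8,
    learning_rate=1e-5, max_steps=4000, fp16=True,
    predict_with_generate=True)
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds,
#                          data_collator=your_padding_collator)  # hypothetical
# trainer.train()
```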

SOUL: Towards Sentiment and Opinion Understanding of Language

  • paper_url: http://arxiv.org/abs/2310.17924
  • repo_url: https://github.com/damo-nlp-sg/soul
  • paper_authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
  • for: 评估语言模型在情感分析领域的能力,探讨语言模型是否能够理解语言中的情感和意见。
  • methods: 提出了一种新任务,即语言情感与意见理解(SOUL),包括两个子任务:评论理解(RC)和理由生成(JG)。
  • results: 实验结果表明,SOUL 对现有语言模型而言极具挑战性,与人类表现相比差距可达27%。此外,人类专家与 GPT-4 的评估表明,小型语言模型在生成基于推理的理由方面存在局限。这些结果凸显了现有模型在情感分析中面临的复杂性,以及进一步发展情感分析的需求。
    Abstract Sentiment analysis is a well-established natural language processing task, with sentiment polarity classification being one of its most popular and representative tasks. However, despite the success of pre-trained language models in this area, they often fall short of capturing the broader complexities of sentiment analysis. To address this issue, we propose a new task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG). RC seeks to validate statements that focus on subjective information based on a review text, while JG requires models to provide explanations for their sentiment predictions. To enable comprehensive evaluation, we annotate a new dataset comprising 15,028 statements from 3,638 reviews. Experimental results indicate that SOUL is a challenging task for both small and large language models, with a performance gap of up to 27% when compared to human performance. Furthermore, evaluations conducted with both human experts and GPT-4 highlight the limitations of the small language model in generating reasoning-based justifications. These findings underscore the challenging nature of the SOUL task for existing models, emphasizing the need for further advancements in sentiment analysis to address its complexities. The new dataset and code are available at https://github.com/DAMO-NLP-SG/SOUL.
    摘要 情感分析是一项成熟的自然语言处理任务,其中情感极性分类是最受欢迎、最具代表性的子任务之一。然而,尽管预训练语言模型在该领域取得了成功,它们往往无法捕捉情感分析更广泛的复杂性。为了解决这个问题,我们提出了一个新任务:语言情感与意见理解(SOUL)。SOUL 通过两个子任务评估情感理解能力:评论理解(RC)和理由生成(JG)。RC 旨在依据评论文本验证聚焦于主观信息的陈述,而 JG 要求模型为其情感预测提供解释。为了实现全面评估,我们标注了一个新数据集,包含来自3,638篇评论的15,028条陈述。实验结果表明,SOUL 对大小语言模型而言都是一项具有挑战性的任务,与人类表现相比差距可达27%。此外,人类专家与 GPT-4 的评估均表明,小型语言模型在生成基于推理的理由方面存在局限。这些发现凸显了 SOUL 任务对现有模型的挑战性,强调情感分析需要进一步发展以应对其复杂性。新数据集和代码可在 https://github.com/DAMO-NLP-SG/SOUL 获取。
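
To make the two subtasks concrete, here is an illustrative sketch of what RC and JG instances might look like; the field names and label set are assumptions based only on the task descriptions above.

```python
# Hypothetical instance formats for the two SOUL subtasks.
review = "The battery lasts two days, but the screen scratches easily."

rc_instance = {          # Review Comprehension: validate a subjective claim
    "review": review,
    "statement": "The reviewer is satisfied with the battery life.",
    "label": "true",     # assumed label set: true / false / not-given
}
jg_instance = {          # Justification Generation: explain the prediction
    "review": review,
    "sentiment": "mixed",
    "justification": "Praises battery life while criticizing screen durability.",
}
```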

3D-Aware Visual Question Answering about Parts, Poses and Occlusions

  • paper_url: http://arxiv.org/abs/2310.17914
  • repo_url: https://github.com/xingruiwang/3d-aware-vqa
  • paper_authors: Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan Yuille
  • for: 推动3D感知视觉问答(VQA)领域的进步,提升VQA模型对3D场景的理解。
  • methods: 提出了3D感知VQA任务,并构建了 Super-CLEVR-3D 数据集,用以挑战VQA模型的组合推理能力。
  • results: 提出了PO3D-VQA模型,结合概率神经符号程序执行与具有3D生成式物体表示的深度神经网络,实现稳健的视觉识别与推理。实验结果显示PO3D-VQA在3D感知VQA任务中表现出色,但与2D VQA基准相比仍存在明显的性能差距,表明3D感知VQA仍是一个重要的开放研究领域。
    Abstract Despite rapid progress in Visual question answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks like navigation or manipulation. This includes an understanding of the 3D object pose, their parts and occlusions. In this work, we introduce the task of 3D-aware VQA, which focuses on challenging questions that require a compositional reasoning over the 3D structure of visual scenes. We address 3D-aware VQA from both the dataset and the model perspective. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. Second, we propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition. Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but we still observe a significant performance gap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area.
    摘要 尽管视觉问答(VQA)进展迅速,现有数据集和模型主要测试二维场景中的推理能力。然而,VQA模型同样需要理解视觉场景的三维结构,以支持导航或操作等任务,这包括理解三维物体姿态、部件和遮挡。在这项工作中,我们引入3D感知VQA任务,聚焦于需要对视觉场景三维结构进行组合推理的挑战性问题。我们从数据集和模型两个角度着手解决3D感知VQA。首先,我们介绍 Super-CLEVR-3D,这是一个组合推理数据集,包含关于物体部件、三维姿态和遮挡的问题。其次,我们提出 PO3D-VQA,一个3D感知VQA模型,它融合了两种强大的思想:用于推理的概率神经符号程序执行,以及结合物体3D生成式表示的深度神经网络以实现稳健的视觉识别。实验结果表明,PO3D-VQA 显著优于现有方法,但与2D VQA基准相比仍存在明显的性能差距,说明3D感知VQA仍是一个重要的开放研究领域。
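
A toy sketch of symbolic program execution over a 3D-aware scene representation, in the spirit of the part/pose/occlusion questions Super-CLEVR-3D asks; the scene schema and operators are drastic simplifications of PO3D-VQA's probabilistic neural components.

```python
# Toy scene representation with pose and occlusion attributes, plus a
# small symbolic program executed step by step. Entirely illustrative.
scene = [
    {"name": "car", "pose_deg": 90, "occluded": False,
     "parts": ["wheel", "door"]},
    {"name": "bus", "pose_deg": 180, "occluded": True,
     "parts": ["wheel", "mirror"]},
]

def filter_part(objs, part):
    return [o for o in objs if part in o["parts"]]

def filter_occluded(objs):
    return [o for o in objs if o["occluded"]]

def query_pose(objs):
    return [o["pose_deg"] for o in objs]

# "What is the pose of the occluded object that has a mirror?"
program = [lambda s: filter_part(s, "mirror"), filter_occluded, query_pose]
result = scene
for step in program:
    result = step(result)
print(result)  # [180]
```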

TarGEN: Targeted Data Generation with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.17876
  • repo_url: None
  • paper_authors: Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra
  • for: 这篇论文旨在提供一种基于大语言模型(LLM)的多步提示策略(TarGEN),用于生成高质量的合成数据集。
  • methods: 该策略基于LLM,且不需要特定任务实例,因此可广泛应用于不同任务。此外,作者还提出了一种自我修正技术,使LLM能在数据创建过程中纠正错误标签。
  • results: 通过在8个SuperGLUE任务上训练不同类型的语言模型(包括仅编码器、编码器-解码器和仅解码器模型),作者发现 TarGEN 能生成高质量的合成数据集;与在原始数据集上训练相比,在 TarGEN 数据集上训练的模型表现约高出1-2个百分点。
    Abstract The rapid advancement of large language models (LLMs) has sparked interest in data synthesis techniques, aiming to generate diverse and high-quality synthetic datasets. However, these synthetic datasets often suffer from a lack of diversity and added noise. In this paper, we present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets utilizing a LLM. An advantage of TarGEN is its seedless nature; it does not require specific task instances, broadening its applicability beyond task replication. We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances during dataset creation, ensuring reliable labels. To assess our technique's effectiveness, we emulate 8 tasks from the SuperGLUE benchmark and finetune various language models, including encoder-only, encoder-decoder, and decoder-only models on both synthetic and original training sets. Evaluation on the original test set reveals that models trained on datasets generated by TarGEN perform approximately 1-2% points better than those trained on original datasets (82.84% via syn. vs. 81.12% on og. using Flan-T5). When incorporating instruction tuning, the performance increases to 84.54% on synthetic data vs. 81.49% on original data by Flan-T5. A comprehensive analysis of the synthetic dataset compared to the original dataset reveals that the synthetic dataset demonstrates similar or higher levels of dataset complexity and diversity. Furthermore, the synthetic dataset displays a bias level that aligns closely with the original dataset. Finally, when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive results on the OpenLLM leaderboard, surpassing the model trained on the Self-Instruct dataset by 4.14% points. We hope that TarGEN can be helpful for quality data generation and reducing the human efforts to create complex benchmarks.
    摘要 大语言模型(LLM)的快速进步引发了对数据合成技术的兴趣,旨在生成多样化且高质量的合成数据集。然而,这些合成数据集往往缺乏多样性并带有额外噪音。在本文中,我们提出了 TarGEN,一种利用LLM生成高质量合成数据集的多步提示策略。TarGEN 的优点在于无需种子实例:它不要求特定任务实例,因此适用范围超出了任务复制。我们还为 TarGEN 增加了一种自我修正技术,使LLM能在数据集创建过程中纠正标注错误的实例,确保标签可靠。为评估该技术的效果,我们模拟了 SuperGLUE 基准中的8个任务,并在合成与原始训练集上微调多种语言模型(包括仅编码器、编码器-解码器和仅解码器模型)。在原始测试集上的评估显示,在 TarGEN 生成的数据集上训练的模型比在原始数据集上训练的模型高出约1-2个百分点(Flan-T5:合成82.84%对原始81.12%)。引入指令微调后,Flan-T5 在合成数据上达到84.54%,在原始数据上为81.49%。对合成数据集与原始数据集的全面分析表明,合成数据集具有相近或更高的复杂性与多样性,且其偏差水平与原始数据集高度一致。最后,在我们的合成 SuperGLUE 数据集上预微调后,T5-3B 在 OpenLLM 排行榜上取得了亮眼成绩,以4.14个百分点超越了在 Self-Instruct 数据集上训练的模型。我们希望 TarGEN 能帮助生成高质量数据,并减少人工构建复杂基准所需的工作量。
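
A hedged sketch of a multi-step generation loop with self-correction in the spirit of TarGEN: generate a labeled instance, then ask the model to verify and, if needed, fix its own label. The prompts are illustrative assumptions, not the paper's templates.

```python
# Multi-step synthetic data generation with a self-correction pass.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    r = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content.strip()

task = "textual entailment (labels: entailment / not_entailment)"
premise = ask(f"Write a one-sentence premise for a {task} example.")
hypothesis = ask(f"Premise: {premise}\nWrite a hypothesis entailed by it.")
label = "entailment"  # provisional label from the generation plan

# Self-correction: let the model re-check the provisional label.
verdict = ask(f"Premise: {premise}\nHypothesis: {hypothesis}\n"
              f"Is the correct label '{label}'? Reply with the final label.")
instance = {"premise": premise, "hypothesis": hypothesis, "label": verdict}
print(instance)
```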

From Values to Opinions: Predicting Human Behaviors and Stances Using Value-Injected Large Language Models

  • paper_url: http://arxiv.org/abs/2310.17857
  • repo_url: https://github.com/dongjunkang/vim
  • paper_authors: Dongjun Kang, Joonsuk Park, Yohan Jo, JinYeong Bak
  • for: 在现实场景中预测人们对问题的意见与行为,对政治、市场营销等多个领域都有帮助。
  • methods: 我们提出使用价值注入的大语言模型(LLM)来预测意见和行为,并提出了价值注入方法(VIM)——包括论证生成和问答两种方法——通过微调将目标价值分布注入LLM。
  • results: 我们在四个任务上进行了一系列实验,发现价值注入的LLM显著优于基线,表明价值注入的LLM能够更好地预测人们的意见和行为。
    Abstract Being able to predict people's opinions on issues and behaviors in realistic scenarios can be helpful in various domains, such as politics and marketing. However, conducting large-scale surveys like the European Social Survey to solicit people's opinions on individual issues can incur prohibitive costs. Leveraging prior research showing influence of core human values on individual decisions and actions, we propose to use value-injected large language models (LLM) to predict opinions and behaviors. To this end, we present Value Injection Method (VIM), a collection of two methods -- argument generation and question answering -- designed to inject targeted value distributions into LLMs via fine-tuning. We then conduct a series of experiments on four tasks to test the effectiveness of VIM and the possibility of using value-injected LLMs to predict opinions and behaviors of people. We find that LLMs value-injected with variations of VIM substantially outperform the baselines. Also, the results suggest that opinions and behaviors can be better predicted using value-injected LLMs than the baseline approaches.
    摘要 能够在现实场景中预测人们对问题的意见和行为,在政治、市场营销等多个领域都有帮助。然而,开展如欧洲社会调查这样的大规模调查以征询人们对各个问题的意见,成本可能高得令人却步。借助已有研究表明核心人类价值观会影响个人决策与行为,我们提出使用价值注入的大语言模型(LLM)来预测意见和行为。为此,我们提出了价值注入方法(VIM),包含论证生成与问答两种方法,通过微调将目标价值分布注入LLM。随后,我们在四个任务上进行了一系列实验,以检验VIM的有效性以及使用价值注入LLM预测人们意见和行为的可能性。我们发现,经过VIM各变体价值注入的LLM显著优于基线;结果还表明,相比基线方法,价值注入LLM能够更好地预测意见和行为。
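
A sketch of how value-injection fine-tuning data could be constructed with VIM's two methods, argument generation and question answering; the value scale, prompt wording, and example content are illustrative assumptions.

```python
# Constructing value-injection fine-tuning examples; all content below is
# hypothetical and only mirrors the two methods named in the abstract.
target_values = {"achievement": 6, "benevolence": 2}  # assumed 1-6 scale

# Method 1: argument generation conditioned on the value profile.
arg_example = {
    "prompt": (f"Given the value profile {target_values}, argue for or "
               "against: 'Workplaces should rank employees publicly.'"),
    "completion": "Public rankings reward excellence and push people to "
                  "achieve more, which outweighs the discomfort they cause.",
}

# Method 2: question answering about the values themselves.
qa_example = {
    "prompt": "On a 1-6 scale, how important is achievement to you?",
    "completion": "6 - succeeding and demonstrating competence matters most to me.",
}

finetune_data = [arg_example, qa_example]  # fed to standard supervised fine-tuning
```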

SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation

  • paper_url: http://arxiv.org/abs/2310.18376
  • repo_url: None
  • paper_authors: Adrián Bazaga, Pietro Liò, Gos Micklem
  • for: 这篇论文旨在解决文本到SQL翻译任务中的难题,即将自然语言问题转换为可执行的SQL查询。
  • methods: 该论文提出了一种名为SQLformer的新型Transformer架构,用于文本到SQL翻译任务。该模型以自回归方式将SQL查询预测为抽象语法树(AST),并在编码器和解码器层中引入结构归纳偏置。
  • results: 在具有挑战性的文本到SQL基准 Spider 上,SQLformer 取得了最先进的表现,并对未见过的数据库展现出良好的泛化能力。
    Abstract In recent years, there has been growing interest in text-to-SQL translation, which is the task of converting natural language questions into executable SQL queries. This technology is important for its potential to democratize data extraction from databases. However, some of its key hurdles include domain generalisation, which is the ability to adapt to previously unseen databases, and alignment of natural language questions with the corresponding SQL queries. To overcome these challenges, we introduce SQLformer, a novel Transformer architecture specifically crafted to perform text-to-SQL translation tasks. Our model predicts SQL queries as abstract syntax trees (ASTs) in an autoregressive way, incorporating structural inductive bias in the encoder and decoder layers. This bias, guided by database table and column selection, aids the decoder in generating SQL query ASTs represented as graphs in a Breadth-First Search canonical order. Comprehensive experiments illustrate the state-of-the-art performance of SQLformer in the challenging text-to-SQL Spider benchmark. Our implementation is available at https://github.com/AdrianBZG/SQLformer
    摘要 近年来,文本到SQL翻译技术受到了越来越多的关注,即将自然语言问题转换为可执行SQL查询的任务。这项技术有望让更多人能够从数据库中提取数据。然而,它面临的关键挑战包括领域泛化(即适应此前未见过的数据库)以及自然语言问题与相应SQL查询的对齐。为了应对这些挑战,我们提出了 SQLformer,一种专为文本到SQL翻译任务设计的新型 Transformer 架构。我们的模型以自回归方式将SQL查询预测为抽象语法树(AST),并在编码器和解码器层中引入结构归纳偏置。该偏置由数据库表和列的选择引导,帮助解码器以广度优先搜索规范顺序生成以图表示的SQL查询AST。全面的实验表明,SQLformer 在具有挑战性的文本到SQL基准 Spider 上取得了最先进的性能。我们的实现可在 https://github.com/AdrianBZG/SQLformer 获取。
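
A small sketch of serializing a SQL abstract syntax tree in breadth-first canonical order, the decoding order described for SQLformer; the node schema is an illustrative assumption.

```python
# BFS canonical serialization of a toy SQL AST; nodes are (label, children).
from collections import deque

ast = ("SELECT", [
    ("COLUMN", [("name", [])]),
    ("FROM", [("TABLE", [("singer", [])])]),
    ("WHERE", [("GT", [("age", []), ("30", [])])]),
])

def bfs_order(root):
    order, queue = [], deque([root])
    while queue:
        label, children = queue.popleft()
        order.append(label)
        queue.extend(children)  # children enqueued left-to-right
    return order

print(bfs_order(ast))
# ['SELECT', 'COLUMN', 'FROM', 'WHERE', 'name', 'TABLE', 'GT',
#  'singer', 'age', '30']
```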