cs.CL - 2023-10-23

GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions

  • paper_url: http://arxiv.org/abs/2310.15405
  • repo_url: None
  • paper_authors: Ting-Yao Hsu, Chieh-Yang Huang, Ryan Rossi, Sungchul Kim, C. Lee Giles, Ting-Hao K. Huang
  • for: This paper aims to evaluate the effectiveness of using large language models (LLMs) as a cost-effective, reference-free method for assessing the quality of scientific figure captions.
  • methods: The authors constructed a human evaluation dataset called SCICAP-EVAL, which contains human judgments for 3,600 scientific figure captions, and used LLMs like GPT-4 and GPT-3 to score each caption based on its potential to aid reader understanding.
  • results: The results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by Computer Science and Informatics undergraduates, achieving a Kendall correlation score of 0.401 with Ph.D. students' rankings.
    Abstract There is growing interest in systems that generate captions for scientific figures. However, assessing these systems' output poses a significant challenge. Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions. This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions. We first constructed SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600 scientific figure captions, both original and machine-made, for 600 arXiv figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption based on its potential to aid reader understanding, given relevant context such as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by Computer Science and Informatics undergraduates, achieving a Kendall correlation score of 0.401 with Ph.D. students' rankings.
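The Kendall correlation reported above measures rank agreement between GPT-4's 1-6 scores and the Ph.D. annotators' rankings. A minimal tau-a computation can be sketched as follows (the scores and ranks below are hypothetical toy data, not from the paper):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    assert len(xs) == len(ys) and len(xs) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    total = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total

# Hypothetical data: LLM scores (1-6, higher = better) for five captions,
# and human rankings (1 = best). Negate ranks so both scales point the same way.
llm_scores = [6, 4, 5, 2, 3]
human_ranks = [1, 3, 2, 5, 4]
tau = kendall_tau(llm_scores, [-r for r in human_ranks])
print(tau)  # 1.0 for this perfectly agreeing toy data
```

A tau of 0.401, as reported for GPT-4 versus Ph.D. students, indicates substantially more concordant than discordant caption pairs.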

  • paper_url: http://arxiv.org/abs/2310.15398
  • repo_url: None
  • paper_authors: Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu
  • for: To understand what people consider appropriate NLG system behavior, and thereby what constitutes fair behavior for such systems.
  • methods: The authors design and conduct five case studies in which they perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs to probe the tension between invariance and adaptation.
  • results: Motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; motivations for invariance include prescriptivism, the view that adaptation is unnecessary or too difficult for NLG systems to do appropriately, and wariness of false assumptions. These findings show that defining appropriate NLG system behavior remains an open challenge.
    Abstract Fairness-related assumptions about what constitutes appropriate NLG system behaviors range from invariance, where systems are expected to respond identically to social groups, to adaptation, where responses should instead vary across them. We design and conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs to illuminate tensions around invariance and adaptation. We outline people's expectations of system behaviors, and surface potential caveats of these two contrasting yet commonly-held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around defining what constitutes fair NLG system behavior.

GD-COMET: A Geo-Diverse Commonsense Inference Model

  • paper_url: http://arxiv.org/abs/2310.15383
  • repo_url: None
  • paper_authors: Mehar Bhatia, Vered Shwartz
  • for: To make AI systems more diverse and inclusive so they can serve users from a wide range of cultural backgrounds.
  • methods: Building on the COMET model, the authors develop GD-COMET, a geo-diverse version that covers commonsense knowledge from a broad range of cultures.
  • results: Through human evaluation across 5 diverse cultures and extrinsic evaluation on a geo-diverse task, GD-COMET is shown to capture and generate culturally nuanced commonsense knowledge, demonstrating its potential to benefit NLP applications and make NLP more inclusive.
    Abstract With the increasing integration of AI into everyday life, it's becoming crucial to design AI systems that serve users from diverse backgrounds by making them culturally aware. In this paper, we present GD-COMET, a geo-diverse version of the COMET commonsense inference model. GD-COMET goes beyond Western commonsense knowledge and is capable of generating inferences pertaining to a broad range of cultures. We demonstrate the effectiveness of GD-COMET through a comprehensive human evaluation across 5 diverse cultures, as well as extrinsic evaluation on a geo-diverse task. The evaluation shows that GD-COMET captures and generates culturally nuanced commonsense knowledge, demonstrating its potential to benefit NLP applications across the board and contribute to making NLP more inclusive.

A Review of Reinforcement Learning for Natural Language Processing, and Applications in Healthcare

  • paper_url: http://arxiv.org/abs/2310.18354
  • repo_url: None
  • paper_authors: Ying Liu, Haozhu Wang, Huixue Zhou, Mingchen Li, Yu Hou, Sicheng Zhou, Fang Wang, Rama Hoetzlein, Rui Zhang
  • for: This paper reviews reinforcement learning (RL) in natural language processing (NLP), covering key advancements, challenges, and applications in healthcare.
  • methods: The paper surveys RL applications across NLP tasks, including dialogue systems, machine translation, question-answering systems, text summarization, and information extraction.
  • results: The paper also discusses ethical considerations and biases in RL-NLP systems.
    Abstract Reinforcement learning (RL) has emerged as a powerful approach for tackling complex medical decision-making problems such as treatment planning, personalized medicine, and optimizing the scheduling of surgeries and appointments. It has gained significant attention in the field of Natural Language Processing (NLP) due to its ability to learn optimal strategies for tasks such as dialogue systems, machine translation, and question-answering. This paper presents a review of the RL techniques in NLP, highlighting key advancements, challenges, and applications in healthcare. The review begins by visualizing a roadmap of machine learning and its applications in healthcare, and then explores the integration of RL with NLP tasks. We examine dialogue systems where RL enables the learning of conversational strategies, RL-based machine translation models, question-answering systems, text summarization, and information extraction. Additionally, ethical considerations and biases in RL-NLP systems are addressed.

Specialist or Generalist? Instruction Tuning for Specific NLP Tasks

  • paper_url: http://arxiv.org/abs/2310.15326
  • repo_url: https://github.com/DavidFanzz/Generalist_or_Specialist
  • paper_authors: Chufan Shi, Yixuan Su, Cheng Yang, Yujiu Yang, Deng Cai
  • for: This work investigates whether broad-coverage generalist instruction tuning of large language models (LLMs) can contribute to building specialist models for specific natural language processing (NLP) tasks.
  • methods: The study combines broad-coverage generalist instruction tuning with task-specific training, assessing four target tasks with distinct coverage levels and three target tasks focusing on different capabilities.
  • results: Integrating generalist instruction tuning consistently improves performance when task coverage is broad, especially when task-specific training data is limited. Generalist instruction tuning also improves understanding and reasoning abilities, but for tasks requiring factual knowledge, generalist data containing hallucinatory information may hurt performance.
    Abstract The potential of large language models (LLMs) to simultaneously perform a wide range of natural language processing (NLP) tasks has been the subject of extensive research. Although instruction tuning has proven to be a data-efficient method for transforming LLMs into such generalist models, their performance still lags behind specialist models trained exclusively for specific tasks. In this paper, we investigate whether incorporating broad-coverage generalist instruction tuning can contribute to building a specialist model. We hypothesize that its efficacy depends on task specificity and skill requirements. Our experiments assess four target tasks with distinct coverage levels, revealing that integrating generalist instruction tuning consistently enhances model performance when the task coverage is broad. The effect is particularly pronounced when the amount of task-specific training data is limited. Further investigation into three target tasks focusing on different capabilities demonstrates that generalist instruction tuning improves understanding and reasoning abilities. However, for tasks requiring factual knowledge, generalist data containing hallucinatory information may negatively affect the model's performance. Overall, our work provides a systematic guide for developing specialist models with general instruction tuning. Our code and other related resources can be found at https://github.com/DavidFanzz/Generalist_or_Specialist.
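The finding that generalist data helps most when specialist data is scarce suggests a simple training-set construction step: mix the task-specific examples with a sampled slice of broad-coverage instruction data. A hypothetical sketch (the mixing ratio and helper name are illustrative, not from the paper):

```python
import random

def mix_instruction_data(specialist, generalist, general_ratio=1.0, seed=0):
    """Combine task-specific examples with a sampled slice of broad-coverage
    generalist instruction data, then shuffle for training."""
    rng = random.Random(seed)
    k = min(int(len(specialist) * general_ratio), len(generalist))
    mixed = specialist + rng.sample(generalist, k)
    rng.shuffle(mixed)
    return mixed

# Toy corpora: 100 task-specific examples, 1000 generalist instructions.
specialist = [{"task": "ner", "input": f"s{i}"} for i in range(100)]
generalist = [{"task": "general", "input": f"g{i}"} for i in range(1000)]
train = mix_instruction_data(specialist, generalist)
print(len(train))  # 200: 100 specialist + 100 sampled generalist examples
```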

LXMERT Model Compression for Visual Question Answering

  • paper_url: http://arxiv.org/abs/2310.15325
  • repo_url: https://github.com/ghazaleh-mahmoodi/lxmert_compression
  • paper_authors: Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi
  • for: To evaluate whether trainable subnetworks exist in LXMERT when fine-tuned on the VQA task, and to investigate how much pruning can be done without significant loss in accuracy.
  • methods: The study fine-tunes the LXMERT model on VQA and prunes it to reduce its size.
  • results: Experimental results show that LXMERT can be pruned by 40%-60% in size with only a 3% loss in accuracy.
    Abstract Large-scale pretrained models such as LXMERT are becoming popular for learning cross-modal representations on text-image pairs for vision-language tasks. According to the lottery ticket hypothesis, NLP and computer vision models contain smaller subnetworks capable of being trained in isolation to full performance. In this paper, we combine these observations to evaluate whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA task. In addition, we perform a model size cost-benefit analysis by investigating how much pruning can be done without significant loss in accuracy. Our experiment results demonstrate that LXMERT can be effectively pruned by 40%-60% in size with 3% loss in accuracy.
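The lottery-ticket-style pruning above rests on a simple primitive: zeroing out the smallest-magnitude weights. A toy, list-based sketch of unstructured magnitude pruning (illustrative, not the paper's code):

```python
def magnitude_prune(weights, sparsity):
    """Zero out (approximately) the smallest-magnitude `sparsity` fraction
    of a weight matrix; ties at the threshold are also pruned."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

weights = [[0.1, -0.5], [0.3, 0.9]]
print(magnitude_prune(weights, 0.5))  # [[0.0, -0.5], [0.0, 0.9]]
```

In practice this would be applied per layer to the fine-tuned LXMERT weights (e.g. via framework pruning utilities) at 40%-60% sparsity.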

Exploring the Potential of Large Language Models in Generating Code-Tracing Questions for Introductory Programming Courses

  • paper_url: http://arxiv.org/abs/2310.15317
  • repo_url: None
  • paper_authors: Aysa Xuemo Fan, Ranran Haoran Zhang, Luc Paquette, Rui Zhang
  • for: This paper explores the application of large language models (LLMs) to generating code-tracing questions in introductory programming courses.
  • methods: The authors designed targeted prompts guiding GPT-4 to generate code-tracing questions based on code snippets and descriptions.
  • results: The study finds that LLMs can generate diverse code-tracing questions, and contributes a unique dataset of human- and LLM-generated tracing questions, a valuable resource for both the education and NLP research communities.
    Abstract In this paper, we explore the application of large language models (LLMs) for generating code-tracing questions in introductory programming courses. We designed targeted prompts for GPT4, guiding it to generate code-tracing questions based on code snippets and descriptions. We established a set of human evaluation metrics to assess the quality of questions produced by the model compared to those created by human experts. Our analysis provides insights into the capabilities and potential of LLMs in generating diverse code-tracing questions. Additionally, we present a unique dataset of human and LLM-generated tracing questions, serving as a valuable resource for both the education and NLP research communities. This work contributes to the ongoing dialogue on the potential uses of LLMs in educational settings.
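A minimal sketch of the kind of targeted prompt construction involved (the template below is hypothetical and illustrative, not the paper's actual prompt):

```python
def build_tracing_prompt(code_snippet, description):
    """Assemble a prompt asking an LLM to write a code-tracing question
    for an introductory programming course (hypothetical template)."""
    return (
        "You are an instructor in an introductory programming course.\n\n"
        f"Code snippet:\n{code_snippet}\n\n"
        f"Description: {description}\n\n"
        "Write one code-tracing question that asks students to predict the "
        "program's output, then give the correct answer."
    )

prompt = build_tracing_prompt(
    "for i in range(3):\n    print(i)",
    "A simple counting loop.",
)
```

The resulting string would be sent to the GPT-4 API, and the generated questions assessed with the paper's human evaluation metrics.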

Probing Representations for Document-level Event Extraction

  • paper_url: http://arxiv.org/abs/2310.15316
  • repo_url: https://github.com/githubarry/docie-probing
  • paper_authors: Barry Wang, Xinya Du, Claire Cardie
  • for: This work applies the probing-classifier framework to interpret the representations learned by deep neural models for document-level information extraction (IE).
  • methods: Eight embedding probes are used to analyze surface, semantic, and event-understanding capabilities relevant to document-level event extraction.
  • results: Representations learned by LLM-based document-level IE approaches modestly improve argument detection and labeling but only slightly enhance event-level tasks, and encoder models struggle with document length and cross-sentence discourse.
    Abstract The probing classifiers framework has been employed for interpreting deep neural network models for a variety of natural language processing (NLP) applications. Studies, however, have largely focused on sentence-level NLP tasks. This work is the first to apply the probing paradigm to representations learned for document-level information extraction (IE). We designed eight embedding probes to analyze surface, semantic, and event-understanding capabilities relevant to document-level event extraction. We apply them to the representations acquired by learning models from three different LLM-based document-level IE approaches on a standard dataset. We found that trained encoders from these models yield embeddings that can modestly improve argument detection and labeling but only slightly enhance event-level tasks, albeit with trade-offs in information helpful for coherence and event-type prediction. We further found that encoder models struggle with document length and cross-sentence discourse.

Adaptive End-to-End Metric Learning for Zero-Shot Cross-Domain Slot Filling

  • paper_url: http://arxiv.org/abs/2310.15294
  • repo_url: https://github.com/switchsyj/adae2ml-xsf
  • paper_authors: Yuanjun Shi, Linzhi Wu, Minglai Shao
  • for: This paper targets zero-shot slot filling, i.e., filling slots in domains never seen during training.
  • methods: The authors propose an adaptive end-to-end metric learning scheme comprising a cascade-style joint learning framework, context-aware soft label representations, and slot-level contrastive representation learning, aiming for efficiency, generalizability, and effective mitigation of data and label shift.
  • results: Extensive experiments on public benchmarks show that the proposed approach outperforms a series of competitive metric-learning baselines.
    Abstract Recently slot filling has witnessed great development thanks to deep learning and the availability of large-scale annotated data. However, it poses a critical challenge to handle a novel domain whose samples are never seen during training. The recognition performance might be greatly degraded due to severe domain shifts. Most prior works deal with this problem in a two-pass pipeline manner based on metric learning. In practice, these dominant pipeline models may be limited in computational efficiency and generalization capacity because of non-parallel inference and context-free discrete label embeddings. To this end, we re-examine the typical metric-based methods, and propose a new adaptive end-to-end metric learning scheme for the challenging zero-shot slot filling. Considering simplicity, efficiency and generalizability, we present a cascade-style joint learning framework coupled with context-aware soft label representations and slot-level contrastive representation learning to mitigate the data and label shift problems effectively. Extensive experiments on public benchmarks demonstrate the superiority of the proposed approach over a series of competitive baselines.
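The slot-level contrastive component pulls representations of the same slot type together and pushes different types apart. An InfoNCE-style loss of the kind commonly used for such contrastive learning can be sketched as follows (illustrative, not the paper's exact objective):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is close to the positive and
    far from all negatives (pure-Python sketch over small vectors)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Anchor aligned with the positive, orthogonal to the negative: loss near zero.
loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```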

On the Dimensionality of Sentence Embeddings

  • paper_url: http://arxiv.org/abs/2310.15285
  • repo_url: https://github.com/WM-SEMERU/SecureReqNet
  • paper_authors: Hongwei Wang, Hongming Zhang, Dong Yu
  • for: This paper studies the dimensionality of sentence embeddings.
  • methods: The paper proposes a two-step training method in which the encoder and the pooler are optimized separately to mitigate the overall performance loss in low-dimension scenarios.
  • results: Experiments on seven STS tasks and seven sentence classification tasks show that the method significantly improves the performance of low-dimensional sentence embeddings.
    Abstract Learning sentence embeddings is a fundamental problem in natural language processing. While existing research primarily focuses on enhancing the quality of sentence embeddings, the exploration of sentence embedding dimensions is limited. Here we present a comprehensive and empirical analysis of the dimensionality of sentence embeddings. First, we demonstrate that the optimal dimension of sentence embeddings is usually smaller than the default value. Subsequently, to compress the dimension of sentence embeddings with minimum performance degradation, we identify two components contributing to the overall performance loss: the encoder's performance loss and the pooler's performance loss. Therefore, we propose a two-step training method for sentence representation learning models, wherein the encoder and the pooler are optimized separately to mitigate the overall performance loss in low-dimension scenarios. Experimental results on seven STS tasks and seven sentence classification tasks demonstrate that our method significantly improves the performance of low-dimensional sentence embeddings.
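The paper's two-step method trains the encoder and pooler separately; the naive alternative it improves on is simply keeping the first d coordinates of a full-size embedding and comparing similarities there. That baseline can be sketched as (toy vectors, purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def truncate(embedding, dim):
    """Naive compression: keep only the first `dim` coordinates."""
    return embedding[:dim]

e1 = [0.9, 0.1, 0.02, 0.01]
e2 = [0.8, 0.2, 0.5, -0.3]
full_sim = cosine(e1, e2)                              # similarity at full dimension
low_sim = cosine(truncate(e1, 2), truncate(e2, 2))     # similarity in 2 dimensions
```

The gap between `full_sim` and `low_sim` is the kind of degradation the paper attributes separately to the encoder and the pooler, and then mitigates by optimizing the two components in turn.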

Efficient Algorithms for Recognizing Weighted Tree-Adjoining Languages

  • paper_url: http://arxiv.org/abs/2310.15276
  • repo_url: None
  • paper_authors: Alexandra Butoi, Tim Vieira, Ryan Cotterell, David Chiang
  • for: This work studies the class of tree-adjoining languages, characterized by two-level formalisms in which a context-free grammar (CFG) or pushdown automaton (PDA) controls another CFG or PDA.
  • methods: The authors define semiring-weighted versions of these two-level formalisms and design new algorithms for computing their stringsums (the weight of all derivations of a string) and allsums (the weight of all derivations).
  • results: For linear indexed grammars (LIG), the new algorithm is more time-efficient than that of Vijay-Shanker and Weir (1989) by a factor of $\mathcal{O}(n|\mathcal{N}|)$ and more space-efficient by a factor of $\mathcal{O}(|\Gamma|)$; for embedded pushdown automata (EPDA), it is more space- and time-efficient than the algorithm of Alonso et al. (2001) by factors of $\mathcal{O}(|\Gamma|^2)$ and $\mathcal{O}(|\Gamma|^3)$, respectively. The paper also gives the first stringsum and allsum algorithms for pushdown-adjoining automata (PAA).
    Abstract The class of tree-adjoining languages can be characterized by various two-level formalisms, consisting of a context-free grammar (CFG) or pushdown automaton (PDA) controlling another CFG or PDA. These four formalisms are equivalent to tree-adjoining grammars (TAG), linear indexed grammars (LIG), pushdown-adjoining automata (PAA), and embedded pushdown automata (EPDA). We define semiring-weighted versions of the above two-level formalisms, and we design new algorithms for computing their stringsums (the weight of all derivations of a string) and allsums (the weight of all derivations). From these, we also immediately obtain stringsum and allsum algorithms for TAG, LIG, PAA, and EPDA. For LIG, our algorithm is more time-efficient by a factor of $\mathcal{O}(n|\mathcal{N}|)$ (where $n$ is the string length and $|\mathcal{N}|$ is the size of the nonterminal set) and more space-efficient by a factor of $\mathcal{O}(|\Gamma|)$ (where $|\Gamma|$ is the size of the stack alphabet) than the algorithm of Vijay-Shanker and Weir (1989). For EPDA, our algorithm is both more space-efficient and time-efficient than the algorithm of Alonso et al. (2001) by factors of $\mathcal{O}(|\Gamma|^2)$ and $\mathcal{O}(|\Gamma|^3)$, respectively. Finally, we give the first PAA stringsum and allsum algorithms.
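For the simplest weighted formalism in this family, a real-weighted CFG in Chomsky normal form, the stringsum (total weight of all derivations of a string) is computed by the inside algorithm. A minimal sketch (illustrative; the paper's algorithms handle the two-level formalisms, not plain CFGs):

```python
from collections import defaultdict

def cfg_stringsum(binary_rules, lexical_rules, start, sentence):
    """Inside algorithm for a real-weighted CFG in CNF.
    binary_rules: {(A, B, C): weight} for rules A -> B C
    lexical_rules: {(A, word): weight} for rules A -> word"""
    n = len(sentence)
    inside = defaultdict(float)  # inside[(i, j, A)]: weight of A deriving sentence[i:j]
    for i, word in enumerate(sentence):
        for (A, w), weight in lexical_rules.items():
            if w == word:
                inside[(i, i + 1, A)] += weight
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), weight in binary_rules.items():
                    inside[(i, j, A)] += weight * inside[(i, k, B)] * inside[(k, j, C)]
    return inside[(0, n, start)]

# Toy grammar: S -> A A (weight 1.0), A -> 'a' (weight 0.5)
total = cfg_stringsum({("S", "A", "A"): 1.0}, {("A", "a"): 0.5}, "S", ["a", "a"])
print(total)  # 0.25 = 1.0 * 0.5 * 0.5
```

The paper's contribution is the analogous (but harder) computation for TAG, LIG, PAA, and EPDA, where derivations manipulate a controlled stack.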

GradSim: Gradient-Based Language Grouping for Effective Multilingual Training

  • paper_url: http://arxiv.org/abs/2310.15269
  • repo_url: https://github.com/boschresearch/gradsim
  • paper_authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze
  • for: To improve model performance on low-resource languages by sharing knowledge through multilingual training.
  • methods: The paper proposes GradSim, a language grouping method based on gradient similarity.
  • results: Experiments on three diverse multilingual benchmark datasets show the largest performance gains compared to other similarity measures and better correlation with cross-lingual model performance, setting a new state of the art on AfriSenti. The analysis further reveals that dataset topics, in addition to linguistic features, play an important role in language grouping, and that lower layers of transformer models encode language-specific features while higher layers capture task-specific information.
    Abstract Most languages of the world pose low-resource challenges to natural language processing models. With multilingual training, knowledge can be shared among languages. However, not all languages positively influence each other and it is an open research question how to select the most suitable set of languages for multilingual training and avoid negative interference among languages whose characteristics or data distributions are not compatible. In this paper, we propose GradSim, a language grouping method based on gradient similarity. Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains compared to other similarity measures and it is better correlated with cross-lingual model performance. As a result, we set the new state of the art on AfriSenti, a benchmark dataset for sentiment analysis on low-resource African languages. In our extensive analysis, we further reveal that besides linguistic features, the topics of the datasets play an important role for language grouping and that lower layers of transformer models encode language-specific features while higher layers capture task-specific information.
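GradSim's core measurement is the similarity between per-language gradients. A minimal sketch of gradient cosine similarity plus a greedy grouping step (the grouping heuristic below is illustrative; the paper's exact procedure may differ):

```python
import math

def grad_cosine(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    return dot / (math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2)))

def group_languages(grads, threshold=0.5):
    """Greedy grouping: a language joins an existing group if its gradient
    is similar enough to the group's first member, else starts a new group."""
    groups = []
    for lang, g in grads.items():
        for group in groups:
            if grad_cosine(g, grads[group[0]]) >= threshold:
                group.append(lang)
                break
        else:
            groups.append([lang])
    return groups

# Toy per-language gradients (in practice: flattened model gradients on each
# language's training batches).
grads = {"en": [1.0, 0.0], "de": [0.9, 0.1], "zh": [0.0, 1.0]}
print(group_languages(grads))  # [['en', 'de'], ['zh']]
```

Languages grouped together would then be trained jointly, avoiding negative interference from dissimilar ones.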

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

  • paper_url: http://arxiv.org/abs/2310.15262
  • repo_url: None
  • paper_authors: Injy Hamed, Nizar Habash, Ngoc Thang Vu
  • for: To compare the effectiveness of data augmentation approaches for generating code-switched (CSW) text.
  • methods: Three augmentation approaches are compared on Egyptian Arabic-English CSW: lexical replacements, linguistic theories, and back-translation (BT), evaluated on machine translation and on augmentation quality via human evaluation.
  • results: BT and CSW predictive-based lexical replacement, both trained on CSW parallel data, perform best on both tasks; linguistic theories and random lexical replacement prove effective in the absence of CSW parallel data, achieving similar results.
    Abstract Code-switching (CSW) text generation has been receiving increasing attention as a solution to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the lack of CSW parallel data, where both approaches achieve similar results.
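Random lexical replacement, the simplest of the compared approaches, swaps words for dictionary translations with some probability to synthesize code-switched text. A sketch (the lexicon, direction, and replacement rate below are illustrative, not the paper's setup):

```python
import random

def random_lexical_csw(sentence, lexicon, p=0.3, seed=0):
    """Synthesize code-switched text by randomly replacing words that have
    an entry in a bilingual lexicon with their translations."""
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        if token.lower() in lexicon and rng.random() < p:
            out.append(lexicon[token.lower()])
        else:
            out.append(token)
    return " ".join(out)

# Toy English -> Egyptian Arabic lexicon (illustrative entries).
lexicon = {"meeting": "اجتماع", "tomorrow": "بكرة"}
augmented = random_lexical_csw("the meeting is tomorrow", lexicon, p=1.0)
print(augmented)  # "the اجتماع is بكرة"
```

The predictive-based variant in the paper instead learns from CSW parallel data where switches actually occur, which is what gives it the edge when such data is available.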

Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention

  • paper_url: http://arxiv.org/abs/2310.15258
  • repo_url: https://github.com/negar-foroutan/multilingual-code-switched-reasoning
  • paper_authors: Negar Foroutan, Mohammadreza Banaei, Karl Aberer, Antoine Bosselut
  • for: This work studies whether multilingual language models (MultiLMs) can transfer logical reasoning abilities across languages when fine-tuned for reasoning in a different language.
  • methods: Cross-lingual reasoning is evaluated in two schemes: (1) the context and question share the same language in the new languages tested (reasoning remains monolingual, but the model must transfer the learned reasoning ability across languages), and (2) the context and question are in different languages (termed code-switched reasoning).
  • results: On the RuleTaker and LeapOfThought datasets, MultiLMs transfer reasoning ability in the monolingual setting but struggle in the code-switched setting. The authors therefore propose a novel attention mechanism that uses a dedicated set of parameters to encourage cross-lingual attention in code-switched sequences, improving reasoning performance by up to 14% and 4% on RuleTaker and LeapOfThought, respectively.
    Abstract In this work, we study whether multilingual language models (MultiLMs) can transfer logical reasoning abilities to other languages when they are fine-tuned for reasoning in a different language. We evaluate the cross-lingual reasoning abilities of MultiLMs in two schemes: (1) where the language of the context and the question remain the same in the new languages that are tested (i.e., the reasoning is still monolingual, but the model must transfer the learned reasoning ability across languages), and (2) where the language of the context and the question is different (which we term code-switched reasoning). On two logical reasoning datasets, RuleTaker and LeapOfThought, we demonstrate that although MultiLMs can transfer reasoning ability across languages in a monolingual setting, they struggle to transfer reasoning abilities in a code-switched setting. Following this observation, we propose a novel attention mechanism that uses a dedicated set of parameters to encourage cross-lingual attention in code-switched sequences, which improves the reasoning performance by up to 14% and 4% on the RuleTaker and LeapOfThought datasets, respectively.

Large Language Models are Visual Reasoning Coordinators

  • paper_url: http://arxiv.org/abs/2310.15166
  • repo_url: https://github.com/cliangyu/cola
  • paper_authors: Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu
  • for: To propose a new paradigm that coordinates multiple vision-language models (VLMs) so their distinct and complementary capabilities can be combined for visual reasoning.
  • methods: A large language model (LLM) coordinates multiple VLMs by facilitating natural language communication that leverages their complementary capabilities.
  • results: The instruction-tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside-knowledge VQA, visual entailment, and visual spatial reasoning; the in-context learning variant, Cola-Zero, achieves competitive performance in zero- and few-shot settings without finetuning.
    Abstract Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desired higher-order communications. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a large language model (LLM) can efficiently coordinate multiple VLMs by facilitating natural language communication that leverages their distinct and complementary capabilities. Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities.

Function Vectors in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.15213
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau
  • for: This paper reports a simple neural mechanism inside autoregressive transformer language models that represents an input-output function as a vector.
  • methods: The authors use causal mediation analysis across a diverse range of in-context learning (ICL) tasks to locate and study this mechanism.
  • results: The mechanism shows strong causal effects across tasks, models, and layers, and triggers task execution even in contexts unlike those from which it was collected, including zero-shot and natural-text settings. The study also finds that function vectors can, to some extent, be summed (semantic vector composition) to create vectors that trigger new complex tasks.
    Abstract We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number of attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble the ICL contexts from which they are collected. We test FVs across a range of tasks, models, and layers and find strong causal effects across settings in middle layers. We investigate the internal structure of FVs and find that while they often contain information that encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Finally, we test semantic vector composition in FVs, and find that to some extent they can be summed to create vectors that trigger new complex tasks. Taken together, our findings suggest that LLMs contain internal abstractions of general-purpose functions that can be invoked in a variety of contexts.

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.15147
  • repo_url: https://github.com/lfy79001/sqleval
  • paper_authors: Fangyu Lei, Qian Liu, Yiming Huang, Shizhu He, Jun Zhao, Kang Liu
  • for: Evaluating the capabilities of large language models (LLMs), especially long-context understanding.
  • methods: Proposes using complex synthetic tasks as a proxy evaluation method, and presents the S3Eval evaluation suite.
  • results: S3Eval performance correlates strongly with real-world benchmarks like Big-Bench Hard (BBH), and in-depth analysis uncovers additional characteristics of model performance.
    Abstract The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like reasoning and long-context understanding. However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 100K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLMs evaluation. As a synthetic benchmark, S3Eval enables the creation of any number of evaluation examples that are theoretically invisible to LLMs, mitigating the test set contamination issue. The synthetic nature of S3Eval provides users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval performance and scores of real-world benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval for evaluation of LLMs. The in-depth analysis also uncovers additional insights, including performance drop when the answer is sparsely distributed or located in the middle context, as well as some counter-intuitive trends of model performance.
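The soundness check reported here, that synthetic-suite scores track a real benchmark, reduces to a correlation between two per-model score lists. A minimal sketch with made-up scores (the numbers below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical accuracies of six models on the synthetic suite and on BBH.
s3eval_scores = np.array([0.31, 0.44, 0.52, 0.61, 0.73, 0.80])
bbh_scores    = np.array([0.28, 0.40, 0.49, 0.58, 0.70, 0.77])

# Pearson correlation between the two score vectors.
r = np.corrcoef(s3eval_scores, bbh_scores)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to 1 for these monotone toy scores
```

A high correlation across models is what licenses using the cheap synthetic suite as a proxy for the expensive real benchmark.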

SpecTr: Fast Speculative Decoding via Optimal Transport

  • paper_url: http://arxiv.org/abs/2310.15141
  • repo_url: None
  • paper_authors: Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu
  • for: The paper provides a principled understanding of speculative decoding through the lens of optimal transport, and develops a new autoregressive sampling algorithm, SpecTr, that speeds up decoding while preserving output quality.
  • methods: The paper frames speculative decoding as optimal transport with membership cost and generalizes it to a set of k candidates at the token level. The optimal draft selection (transport plan) can be computed via linear programming, whose best-known runtime is exponential in k, so the authors propose an approximate selection algorithm whose acceptance probability is (1-1/e)-optimal and which runs in time almost linear in the size of a single token's domain.
  • results: On standard benchmarks with state-of-the-art large language models, SpecTr achieves a wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding, with no quality degradation in the decoded output.
    Abstract Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time making it slow, and even prohibitive in certain tasks. One way to speed up sampling is speculative decoding: use a small model to sample a draft (block or sequence of tokens), and then score all tokens in the draft by the large language model in parallel. A subset of the tokens in the draft are accepted (and the rest rejected) based on a statistical method to guarantee that the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with membership cost. This framework can be viewed as an extension of the well-known maximal-coupling problem. This new formulation enables us to generalize the speculative decoding method to allow for a set of k candidates at the token-level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in k. We then propose a valid draft selection algorithm whose acceptance probability is (1-1/e)-optimal multiplicatively. Moreover, it can be computed in time almost linear with size of domain of a single token. Using this new draft selection algorithm, we develop a new autoregressive sampling algorithm called SpecTr, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
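The accept/reject core of speculative decoding (the single-candidate baseline that SpecTr generalizes via optimal transport) fits in a few lines. This is a generic sketch with a toy 4-token vocabulary, not the paper's k-candidate SpecTr algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_large, q_small, draft_token):
    """Accept or reject one drafted token so that the final sample provably
    follows the large model's distribution p (standard speculative decoding)."""
    accept_prob = min(1.0, p_large[draft_token] / q_small[draft_token])
    if rng.random() < accept_prob:
        return draft_token
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = np.maximum(p_large - q_small, 0.0)
    return rng.choice(len(p_large), p=residual / residual.sum())

p = np.array([0.5, 0.2, 0.2, 0.1])      # large (target) model
q = np.array([0.25, 0.25, 0.25, 0.25])  # small (draft) model

counts = np.zeros(4)
for _ in range(50_000):
    counts[speculative_step(p, q, rng.choice(4, p=q))] += 1
print(np.round(counts / counts.sum(), 2))  # empirically matches p
```

The guarantee is that the accepted-or-resampled token is distributed exactly according to p regardless of the draft distribution q; SpecTr extends this coupling to k draft candidates through an optimal-transport formulation.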

Quantifying the Dialect Gap and its Correlates Across Languages

  • paper_url: http://arxiv.org/abs/2310.15135
  • repo_url: None
  • paper_authors: Anjali Kantharuban, Ivan Vulić, Anna Korhonen
  • for: This study evaluates how well state-of-the-art large language models (LLMs) handle regional dialects, and how the dialect gap correlates with economic, social, and linguistic factors.
  • methods: The study covers two high-use applications, machine translation and automatic speech recognition, across the regional dialects of several high- and low-resource languages, and analyzes the impact of training data, including dataset size and construction procedure.
  • results: A significant dialect gap exists across languages and regions, and it correlates with economic, social, and linguistic factors. The impact of training data is significant but not consistent across models or languages, so a one-size-fits-all approach cannot close the dialect gap.
    Abstract Historically, researchers and consumers have noticed a decrease in quality when applying NLP tools to minority variants of languages (i.e. Puerto Rican Spanish or Swiss German), but studies exploring this have been limited to a select few languages. Additionally, past studies have mainly been conducted in a monolingual context, so cross-linguistic trends have not been identified and tied to external factors. In this work, we conduct a comprehensive evaluation of the most influential, state-of-the-art large language models (LLMs) across two high-use applications, machine translation and automatic speech recognition, to assess their functionality on the regional dialects of several high- and low-resource languages. Additionally, we analyze how the regional dialect gap is correlated with economic, social, and linguistic factors. The impact of training data, including related factors like dataset size and its construction procedure, is shown to be significant but not consistent across models or languages, meaning a one-size-fits-all approach cannot be taken in solving the dialect gap. This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.

Location-Aware Visual Question Generation with Lightweight Models

  • paper_url: http://arxiv.org/abs/2310.15129
  • repo_url: None
  • paper_authors: Nicholas Collin Suwono, Justin Chih-Yao Chen, Tun Min Hung, Ting-Hao Kenneth Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun
  • for: This work introduces location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location.
  • methods: The authors present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions, and propose a lightweight model for the LocaVQG task that fits on an edge device such as a mobile phone.
  • results: The proposed method outperforms baselines on human evaluation (e.g., engagement, grounding, coherence) and automatic metrics (e.g., BERTScore, ROUGE-2); extensive ablation studies justify the proposed techniques.
    Abstract This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method which can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines regarding human evaluation (e.g., engagement, grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.

How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation

  • paper_url: http://arxiv.org/abs/2310.15114
  • repo_url: https://github.com/hlt-mt/fbk-fairseq
  • paper_authors: Marco Gaido, Dennis Fucci, Matteo Negri, Luisa Bentivogli
  • for: This paper aims to mitigate gender bias in speech translation (ST) models.
  • methods: The paper integrates externally provided metadata about the speaker's gender into a single "multi-gender" neural ST model, instead of maintaining separate gender-specific models.
  • results: A single multi-gender model trained from scratch outperforms gender-specialized models, with gender accuracy gains of up to 12.9 for feminine forms, while fine-tuning from existing ST models does not yield competitive results.
    Abstract When translating from notional gender languages (e.g., English) into grammatical gender languages (e.g., Italian), the generated translation requires explicit gender assignments for various words, including those referring to the speaker. When the source sentence does not convey the speaker's gender, speech translation (ST) models either rely on the possibly-misleading vocal traits of the speaker or default to the masculine gender, the most frequent in existing training corpora. To avoid such biased and not inclusive behaviors, the gender assignment of speaker-related expressions should be guided by externally-provided metadata about the speaker's gender. While previous work has shown that the most effective solution is represented by separate, dedicated gender-specific models, the goal of this paper is to achieve the same results by integrating the speaker's gender metadata into a single "multi-gender" neural ST model, easier to maintain. Our experiments demonstrate that a single multi-gender model outperforms gender-specialized ones when trained from scratch (with gender accuracy gains up to 12.9 for feminine forms), while fine-tuning from existing ST models does not lead to competitive results.

Counting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

  • paper_url: http://arxiv.org/abs/2310.15113
  • repo_url: https://github.com/dmort27/chatgpts-wugs
  • paper_authors: Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen
  • for: This study examines whether the latest generation of large language models (specifically ChatGPT) possesses human-like morphological capabilities.
  • methods: The researchers apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets in four typologically varied languages (English, German, Tamil, and Turkish).
  • results: ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, the results suggest that claims of human-like language skills in ChatGPT are premature and misleading.
    Abstract Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results -- through the lens of morphology -- cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

GRENADE: Graph-Centric Language Model for Self-Supervised Representation Learning on Text-Attributed Graphs

  • paper_url: http://arxiv.org/abs/2310.15109
  • repo_url: https://github.com/bigheiniu/GRENADE
  • paper_authors: Yichuan Li, Kaize Ding, Kyumin Lee
  • for: This work proposes a self-supervised representation learning method for text-attributed graphs, aiming to create expressive and generalizable representations for various downstream tasks.
  • methods: GRENADE combines a pre-trained language model with a graph neural network, optimized with two specialized self-supervised learning algorithms: graph-centric contrastive learning and graph-centric knowledge alignment. These help GRENADE capture textual semantics as well as structural context information on text-attributed graphs.
  • results: Extensive experiments show that GRENADE outperforms state-of-the-art methods. The implementation is available at https://github.com/bigheiniu/GRENADE.
    Abstract Self-supervised representation learning on text-attributed graphs, which aims to create expressive and generalizable representations for various downstream tasks, has received increasing research attention lately. However, existing methods either struggle to capture the full extent of structural context information or rely on task-specific training labels, which largely hampers their effectiveness and generalizability in practice. To solve the problem of self-supervised representation learning on text-attributed graphs, we develop a novel Graph-Centric Language model -- GRENADE. Specifically, GRENADE exploits the synergistic effect of both pre-trained language model and graph neural network by optimizing with two specialized self-supervised learning algorithms: graph-centric contrastive learning and graph-centric knowledge alignment. The proposed graph-centric self-supervised learning algorithms effectively help GRENADE to capture informative textual semantics as well as structural context information on text-attributed graphs. Through extensive experiments, GRENADE shows its superiority over state-of-the-art methods. Implementation is available at \url{https://github.com/bigheiniu/GRENADE}.
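Graph-centric contrastive learning is typically built on an InfoNCE-style objective that pulls a node's two views (e.g., text encoding and graph encoding) together while pushing apart the other nodes in the batch. The sketch below is a generic InfoNCE in NumPy, not GRENADE's exact loss:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # cross-entropy with diagonal targets

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))
loss_random  = info_nce(z, rng.normal(size=(8, 16)))
print(loss_aligned, loss_random)
```

With near-identical anchor/positive pairs the loss is close to zero, while random pairings hover around log(batch size), which is the signal the encoder is trained to exploit.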

LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis

  • paper_url: http://arxiv.org/abs/2310.15100
  • repo_url: https://github.com/sjdai/llm-thematic-analysis
  • paper_authors: Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku
  • for: This study proposes a human-LLM collaboration framework for conducting thematic analysis (TA) with in-context learning (ICL).
  • methods: The framework puts a large language model (e.g., GPT-3.5) in the loop with human coders to generate the final codebook for TA.
  • results: Two case studies show that the framework yields coding quality similar to that of human coders while reducing TA's labor and time demands.
    Abstract Thematic analysis (TA) has been widely used for analyzing qualitative data in many disciplines and fields. To ensure reliable analysis, the same piece of data is typically assigned to at least two human coders. Moreover, to produce meaningful and useful analysis, human coders develop and deepen their data interpretation and coding over multiple iterations, making TA labor-intensive and time-consuming. Recently the emerging field of large language models (LLMs) research has shown that LLMs have the potential replicate human-like behavior in various tasks: in particular, LLMs outperform crowd workers on text-annotation tasks, suggesting an opportunity to leverage LLMs on TA. We propose a human-LLM collaboration framework (i.e., LLM-in-the-loop) to conduct TA with in-context learning (ICL). This framework provides the prompt to frame discussions with a LLM (e.g., GPT-3.5) to generate the final codebook for TA. We demonstrate the utility of this framework using survey datasets on the aspects of the music listening experience and the usage of a password manager. Results of the two case studies show that the proposed framework yields similar coding quality to that of human coders but reduces TA's labor and time demands.

Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization

  • paper_url: http://arxiv.org/abs/2310.15080
  • repo_url: https://github.com/llm-eff/fedpeptao
  • paper_authors: Tianshi Che, Ji Liu, Yang Zhou, Jiaxiang Ren, Jiwen Zhou, Victor S. Sheng, Huaiyu Dai, Dejing Dou
  • for: This paper proposes a parameter-efficient prompt tuning approach to enable federated training of large language models (LLMs).
  • methods: The proposed method, FedPepTAO, combines an efficient partial prompt tuning approach with an adaptive optimization method that addresses client drift on both the device and server sides.
  • results: Experiments on 10 datasets show that FedPepTAO outperforms 9 baselines by up to 60.8% in accuracy and 97.59% in training-time efficiency.
    Abstract Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data. However, the training process of Large Language Models (LLMs) generally incurs the update of significant parameters, which limits the applicability of FL techniques to tackle the LLMs in real scenarios. Prompt tuning can significantly reduce the number of parameters to update, but it either incurs performance degradation or low training efficiency. The straightforward utilization of prompt tuning in the FL often raises non-trivial communication costs and dramatically degrades performance. In addition, the decentralized data is generally non-Independent and Identically Distributed (non-IID), which brings client drift problems and thus poor performance. This paper proposes a Parameter-efficient prompt Tuning approach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and effective FL of LLMs. First, an efficient partial prompt tuning approach is proposed to improve performance and efficiency simultaneously. Second, a novel adaptive optimization method is developed to address the client drift problems on both the device and server sides to enhance performance further. Extensive experiments based on 10 datasets demonstrate the superb performance (up to 60.8\% in terms of accuracy) and efficiency (up to 97.59\% in terms of training time) of FedPepTAO compared with 9 baseline approaches. Our code is available at https://github.com/llm-eff/FedPepTAO.
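Federated prompt tuning keeps the LLM backbone frozen and communicates only small prompt tensors between clients and server. A minimal FedAvg-style sketch over hypothetical per-client soft prompts (plain weighted averaging only, without the paper's partial-tuning or adaptive-optimization components; `local_update` is a stand-in for real gradient steps):

```python
import numpy as np

PROMPT_LEN, DIM, N_CLIENTS = 20, 64, 4
rng = np.random.default_rng(0)

# Server broadcasts a global soft prompt; each client tunes a local copy.
global_prompt = rng.normal(size=(PROMPT_LEN, DIM))

def local_update(prompt):
    # Stand-in for local gradient steps on the client's private data.
    return prompt + 0.1 * rng.normal(size=prompt.shape)

def fedavg_round(global_prompt, n_examples):
    updates = [local_update(global_prompt.copy()) for _ in range(N_CLIENTS)]
    weights = np.asarray(n_examples, dtype=float) / sum(n_examples)
    # Weighted average of the client prompts -- the only tensors communicated.
    return np.tensordot(weights, np.stack(updates), axes=1)

new_prompt = fedavg_round(global_prompt, n_examples=[100, 200, 50, 150])
print(new_prompt.shape)  # (20, 64)
```

Communicating a 20x64 prompt instead of billions of backbone parameters is what makes federated tuning of LLMs tractable; the paper's contribution lies in which prompt layers to tune and how to optimize them under client drift.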

Affective and Dynamic Beam Search for Story Generation

  • paper_url: http://arxiv.org/abs/2310.15079
  • repo_url: https://github.com/tenghaohuang/affgen
  • paper_authors: Tenghao Huang, Ehsan Qasemi, Bangzheng Li, He Wang, Faeze Brahman, Muhao Chen, Snigdha Chaturvedi
  • for: This paper proposes a model for generating interesting stories, with applications in entertainment, education, therapy, and cognitive studies.
  • methods: The model, AffGen, employs two novel techniques: Dynamic Beam Sizing and Affective Reranking. Dynamic Beam Sizing encourages less predictable, more captivating word choices using a contextual multi-armed bandit model, while Affective Reranking prioritizes sentence candidates based on affect intensity.
  • results: Automatic and human evaluations show that AffGen outperforms existing baselines in generating affectively charged and interesting narratives; an ablation study and analysis provide insights into its strengths and weaknesses.
    Abstract Storytelling's captivating potential makes it a fascinating research area, with implications for entertainment, education, therapy, and cognitive studies. In this paper, we propose Affective Story Generator (AffGen) for generating interesting narratives. AffGen introduces "intriguing twists" in narratives by employing two novel techniques-Dynamic Beam Sizing and Affective Reranking. Dynamic Beam Sizing encourages less predictable, more captivating word choices using a contextual multi-arm bandit model. Affective Reranking prioritizes sentence candidates based on affect intensity. Our empirical evaluations, both automatic and human, demonstrate AffGen's superior performance over existing baselines in generating affectively charged and interesting narratives. Our ablation study and analysis provide insights into the strengths and weaknesses of AffGen.
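Affective reranking, as described, orders candidate sentences by affect intensity. A toy sketch using a hypothetical affect lexicon (the lexicon, scoring rule, and sentences here are made up for illustration; AffGen's actual intensity model differs):

```python
# Hypothetical word -> affect-intensity lexicon (illustrative values only).
AFFECT = {"terrified": 0.95, "thrilled": 0.9, "strange": 0.5,
          "walked": 0.1, "door": 0.05}

def affect_intensity(sentence):
    """Score a sentence by the most intense affect word it contains."""
    scores = [AFFECT.get(w.strip(".,!"), 0.0) for w in sentence.lower().split()]
    return max(scores, default=0.0)

def affective_rerank(candidates):
    """Order beam candidates by descending affect intensity."""
    return sorted(candidates, key=affect_intensity, reverse=True)

beam = ["She walked to the door.",
        "She was terrified by the door.",
        "Something strange happened."]
print(affective_rerank(beam)[0])  # "She was terrified by the door."
```

The reranker slots in after beam search: the generator proposes candidates, and the affect score decides which continuation the story keeps.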

‘Don’t Get Too Technical with Me’: A Discourse Structure-Based Framework for Science Journalism

  • paper_url: http://arxiv.org/abs/2310.15077
  • repo_url: https://github.com/ronaldahmed/scitechnews
  • paper_authors: Ronald Cardenas, Bingsheng Yao, Dakuo Wang, Yufang Hou
  • for: This paper aims to support automatic science journalism by constructing a real-world dataset (SciTechNews) and proposing a novel technical framework, so that reports on technical findings are accurate, concise, and accessible.
  • methods: The framework integrates a paper's discourse structure with its metadata to guide generation; the paper compares against baseline methods such as Alpaca and ChatGPT.
  • results: Extensive automatic and human experiments show that the framework outperforms baselines (e.g., Alpaca and ChatGPT) in elaborating a content plan meaningful for the target audience, simplifying the selected information, and producing a coherent final report in a layman's style.
    Abstract Science journalism refers to the task of reporting technical findings of a scientific paper as a less technical news article to the general public audience. We aim to design an automated system to support this real-world task (i.e., automatic science journalism) by 1) introducing a newly-constructed and real-world dataset (SciTechNews), with tuples of a publicly-available scientific paper, its corresponding news article, and an expert-written short summary snippet; 2) proposing a novel technical framework that integrates a paper's discourse structure with its metadata to guide generation; and, 3) demonstrating with extensive automatic and human experiments that our framework outperforms other baseline methods (e.g. Alpaca and ChatGPT) in elaborating a content plan meaningful for the target audience, simplifying the information selected, and producing a coherent final report in a layman's style.

TableQAKit: A Comprehensive and Practical Toolkit for Table-based Question Answering

  • paper_url: http://arxiv.org/abs/2310.15075
  • repo_url: None
  • paper_authors: Fangyu Lei, Tongxu Luo, Pengqi Yang, Weihao Liu, Hanwen Liu, Jiahe Lei, Yiming Huang, Yifan Wei, Shizhu He, Jun Zhao, Kang Liu
  • for: This paper is written for researchers and developers working on table-based question answering (TableQA) tasks, as well as those interested in natural language processing and machine learning.
  • methods: The paper introduces TableQAKit, an open-source toolkit that provides a unified platform for TableQA, including plentiful datasets and popular methods for this task, as well as large language models (LLMs).
  • results: The paper reports that using the modules in TableQAKit achieves new state-of-the-art (SOTA) results on some datasets, and provides an LLM-based TableQA Benchmark for evaluating the role of LLMs in TableQA.
    Abstract Table-based question answering (TableQA) is an important task in natural language processing, which requires comprehending tables and employing various reasoning ways to answer the questions. This paper introduces TableQAKit, the first comprehensive toolkit designed specifically for TableQA. The toolkit provides a unified platform that includes plentiful TableQA datasets and integrates popular methods of this task as well as large language models (LLMs). Users can add their own datasets and methods through the friendly interface. Notably, using the modules in this toolkit achieves new SOTA results on some datasets. Finally, TableQAKit also provides an LLM-based TableQA Benchmark for evaluating the role of LLMs in TableQA. TableQAKit is open-source with an interactive interface that includes visual operations, and comprehensive data for ease of use.

Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge

  • paper_url: http://arxiv.org/abs/2310.15066
  • repo_url: https://github.com/pluslabnlp/envision
  • paper_authors: Te-Lin Wu, Yu Zhou, Nanyun Peng
  • for: This paper aims to improve phrase grounding models' ability to localize active objects from egocentric vision, so that AI agents can better accomplish tasks or assist humans.
  • methods: The proposed solution combines the textual and visual modalities by (1) learning the role of objects undergoing change and extracting them accurately from task instructions, (2) leveraging pre- and post-conditions of objects during actions, and (3) recognizing objects more robustly with descriptional knowledge. Large language models (LLMs) are used to extract this action-object knowledge.
  • results: Extensive experiments on the Ego4D and Epic-Kitchens datasets show >54% improvements in all standard metrics on the TREK-150-OPE-Det localization + tracking task, >7% improvements in all standard metrics on the TREK-150-OPE tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD task.
    Abstract The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually. One important step towards this goal is to localize and track key active objects that undergo major state change as a consequence of human actions/interactions to the environment without being told exactly what/where to ground (e.g., localizing and tracking the `sponge` in video from the instruction "Dip the `sponge` into the bucket."). While existing works approach this problem from a pure vision perspective, we investigate to which extent the textual modality (i.e., task instructions) and their interaction with visual modality can be beneficial. Specifically, we propose to improve phrase grounding models' ability on localizing the active objects by: (1) learning the role of `objects undergoing change` and extracting them accurately from the instructions, (2) leveraging pre- and post-conditions of the objects during actions, and (3) recognizing the objects more robustly with descriptional knowledge. We leverage large language models (LLMs) to extract the aforementioned action-object knowledge, and design a per-object aggregation masking technique to effectively perform joint inference on object phrases and symbolic knowledge. We evaluate our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments demonstrate the effectiveness of our proposed framework, which leads to>54% improvements in all standard metrics on the TREK-150-OPE-Det localization + tracking task, >7% improvements in all standard metrics on the TREK-150-OPE tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD task.

SLOG: A Structural Generalization Benchmark for Semantic Parsing

  • paper_url: http://arxiv.org/abs/2310.15040
  • repo_url: https://github.com/bingzhilee/slog
  • paper_authors: Bingzhi Li, Lucia Donatelli, Alexander Koller, Tal Linzen, Yuekun Yao, Najoung Kim
  • for: To evaluate how well language models generalize to new complex linguistic expressions
  • methods: Extends COGS (Kim and Linzen, 2020) with 17 structural generalization cases for evaluation
  • results: Transformer models, including pretrained ones, reach only 40.6% generalization accuracy, and a structure-aware parser only 70.8%, far below the near-perfect accuracy existing models achieve on COGS, showing that SLOG foregrounds models' weaknesses in structural generalization
    Abstract The goal of compositional generalization benchmarks is to evaluate how well models generalize to new complex linguistic expressions. Existing benchmarks often focus on lexical generalization, the interpretation of novel lexical items in syntactic structures familiar from training; structural generalization tasks, where a model needs to interpret syntactic structures that are themselves unfamiliar from training, are often underrepresented, resulting in overly optimistic perceptions of how well models can generalize. We introduce SLOG, a semantic parsing dataset that extends COGS (Kim and Linzen, 2020) with 17 structural generalization cases. In our experiments, the generalization accuracy of Transformer models, including pretrained ones, only reaches 40.6%, while a structure-aware parser only achieves 70.8%. These results are far from the near-perfect accuracy existing models achieve on COGS, demonstrating the role of SLOG in foregrounding the large discrepancy between models' lexical and structural generalization capacities.

Statistical Depth for Ranking and Characterizing Transformer-Based Text Embeddings

  • paper_url: http://arxiv.org/abs/2310.15010
  • repo_url: https://github.com/pkseeg/tte_depth
  • paper_authors: Parker Seegmiller, Sarah Masud Preum
  • for: Proposes a statistical depth for measuring the centrality of texts within a corpus of high-dimensional text representations
  • methods: Introduces transformer-based text embedding (TTE) depth and applies it for modeling and distributional inference in NLP pipelines
  • results: Using TTE depth for in-context learning prompt selection reliably improves performance across six text classification tasks; TTE depth and its associated rank sum test further show that five recent synthetic data augmentation processes cause a measurable distributional shift away from the associated human-generated text
    Abstract The popularity of transformer-based text embeddings calls for better statistical tools for measuring distributions of such embeddings. One such tool would be a method for ranking texts within a corpus by centrality, i.e. assigning each text a number signifying how representative that text is of the corpus as a whole. However, an intrinsic center-outward ordering of high-dimensional text representations is not trivial. A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution. We adopt a statistical depth to measure distributions of transformer-based text embeddings, transformer-based text embedding (TTE) depth, and introduce the practical use of this depth for both modeling and distributional inference in NLP pipelines. We first define TTE depth and an associated rank sum test for determining whether two corpora differ significantly in embedding space. We then use TTE depth for the task of in-context learning prompt selection, showing that this approach reliably improves performance over statistical baseline approaches across six text classification tasks. Finally, we use TTE depth and the associated rank sum test to characterize the distributions of synthesized and human-generated corpora, showing that five recent synthetic data augmentation processes cause a measurable distributional shift away from associated human-generated text.
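The abstract does not spell out the depth function, but the center-outward ranking it describes can be illustrated with a simple centrality proxy. The sketch below ranks each embedding by its mean cosine similarity to the rest of the corpus; this proxy is an assumption for illustration, not the paper's exact TTE depth definition.

```python
import numpy as np

def centrality_rank(embeddings):
    """Rank texts center-outward: higher mean cosine similarity to the
    rest of the corpus = more central. `embeddings` is (n_texts, dim)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)       # exclude self-similarity
    depth = sims.sum(axis=1) / (len(sims) - 1)
    return np.argsort(-depth)         # indices, most central first

# Toy corpus: two similar vectors and one outlier.
emb = np.array([[1.0, 0.1], [1.0, 0.0], [-1.0, 0.5]])
order = centrality_rank(emb)
print(order)  # the outlier (index 2) ranks last
```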

Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.15007
  • repo_url: None
  • paper_authors: Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye
  • for: The paper is focused on the task of document-level membership inference for real-world large language models (LLMs), which involves inferring whether the LLM has seen a given document during training or not.
  • methods: The authors propose a practical, black-box method to predict document-level membership using commonly used data sources for training and the model release date. They also propose a procedure for the development and evaluation of document-level membership inference for LLMs.
  • results: The authors show that their methodology performs very well, reaching an impressive AUC of 0.856 for books and 0.678 for papers. They also show that their approach outperforms sentence-level membership inference attacks used in the privacy literature for the document-level membership task. Additionally, they find that smaller models like OpenLLaMA-3B are approximately as sensitive to their approach as larger models like OpenLLaMA-7B.
    Abstract With large language models (LLMs) poised to become embedded in our daily lives, questions are starting to be raised about the dataset(s) they learned from. These questions range from potential bias or misinformation LLMs could retain from their training data to questions of copyright and fair use of human-generated text. However, while these questions emerge, developers of the recent state-of-the-art LLMs become increasingly reluctant to disclose details on their training corpus. We here introduce the task of document-level membership inference for real-world LLMs, i.e. inferring whether the LLM has seen a given document during training or not. First, we propose a procedure for the development and evaluation of document-level membership inference for LLMs by leveraging commonly used data sources for training and the model release date. We then propose a practical, black-box method to predict document-level membership and instantiate it on OpenLLaMA-7B with both books and academic papers. We show our methodology to perform very well, reaching an impressive AUC of 0.856 for books and 0.678 for papers. We then show our approach to outperform the sentence-level membership inference attacks used in the privacy literature for the document-level membership task. We finally evaluate whether smaller models might be less sensitive to document-level inference and show OpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach. Taken together, our results show that accurate document-level membership can be inferred for LLMs, increasing the transparency of technology poised to change our lives.
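The paper's black-box method is not reproduced here, but the underlying intuition can be sketched with a deliberately tiny stand-in: documents seen in training tend to be assigned higher likelihood by the model, so a document-level average negative log-likelihood can separate members from non-members. A smoothed unigram model plays the role of the LLM in this toy.

```python
import math
from collections import Counter

def unigram_nll(doc, counts, total, vocab):
    """Average negative log-likelihood of `doc` under an add-one-smoothed
    unigram model. Toy proxy: member documents tend to score lower."""
    tokens = doc.lower().split()
    nll = 0.0
    for t in tokens:
        p = (counts[t] + 1) / (total + vocab)
        nll -= math.log(p)
    return nll / max(len(tokens), 1)

train = "the cat sat on the mat".split()
counts, total = Counter(train), len(train)
vocab = len(counts) + 1  # crude vocabulary size incl. an unknown slot

member = unigram_nll("the cat sat", counts, total, vocab)
non_member = unigram_nll("quantum flux capacitor", counts, total, vocab)
print(member < non_member)  # member text is more likely under the model
```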

When Language Models Fall in Love: Animacy Processing in Transformer Language Models

  • paper_url: http://arxiv.org/abs/2310.15004
  • repo_url: https://github.com/hannamw/lms-in-love
  • paper_authors: Michael Hanna, Yonatan Belinkov, Sandro Pezzelle
  • for: Investigates whether language models process animacy as humans do, for both typically and atypically animate entities
  • methods: Evaluates open-source language models on their handling of entities whose animacy is typical versus atypical
  • results: LMs behave much like humans for typically animate entities; for atypically animate entities they adapt, treating them as animate, though not as well as humans. Even when the context indicating atypical animacy is very short, LMs pick up on subtle clues and change their behavior
    Abstract Animacy - whether an entity is alive and sentient - is fundamental to cognitive processing, impacting areas such as memory, vision, and language. However, animacy is not always expressed directly in language: in English it often manifests indirectly, in the form of selectional constraints on verbs and adjectives. This poses a potential issue for transformer language models (LMs): they often train only on text, and thus lack access to extralinguistic information from which humans learn about animacy. We ask: how does this impact LMs' animacy processing - do they still behave as humans do? We answer this question using open-source LMs. Like previous studies, we find that LMs behave much like humans when presented with entities whose animacy is typical. However, we also show that even when presented with stories about atypically animate entities, such as a peanut in love, LMs adapt: they treat these entities as animate, though they do not adapt as well as humans. Even when the context indicating atypical animacy is very short, LMs pick up on subtle clues and change their behavior. We conclude that despite the limited signal through which LMs can learn about animacy, they are indeed sensitive to the relevant lexical semantic nuances available in English.

Simple Hardware-Efficient PCFGs with Independent Left and Right Productions

  • paper_url: http://arxiv.org/abs/2310.14997
  • repo_url: None
  • paper_authors: Wei Liu, Songlin Yang, Yoon Kim, Kewei Tu
  • for: To scale PCFGs more effectively and improve their performance as language models
  • methods: Introduces SimplePCFG, a PCFG formalism with independent left and right productions, together with FlashInside, a hardware IO-aware implementation of the inside algorithm
  • results: Despite its stronger independence assumption, the simple formalism scales better both as a language model and as an unsupervised parser, outperforming similarly-sized low-rank PCFGs
    Abstract Scaling dense PCFGs to thousands of nonterminals via a low-rank parameterization of the rule probability tensor has been shown to be beneficial for unsupervised parsing. However, PCFGs scaled this way still perform poorly as a language model, and even underperform similarly-sized HMMs. This work introduces \emph{SimplePCFG}, a simple PCFG formalism with independent left and right productions. Despite imposing a stronger independence assumption than the low-rank approach, we find that this formalism scales more effectively both as a language model and as an unsupervised parser. As an unsupervised parser, our simple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language model, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank PCFGs. We further introduce \emph{FlashInside}, a hardware IO-aware implementation of the inside algorithm for efficiently scaling simple PCFGs.
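The inside algorithm that FlashInside accelerates can be sketched in its textbook form for a toy CNF grammar; this is the classic dynamic program, not the paper's IO-aware implementation.

```python
from collections import defaultdict

def inside(words, lexical, binary, start="S"):
    """Inside probabilities for a CNF PCFG via CKY.
    lexical[(A, w)] = P(A -> w); binary[(A, B, C)] = P(A -> B C).
    Returns the probability that `start` derives `words`."""
    n = len(words)
    beta = defaultdict(float)  # beta[(i, j, A)]: prob A derives words[i..j]
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                beta[(i, i, A)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (A, B, C), p in binary.items():
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k + 1, j, C)]
    return beta[(0, n - 1, start)]

lexical = {("NP", "dogs"): 0.5, ("VP", "bark"): 1.0}
binary = {("S", "NP", "VP"): 1.0}
print(inside(["dogs", "bark"], lexical, binary))  # 0.5
```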

LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay

  • paper_url: http://arxiv.org/abs/2310.14985
  • repo_url: None
  • paper_authors: Yihuai Lan, Zhiqiang Hu, Lei Wang, Yang Wang, Deheng Ye, Peilin Zhao, Ee-Peng Lim, Hui Xiong, Hao Wang
  • for: To investigate the open problem of the social behaviors of LLM-based agents
  • methods: Adopts Avalon, a representative communication game, as the environment and uses system prompts to guide LLM agents through gameplay within a multi-agent framework
  • results: The framework adapts seamlessly to Avalon gameplay and produces adaptive, intelligent agents, highlighting the potential of LLM-based agents for interaction in dynamic social environments
    Abstract This paper aims to investigate the open research problem of uncovering the social behaviors of LLM-based agents. To achieve this goal, we adopt Avalon, a representative communication game, as the environment and use system prompts to guide LLM agents to play the game. While previous studies have conducted preliminary investigations into gameplay with LLM agents, there lacks research on their social behaviors. In this paper, we present a novel framework designed to seamlessly adapt to Avalon gameplay. The core of our proposed framework is a multi-agent system that enables efficient communication and interaction among agents. We evaluate the performance of our framework based on metrics from two perspectives: winning the game and analyzing the social behaviors of LLM agents. Our results demonstrate the effectiveness of our framework in generating adaptive and intelligent agents and highlight the potential of LLM-based agents in addressing the challenges associated with dynamic social environment interaction. By analyzing the social behaviors of LLM agents from the aspects of both collaboration and confrontation, we provide insights into the research and applications of this domain.

Fidelity-Enriched Contrastive Search: Reconciling the Faithfulness-Diversity Trade-Off in Text Generation

  • paper_url: http://arxiv.org/abs/2310.14981
  • repo_url: https://github.com/ntunlplab/fecs
  • paper_authors: Wei-Lin Chen, Cheng-Kuang Wu, Hsin-Hsi Chen, Chung-Chi Chen
  • for: addresses the hallucination problem in natural language generation tasks
  • methods: proposes Fidelity-Enriched Contrastive Search (FECS), which augments the contrastive search framework with context-aware regularization terms
  • results: consistently enhances faithfulness while maintaining output diversity comparable to well-performing decoding algorithms
    Abstract In this paper, we address the hallucination problem commonly found in natural language generation tasks. Language models often generate fluent and convincing content but can lack consistency with the provided source, resulting in potential inaccuracies. We propose a new decoding method called Fidelity-Enriched Contrastive Search (FECS), which augments the contrastive search framework with context-aware regularization terms. FECS promotes tokens that are semantically similar to the provided source while penalizing repetitiveness in the generated text. We demonstrate its effectiveness across two tasks prone to hallucination: abstractive summarization and dialogue generation. Results show that FECS consistently enhances faithfulness across various language model sizes while maintaining output diversity comparable to well-performing decoding algorithms.
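A FECS-style decoding step can be sketched as a per-candidate score combining model confidence, a degeneration (repetition) penalty against the generated context, and a faithfulness reward toward the source. The exact combination and weights below are assumptions for illustration, not the paper's formula.

```python
import numpy as np

def fecs_score(cand_probs, cand_vecs, ctx_vecs, src_vecs, alpha=0.4, beta=0.3):
    """Pick the candidate maximizing: (1-alpha)*p  -  alpha*max-sim-to-context
    (repetition penalty)  +  beta*max-sim-to-source (faithfulness reward)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    scores = []
    for p, v in zip(cand_probs, cand_vecs):
        degen = max(cos(v, c) for c in ctx_vecs)   # penalize repeating context
        faith = max(cos(v, s) for s in src_vecs)   # reward source similarity
        scores.append((1 - alpha) * p - alpha * degen + beta * faith)
    return int(np.argmax(scores))

# Two candidates: one repeats the context, one matches the source.
ctx = [np.array([1.0, 0.0])]
src = [np.array([0.0, 1.0])]
cands = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
best = fecs_score([0.5, 0.4], cands, ctx, src)
print(best)  # the source-faithful candidate wins despite lower probability
```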

Penalty Decoding: Well Suppress the Self-Reinforcement Effect in Open-Ended Text Generation

  • paper_url: http://arxiv.org/abs/2310.14971
  • repo_url: https://github.com/zwhong714/penalty_decoding
  • paper_authors: Wenhong Zhu, Hongkun Hao, Rui Wang
  • for: Investigates the self-reinforcement effect in open-ended text generation and the effectiveness of a repetition penalty in mitigating it
  • methods: Proposes a forgetting mechanism that disregards distant tokens, easing penalty selection, and a length penalty to counter the overly short sentences caused by excessive penalties
  • results: Experiments show the penalty decoding approach generates high-quality sentences resembling human output
    Abstract The decoding algorithm is critical for open-ended text generation, transforming latent representations into coherent and meaningful outputs. This paper investigates the self-reinforcement effect in text generation and the effectiveness of a repetition penalty to mitigate it. However, determining the optimal repetition penalty value is challenging. To tackle this, we propose a forgetting mechanism that disregards distant tokens, reducing the burden of penalty selection. In addition, we introduce a length penalty to address overly short sentences caused by excessive penalties. Our penalty decoding approach incorporating three strategies helps resolve issues with sampling methods deviating from factual information. Experimental results demonstrate the efficacy of our approach in generating high-quality sentences resembling human output.
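The strategies described above can be sketched as a logit post-processing step: a repetition penalty applied only to tokens inside a recent window (the forgetting mechanism) plus EOS suppression before a minimum length, standing in for the length penalty. The penalty form and all values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def penalized_logits(logits, generated, penalty=1.2, window=32,
                     eos_id=0, min_len=5):
    """Apply a windowed repetition penalty and a crude length penalty."""
    out = logits.astype(float)                 # astype returns a copy
    for t in set(generated[-window:]):         # forget tokens beyond the window
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    if len(generated) < min_len:               # discourage too-short outputs
        out[eos_id] = -np.inf
    return out

logits = np.array([2.0, 1.5, 1.0, 0.5])
adjusted = penalized_logits(logits, generated=[1, 1, 2])
print(adjusted)  # tokens 1 and 2 penalized, EOS (id 0) blocked early
```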

Towards LLM-driven Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2310.14970
  • repo_url: https://github.com/woodscene/ldst
  • paper_authors: Yujie Feng, Zexin Lu, Bo Liu, Liming Zhan, Xiao-Ming Wu
  • for: To assess ChatGPT's capabilities for dialogue state tracking (DST) and to address its limitations
  • methods: Evaluates ChatGPT on DST and proposes LDST, an LLM-driven DST framework based on smaller open-source foundation models, using a novel domain-slot instruction tuning method to improve performance
  • results: LDST achieves performance on par with ChatGPT and shows remarkable improvements over the previous SOTA methods in both zero-shot and few-shot settings
    Abstract Dialogue State Tracking (DST) is of paramount importance in ensuring accurate tracking of user goals and system actions within task-oriented dialogue systems. The emergence of large language models (LLMs) such as GPT3 and ChatGPT has sparked considerable interest in assessing their efficacy across diverse applications. In this study, we conduct an initial examination of ChatGPT's capabilities in DST. Our evaluation uncovers the exceptional performance of ChatGPT in this task, offering valuable insights to researchers regarding its capabilities and providing useful directions for designing and enhancing dialogue systems. Despite its impressive performance, ChatGPT has significant limitations including its closed-source nature, request restrictions, raising data privacy concerns, and lacking local deployment capabilities. To address these concerns, we present LDST, an LLM-driven DST framework based on smaller, open-source foundation models. By utilizing a novel domain-slot instruction tuning method, LDST achieves performance on par with ChatGPT. Comprehensive evaluations across three distinct experimental settings, we find that LDST exhibits remarkable performance improvements in both zero-shot and few-shot setting compared to previous SOTA methods. The source code is provided for reproducibility.
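A domain-slot instruction for DST can be sketched as a simple prompt builder in the spirit of LDST's instruction tuning; the wording below is illustrative, not the paper's exact template.

```python
def dst_prompt(domain, slot, dialogue, candidates=None):
    """Build a domain-slot instruction asking the LLM for one slot value."""
    lines = [
        f"Track the dialogue state for domain '{domain}'.",
        f"Dialogue:\n{dialogue}",
        f"Question: what is the value of slot '{slot}'?",
    ]
    if candidates:                       # closed-vocabulary slots
        lines.append("Choose from: " + ", ".join(candidates))
    lines.append("Answer with the value only, or 'none' if unmentioned.")
    return "\n".join(lines)

print(dst_prompt("hotel", "price range",
                 "User: I need a cheap hotel in the north.",
                 ["cheap", "moderate", "expensive"]))
```

One prompt is issued per (domain, slot) pair, so the same template covers every schema without retraining.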

Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.14954
  • repo_url: https://github.com/scufan1990/key-frame-mechanism-for-efficient-conformer
  • paper_authors: Peng Fan, Changhao Shan, Sining Sun, Qing Yang, Jianwei Zhang
  • for: To improve the efficiency of Conformer-based end-to-end speech recognition by addressing the quadratic computational complexity of the self-attention mechanism
  • methods: Uses Conformer blocks as the backbone network and introduces intermediate CTC outputs as guidance, along with a key frame-based self-attention (KFSA) mechanism and key frame-based downsampling (KFDS) to reduce computation
  • results: Achieves performance comparable to or better than vanilla Conformer and similar work such as Efficient Conformer, while discarding more than 60% of useless frames during training and inference, significantly accelerating inference
    Abstract Recently, Conformer as a backbone network for end-to-end automatic speech recognition achieved state-of-the-art performance. The Conformer block leverages a self-attention mechanism to capture global information, along with a convolutional neural network to capture local information, resulting in improved performance. However, the Conformer-based model encounters an issue with the self-attention mechanism, as computational complexity grows quadratically with the length of the input sequence. Inspired by previous Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder. We define the frame with non-blank output as key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames. The structure of our proposed approach comprises two encoders. Following the initial encoder, we introduce an intermediate CTC loss function to compute the label frame, enabling us to extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism to operate on high-dimensional acoustic features directly and drop the frames corresponding to blank labels, which results in new acoustic feature sequences as input to the second encoder. By using the proposed method, which achieves comparable or higher performance than vanilla Conformer and other similar work such as Efficient Conformer. Meantime, our proposed method can discard more than 60\% useless frames during model training and inference, which will accelerate the inference speed significantly. This work code is available in {https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer}
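The KFDS idea of dropping blank frames can be sketched in a few lines: keep only frames whose greedy CTC label is non-blank ("key frames"), yielding a shorter feature sequence for the second encoder. This is a minimal sketch of the mechanism, not the paper's full implementation.

```python
import numpy as np

def keep_key_frames(features, ctc_posteriors, blank_id=0):
    """Drop frames whose greedy CTC label is blank; return kept features
    and the boolean keep-mask."""
    labels = ctc_posteriors.argmax(axis=-1)
    mask = labels != blank_id
    return features[mask], mask

feats = np.arange(8).reshape(4, 2)   # 4 frames, feature dim 2
post = np.array([[0.9, 0.1],          # blank
                 [0.2, 0.8],          # non-blank -> key frame
                 [0.7, 0.3],          # blank
                 [0.1, 0.9]])         # non-blank -> key frame
kept, mask = keep_key_frames(feats, post)
print(kept.shape)  # (2, 2): half the frames dropped
```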

System Combination via Quality Estimation for Grammatical Error Correction

  • paper_url: http://arxiv.org/abs/2310.14947
  • repo_url: https://github.com/nusnlp/greco
  • paper_authors: Muhammad Reza Qorib, Hwee Tou Ng
  • for: To propose a new quality estimation model for grammatical error correction (GEC) that better assesses the quality of corrected sentences
  • methods: Introduces GRECO, a quality estimation model whose scores correlate more strongly with the F0.5 score of a corrected sentence, along with three system-combination methods of varying generality: model-agnostic, model-agnostic with voting bias, and model-dependent
  • results: The combined GEC system outperforms the state of the art on the CoNLL-2014 and BEA-2019 test sets, achieving the highest F0.5 scores published to date
    Abstract Quality estimation models have been developed to assess the corrections made by grammatical error correction (GEC) models when the reference or gold-standard corrections are not available. An ideal quality estimator can be utilized to combine the outputs of multiple GEC systems by choosing the best subset of edits from the union of all edits proposed by the GEC base systems. However, we found that existing GEC quality estimation models are not good enough in differentiating good corrections from bad ones, resulting in a low F0.5 score when used for system combination. In this paper, we propose GRECO, a new state-of-the-art quality estimation model that gives a better estimate of the quality of a corrected sentence, as indicated by having a higher correlation to the F0.5 score of a corrected sentence. It results in a combined GEC system with a higher F0.5 score. We also propose three methods for utilizing GEC quality estimation models for system combination with varying generality: model-agnostic, model-agnostic with voting bias, and model-dependent method. The combined GEC system outperforms the state of the art on the CoNLL-2014 test set and the BEA-2019 test set, achieving the highest F0.5 scores published to date.
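The model-agnostic combination idea can be sketched as follows: pool the edits proposed by all base GEC systems, then greedily keep an edit only if the quality estimator prefers the edited sentence. The `quality` scorer below is a trivial stand-in, not GRECO itself.

```python
def combine_edits(source, edit_sets, quality):
    """Greedy QE-driven system combination over the union of proposed edits.
    Each edit is an (old, new) string pair; `quality` scores a sentence."""
    pooled = sorted(set(e for edits in edit_sets for e in edits))
    current = source
    for old, new in pooled:
        candidate = current.replace(old, new, 1)
        if quality(candidate) > quality(current):  # keep only helpful edits
            current = candidate
    return current

# Toy scorer: penalize sentences containing known error patterns.
def quality(sent):
    return -sum(sent.count(err) for err in ("has ate", "a apples"))

src = "He has ate a apples ."
systems = [{("has ate", "has eaten")}, {("a apples", "apples"), ("He", "he")}]
print(combine_edits(src, systems, quality))  # He has eaten apples .
```

The unhelpful edit ("He" → "he") is rejected because it does not improve the score, while the two genuine corrections survive.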

Unveiling A Core Linguistic Region in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14928
  • repo_url: None
  • paper_authors: Jun Zhao, Zhihao Zhang, Yide Ma, Qi Zhang, Tao Gui, Luhui Gao, Xuanjing Huang
  • for: Uses large language models (LLMs) as a vehicle for understanding the emergence of intelligence, taking brain localization as a prototype
  • methods: Conducts an analogical study of LLMs' functional regions to locate a core region corresponding to linguistic competence and examines its properties
  • results: The core linguistic region accounts for roughly 1% of total model parameters and shows significant dimension dependency: perturbing even a single parameter on specific dimensions can destroy linguistic competence. Improved linguistic competence does not necessarily bring a higher knowledge level, suggesting regions of domain knowledge dissociated from the linguistic region
    Abstract Brain localization, which describes the association between specific regions of the brain and their corresponding functions, is widely accepted in the field of cognitive science as an objective fact. Today's large language models (LLMs) possess human-level linguistic competence and can execute complex tasks requiring abstract knowledge and reasoning. To deeply understand the inherent mechanisms of intelligence emergence in LLMs, this paper conducts an analogical research using brain localization as a prototype. We have discovered a core region in LLMs that corresponds to linguistic competence, accounting for approximately 1% of the total model parameters. This core region exhibits significant dimension dependency, and perturbations to even a single parameter on specific dimensions can lead to a loss of linguistic competence. Furthermore, we observe that an improvement in linguistic competence does not necessarily accompany an elevation in the model's knowledge level, which might imply the existence of regions of domain knowledge that are dissociated from the linguistic region. Overall, exploring the LLMs' functional regions provides insights into the foundation of their intelligence. In the future, we will continue to investigate knowledge regions within LLMs and the interactions between them.

Air-Decoding: Attribute Distribution Reconstruction for Decoding-Time Controllable Text Generation

  • paper_url: http://arxiv.org/abs/2310.14892
  • repo_url: https://github.com/r1047/air-decoding
  • paper_authors: Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, Zhendong Mao
  • for: To improve controllability in text generation; the paper is the first to identify Attribute Collapse, where fluency drops rapidly once control strength exceeds a critical value, rendering the text unusable
  • methods: Proposes Air-Decoding, a lightweight decoding framework whose core idea is reconstructing attribute distributions to balance the weights between attribute and non-attribute words, yielding more fluent text
  • results: Experiments on multiple CTG tasks show the method achieves a new state-of-the-art control performance
    Abstract Controllable text generation (CTG) aims to generate text with desired attributes, and decoding-time-based methods have shown promising performance on this task. However, in this paper, we identify the phenomenon of Attribute Collapse for the first time. It causes the fluency of generated text to rapidly decrease when the control strength exceeds a critical value, rendering the text completely unusable. This limitation hinders the effectiveness of decoding methods in achieving high levels of controllability. To address this problem, we propose a novel lightweight decoding framework named Air-Decoding. Its main idea is reconstructing the attribute distributions to balance the weights between attribute words and non-attribute words to generate more fluent text. Specifically, we train prefixes by prefix-tuning to obtain attribute distributions. Then we design a novel attribute distribution reconstruction method to balance the obtained distributions and use the reconstructed distributions to guide language models for generation, effectively avoiding the issue of Attribute Collapse. Experiments on multiple CTG tasks prove that our method achieves a new state-of-the-art control performance.
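The attribute-collapse / reconstruction idea can be made concrete with a toy numerical sketch. The vocabulary, the probabilities, and the exact reconstruction rule below are invented for illustration; the paper's reconstruction operates on prefix-tuned attribute distributions and differs in detail.

```python
import numpy as np

# Toy next-token distributions over a 4-word vocabulary (made-up numbers).
vocab = ["good", "bad", "the", "movie"]
p_lm  = np.array([0.10, 0.10, 0.45, 0.35])   # base language model
p_pos = np.array([0.70, 0.02, 0.14, 0.14])   # positive-attribute prefix model
p_neg = np.array([0.02, 0.70, 0.14, 0.14])   # negative-attribute prefix model

def weighted_decoding(p_lm, p_attr, strength):
    """Naive attribute control: p_lm * p_attr**strength, renormalized.
    At high strength almost all mass lands on attribute words, which is
    the Attribute Collapse failure mode."""
    p = p_lm * np.power(p_attr, strength)
    return p / p.sum()

def air_decoding(p_lm, p_attr, p_other, strength):
    """Sketch of distribution reconstruction: score each token by its
    *relative* attribute evidence p_attr / (p_attr + p_other), which stays
    near 0.5 for attribute-neutral words, keeping their weights balanced."""
    ratio = p_attr / (p_attr + p_other)
    p = p_lm * np.power(ratio, strength)
    return p / p.sum()

for strength in (1.0, 8.0):
    print(strength,
          weighted_decoding(p_lm, p_pos, strength).round(4),
          air_decoding(p_lm, p_pos, p_neg, strength).round(4))
```

At control strength 8, naive weighted decoding leaves the fluency-carrying words "the"/"movie" with essentially zero probability, while the reconstructed distribution keeps them in play.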

Can ChatGPT Perform Reasoning Using the IRAC Method in Analyzing Legal Scenarios Like a Lawyer?

  • paper_url: http://arxiv.org/abs/2310.14880
  • repo_url: https://github.com/christinakang/sirac
  • paper_authors: Xiaoxi Kang, Lizhen Qu, Lay-Ki Soon, Adnan Trakic, Terry Yue Zhuo, Patrick Charles Emerton, Genevieve Grant
  • for: This paper examines whether large language models (LLMs) can analyze legal cases and reason the way lawyers do, following the IRAC method that legal professionals widely use to organize legal analysis.
  • methods: The authors constructed a novel corpus of scenarios covering Malaysian contract law and the Australian Social Act for Dependent Child, annotated each scenario with a complete IRAC analysis in a semi-structured format, and applied ChatGPT to analyze the corpus with the IRAC framework.
  • results: The study finds that ChatGPT can follow the IRAC method, but its analysis does not fully match that of legal professionals. The results suggest future research directions for improving the alignment between LLMs and legal experts.
    Abstract Large Language Models (LLMs), such as ChatGPT, have drawn a lot of attention recently in the legal domain due to their emergent ability to tackle a variety of legal tasks. However, it is still unknown whether LLMs are able to analyze a legal case and perform reasoning in the same manner as lawyers. Therefore, we constructed a novel corpus consisting of scenarios pertaining to the Contracts Act of Malaysia and the Australian Social Act for Dependent Child. ChatGPT is applied to perform analysis on the corpus using the IRAC method, a framework widely used by legal professionals for organizing legal analysis. Each scenario in the corpus is annotated with a complete IRAC analysis in a semi-structured format so that both machines and legal professionals are able to interpret and understand the annotations. In addition, we conducted the first empirical assessment of ChatGPT for IRAC analysis in order to understand how well it aligns with the analysis of legal professionals. Our experimental results shed light on possible future research directions to improve alignment between LLMs and legal experts in terms of legal reasoning.

We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields

  • paper_url: http://arxiv.org/abs/2310.14870
  • repo_url: https://github.com/jpwahle/emnlp23-citation-field-influence
  • paper_authors: Jan Philip Wahle, Terry Ruas, Mohamed Abdalla, Bela Gipp, Saif M. Mohammad
  • methods: The paper quantifies the mutual influence between NLP and 23 other fields of study through a citation analysis of ~77k NLP papers, ~3.1m citations from NLP papers to other papers, and ~1.8m citations from other papers to NLP papers.
  • results: The cross-field engagement of NLP, measured by the proposed Citation Field Diversity Index (CFDI), declined from 0.58 in 1980 to 0.31 in 2022 (an all-time low). NLP has also grown more insular, citing increasingly more NLP papers and producing fewer papers that act as bridges between fields. NLP citations are dominated by computer science: less than 8% go to linguistics and less than 3% to math and psychology. These findings underscore NLP's urgent need to reflect on its engagement with other fields.
    Abstract Natural Language Processing (NLP) is poised to substantially influence the world. However, significant progress comes hand-in-hand with substantial risks. Addressing them requires broad engagement with various fields of study. Yet, little empirical work examines the state of such engagement (past or current). In this paper, we quantify the degree of influence between 23 fields of study and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP papers to other papers, and ~1.8m citations from other papers to NLP papers. We show that, unlike most fields, the cross-field engagement of NLP, measured by our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in 1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown more insular -- citing increasingly more NLP papers and having fewer papers that act as bridges between fields. NLP citations are dominated by computer science; Less than 8% of NLP citations are to linguistics, and less than 3% are to math and psychology. These findings underscore NLP's urgent need to reflect on its engagement with various fields.
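The abstract does not give the formula for the Citation Field Diversity Index. As a rough illustration of what a field-diversity measure over citations looks like, here is a Gini-Simpson style index over citation field proportions; both the index definition and the citation counts are hypothetical stand-ins, not the paper's.

```python
from collections import Counter

def field_diversity(cited_fields):
    """Gini-Simpson diversity: 1 - sum(p_i**2) over field proportions.
    A hypothetical stand-in for the paper's CFDI (exact definition differs)."""
    counts = Counter(cited_fields)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Hypothetical citation field lists for an NLP paper in 1980 vs. 2022.
cites_1980 = ["cs"] * 4 + ["linguistics"] * 3 + ["psychology"] * 2 + ["math"]
cites_2022 = ["cs"] * 9 + ["linguistics"] * 1

print(round(field_diversity(cites_1980), 3))
print(round(field_diversity(cites_2022), 3))
```

A citation list dominated by a single field scores near 0; an even spread over fields scores closer to 1, matching the direction of the reported 0.58 to 0.31 decline.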

Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism

  • paper_url: http://arxiv.org/abs/2310.14868
  • repo_url: https://github.com/muyo8692/stepbystep-reasoning-vs-negation
  • paper_authors: Mengyu Ye, Tatsuki Kuribayashi, Jun Suzuki, Goro Kobayashi, Hiroaki Funayama
  • for: This study probes the step-by-step (chain-of-thought) reasoning ability of large language models (LLMs), focusing on negation, a core linguistic phenomenon that is difficult to process.
  • methods: The authors introduce several controlled settings (e.g., reasoning about fictional entities) to evaluate the logical reasoning abilities of modern LLMs.
  • results: Dozens of modern LLMs turn out not to be robust to lexical negation (e.g., plausible -> implausible) when performing CoT-style reasoning, and the results highlight limitations unique to each LLM family.
    Abstract Large language models (LLMs) take advantage of step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their ability to perform CoT-style reasoning robustly is of interest from a probing perspective. In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation, which is a core linguistic phenomenon that is difficult to process. In particular, we introduce several controlled settings (e.g., reasoning in case of fictional entities) to evaluate the logical reasoning abilities of the models. We observed that dozens of modern LLMs were not robust against lexical negation (e.g., plausible ->implausible) when performing CoT-style reasoning, and the results highlight unique limitations in each LLM family.

Paraphrase Types for Generation and Detection

  • paper_url: http://arxiv.org/abs/2310.14863
  • repo_url: https://github.com/jpwahle/emnlp23-paraphrase-types
  • paper_authors: Jan Philip Wahle, Bela Gipp, Terry Ruas
  • for: This paper aims to address the limitations of current paraphrase generation and detection approaches by introducing two new tasks that consider specific linguistic perturbations at particular text positions.
  • methods: The paper proposes two new tasks, Paraphrase Type Generation and Paraphrase Type Detection, which involve generating and identifying fine-grained paraphrase types.
  • results: The results suggest that while current techniques perform well in a binary classification scenario, they struggle with the inclusion of fine-grained paraphrase types. Models trained in generating and identifying paraphrase types show improvements in tasks without them, and scaling these models further improves their ability to understand paraphrase types.
    Abstract Current approaches in paraphrase generation and detection heavily rely on a single general similarity score, ignoring the intricate linguistic properties of language. This paper introduces two new tasks to address this shortcoming by considering paraphrase types - specific linguistic perturbations at particular text positions. We name these tasks Paraphrase Type Generation and Paraphrase Type Detection. Our results suggest that while current techniques perform well in a binary classification scenario, i.e., paraphrased or not, the inclusion of fine-grained paraphrase types poses a significant challenge. While most approaches are good at generating and detecting general semantic similar content, they fail to understand the intrinsic linguistic variables they manipulate. Models trained in generating and identifying paraphrase types also show improvements in tasks without them. In addition, scaling these models further improves their ability to understand paraphrase types. We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.
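The shift from binary paraphrase detection to paraphrase *types* can be illustrated with a toy detector for a single type, synonym substitution at particular token positions. The type name, the span convention, and the synonym list below are hypothetical, not from the paper's taxonomy.

```python
def detect_synonym_substitution(src, tgt, synonyms):
    """Toy detector for one fine-grained paraphrase type: flags positions
    where the two sentences differ by a known synonym pair, and nothing
    else. Returns (type, position) pairs, or None if the pair does not
    match this paraphrase type."""
    s = src.lower().rstrip(".").split()
    t = tgt.lower().rstrip(".").split()
    if len(s) != len(t):
        return None
    edits = [(i, a, b) for i, (a, b) in enumerate(zip(s, t)) if a != b]
    if edits and all((a, b) in synonyms or (b, a) in synonyms
                     for _, a, b in edits):
        return [("synonym substitution", i) for i, _, _ in edits]
    return None

SYN = {("film", "movie"), ("fantastic", "great")}
print(detect_synonym_substitution("The film was fantastic.",
                                  "The movie was great.", SYN))
```

A binary detector would only answer "paraphrase or not"; the typed version additionally names the linguistic perturbation and where it occurs, which is the harder task the paper studies.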

3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for Embodied Turn-Taking Prediction

  • paper_url: http://arxiv.org/abs/2310.14859
  • repo_url: None
  • paper_authors: Mehdi Fatan, Emanuele Mincato, Dimitra Pintzou, Mariella Dimiccoli
  • for: Predicting turn-taking in multiparty conversations has many practical applications in human-computer/robot interaction, but the complexity of human communication makes it a challenging task.
  • methods: The authors propose a new multimodal transformer-based architecture for predicting turn-taking from embodied, synchronized multi-perspective data.
  • results: Experiments on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average over existing baselines and alternative transformer-based approaches.
    Abstract Predicting turn-taking in multiparty conversations has many practical applications in human-computer/robot interaction. However, the complexity of human communication makes it a challenging task. Recent advances have shown that synchronous multi-perspective egocentric data can significantly improve turn-taking prediction compared to asynchronous, single-perspective transcriptions. Building on this research, we propose a new multimodal transformer-based architecture for predicting turn-taking in embodied, synchronized multi-perspective data. Our experimental results on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average compared to existing baselines and alternative transformer-based approaches. The source code and the pre-trained models of our 3M-Transformer will be available upon acceptance.

Adaptive Policy with Wait-$k$ Model for Simultaneous Translation

  • paper_url: http://arxiv.org/abs/2310.14853
  • repo_url: None
  • paper_authors: Libo Zhao, Kai Fan, Wei Luo, Jing Wu, Shushu Wang, Ziqian Zeng, Zhongqiang Huang
  • for: To improve simultaneous machine translation (SiMT), which requires a robust read/write policy in conjunction with a high-quality translation model.
  • methods: The paper proposes a more flexible approach that decouples the adaptive policy model from the translation model. It introduces DaP, a divergence-based adaptive policy that makes read/write decisions for any translation model based on the potential divergence in translation distributions caused by future information; DaP extends a frozen wait-$k$ model with lightweight parameters and is both memory and computation efficient.
  • results: Experiments across various benchmarks show that the approach offers a better trade-off between translation quality and latency, outperforming strong baselines.
    Abstract Simultaneous machine translation (SiMT) requires a robust read/write policy in conjunction with a high-quality translation model. Traditional methods rely on either a fixed wait-$k$ policy coupled with a standalone wait-$k$ translation model, or an adaptive policy jointly trained with the translation model. In this study, we propose a more flexible approach by decoupling the adaptive policy model from the translation model. Our motivation stems from the observation that a standalone multi-path wait-$k$ model performs competitively with adaptive policies utilized in state-of-the-art SiMT approaches. Specifically, we introduce DaP, a divergence-based adaptive policy, that makes read/write decisions for any translation model based on the potential divergence in translation distributions resulting from future information. DaP extends a frozen wait-$k$ model with lightweight parameters, and is both memory and computation efficient. Experimental results across various benchmarks demonstrate that our approach offers an improved trade-off between translation accuracy and latency, outperforming strong baselines.
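For readers unfamiliar with the fixed wait-$k$ policy the paper builds on, this small sketch enumerates the read ("R") / write ("W") schedule such a policy produces: read $k$ source tokens, then alternate one write per read until the source is exhausted. DaP replaces this fixed schedule with divergence-based decisions, which requires a trained model and is only noted in a comment here.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Action sequence of a fixed wait-k policy: READ the first k source
    tokens, then alternate WRITE/READ until the source runs out, then
    WRITE the remaining target tokens. An adaptive policy such as DaP
    would instead decide R vs. W per step, e.g. reading whenever future
    source tokens could shift the translation distribution too much."""
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("R")
            read += 1
        else:
            actions.append("W")
            written += 1
    return actions

print("".join(wait_k_schedule(k=3, src_len=6, tgt_len=6)))
```

With $k{=}3$ and six tokens on each side the schedule is `RRRWRWRWRWWW`: a latency of three tokens, then lock-step translation.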

Universal Domain Adaptation for Robust Handling of Distributional Shifts in NLP

  • paper_url: http://arxiv.org/abs/2310.14849
  • repo_url: https://github.com/heyjoonkim/universal_domain_adaptation_for_nlp
  • paper_authors: Hyuhng Joon Kim, Hyunsoo Cho, Sang-Woo Lee, Junyeob Kim, Choonghyun Park, Sang-goo Lee, Kang Min Yoo, Taeuk Kim
  • for: This work explores Universal Domain Adaptation (UniDA) for natural language input, targeting both adaptation ability and robustness (the ability to detect out-of-distribution samples).
  • methods: The authors build a comprehensive natural language benchmark spanning multiple datasets with varying difficulty levels and characteristics, including temporal shifts and diverse domains, and use it to validate existing UniDA methods from computer vision alongside state-of-the-art domain adaptation techniques from the NLP literature.
  • results: UniDA methods originally designed for image input transfer effectively to the natural language domain, and the experiments highlight the effect of adaptation difficulty on model performance.
    Abstract When deploying machine learning systems to the wild, it is highly desirable for them to effectively leverage prior knowledge to the unfamiliar domain while also firing alarms to anomalous inputs. In order to address these requirements, Universal Domain Adaptation (UniDA) has emerged as a novel research area in computer vision, focusing on achieving both adaptation ability and robustness (i.e., the ability to detect out-of-distribution samples). While UniDA has led significant progress in computer vision, its application on language input still needs to be explored despite its feasibility. In this paper, we propose a comprehensive benchmark for natural language that offers thorough viewpoints of the model's generalizability and robustness. Our benchmark encompasses multiple datasets with varying difficulty levels and characteristics, including temporal shifts and diverse domains. On top of our testbed, we validate existing UniDA methods from computer vision and state-of-the-art domain adaptation techniques from NLP literature, yielding valuable findings: We observe that UniDA methods originally designed for image input can be effectively transferred to the natural language domain while also underscoring the effect of adaptation difficulty in determining the model's performance.
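The "robustness" half of UniDA, flagging out-of-distribution inputs, can be sketched with a simple maximum-softmax-probability threshold, a common baseline. The labels, logits, and threshold value are made up, and the methods benchmarked in the paper are more sophisticated.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_with_rejection(logits, labels, threshold=0.7):
    """Predict a known class only when the max softmax probability clears
    a confidence threshold; otherwise flag the input as out-of-distribution
    ('unknown'), which is the alarm-firing behavior UniDA asks for."""
    probs = softmax(logits)
    p_max = max(probs)
    if p_max < threshold:
        return "unknown"
    return labels[probs.index(p_max)]

labels = ["politics", "sports", "tech"]
print(classify_with_rejection([4.0, 0.1, 0.2], labels))   # confident input
print(classify_with_rejection([1.0, 0.9, 1.1], labels))   # near-uniform input
```

A peaked distribution yields a class label; a near-uniform one is rejected as unknown.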

Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution

  • paper_url: http://arxiv.org/abs/2310.14840
  • repo_url: https://github.com/clclab/pcfg-lm
  • paper_authors: Jaap Jumelet, Willem Zuidema
  • for: This paper presents a setup for training, evaluating and interpreting neural language models on artificial, language-like data generated by a massive probabilistic grammar, giving complete control over the generative process.
  • methods: The data are generated with a large probabilistic grammar (based on state-split PCFGs) derived from a natural language corpus; both grammar and corpus are released, and the naturalness of the generated data is tested. Access to the true distribution yields closed-form expressions for exact lower bounds on obtainable perplexity under both causal and masked language modelling.
  • results: Neural language modelling architectures and training objectives differ strikingly in how closely they approach the lower bound on perplexity. The setup also allows direct comparison of learned representations with the symbolic rules of the underlying source, revealing marked differences in learning dynamics between different classes of words.
    Abstract We present a setup for training, evaluating and interpreting neural language models, that uses artificial, language-like data. The data is generated using a massive probabilistic grammar (based on state-split PCFGs), that is itself derived from a large natural language corpus, but also provides us complete control over the generative process. We describe and release both grammar and corpus, and test for the naturalness of our generated data. This approach allows us to define closed-form expressions to efficiently compute exact lower bounds on obtainable perplexity using both causal and masked language modelling. Our results show striking differences between neural language modelling architectures and training objectives in how closely they allow approximating the lower bound on perplexity. Our approach also allows us to directly compare learned representations to symbolic rules in the underlying source. We experiment with various techniques for interpreting model behaviour and learning dynamics. With access to the underlying true source, our results show striking differences and outcomes in learning dynamics between different classes of words.
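A miniature version of the data-generation side: sampling sentences from a toy PCFG whose rule probabilities are fully known. The paper's grammar is a massive state-split PCFG derived from a real corpus; this six-rule grammar is purely illustrative.

```python
import random

# Toy probabilistic grammar: nonterminal -> list of (prob, right-hand side).
GRAMMAR = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.6, ["Det", "N"]), (0.4, ["N"])],
    "VP":  [(0.7, ["V", "NP"]), (0.3, ["V"])],
    "Det": [(1.0, ["the"])],
    "N":   [(0.5, ["cat"]), (0.5, ["dog"])],
    "V":   [(0.5, ["sees"]), (0.5, ["chases"])],
}

def sample(symbol, rng):
    """Recursively expand `symbol` by sampling one production per step."""
    if symbol not in GRAMMAR:                      # terminal token
        return [symbol]
    r, acc = rng.random(), 0.0
    for prob, rhs in GRAMMAR[symbol]:
        acc += prob
        if r <= acc:
            return [tok for s in rhs for tok in sample(s, rng)]
    # Floating-point fallback: take the last production.
    return [tok for s in GRAMMAR[symbol][-1][1] for tok in sample(s, rng)]

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample("S", rng)))
```

Because the generative probabilities are known exactly, the probability of every sentence, and hence the corpus entropy and a perplexity lower bound, can be computed in closed form, which is what the paper exploits for its evaluation.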

Characterizing how ‘distributional’ NLP corpora distance metrics are

  • paper_url: http://arxiv.org/abs/2310.14829
  • repo_url: https://github.com/ibm/text-corpus-distance-distributionality
  • paper_authors: Samuel Ackerman, George Kour, Eitan Farchi
  • for: This work characterizes how 'distributional' different distance metrics between two corpora of vector-embedded documents are, i.e., how well a single metric reflects the distance between the corpora's full distributions rather than relying on very local measurements.
  • methods: The authors construct a Known-Similarity Corpora set from two paraphrase corpora and compute distances between paired corpora drawn from it; the shape of the distance trend as the separation between set elements grows quantifies the distributionality of a metric.
  • results: Average Hausdorff Distance and energy distance are proposed as representative examples of non-distributional and distributional metrics, respectively, against which other metrics (e.g., Mauve, Frechet Inception distance) can be compared. A non-distributional metric can judge two corpora as close whenever individual pairwise nearest-neighbor distances are low, even if the two distributions are in fact far apart.
    Abstract A corpus of vector-embedded text documents has some empirical distribution. Given two corpora, we want to calculate a single metric of distance (e.g., Mauve, Frechet Inception) between them. We describe an abstract quality, called `distributionality', of such metrics. A non-distributional metric tends to use very local measurements, or uses global measurements in a way that does not fully reflect the distributions' true distance. For example, if individual pairwise nearest-neighbor distances are low, it may judge the two corpora to have low distance, even if their two distributions are in fact far from each other. A more distributional metric will, in contrast, better capture the distributions' overall distance. We quantify this quality by constructing a Known-Similarity Corpora set from two paraphrase corpora and calculating the distance between paired corpora from it. The distances' trend shape as set element separation increases should quantify the distributionality of the metric. We propose that Average Hausdorff Distance and energy distance between corpora are representative examples of non-distributional and distributional distance metrics, to which other metrics can be compared, to evaluate how distributional they are.
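The two reference metrics are straightforward to compute over embedded corpora. A sketch with synthetic "embeddings" (the dimensionality, sample sizes, and Gaussian stand-ins are arbitrary choices, not from the paper):

```python
import numpy as np

def pairwise_dists(A, B):
    # Euclidean distance between every row of A and every row of B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def avg_hausdorff(A, B):
    """Average Hausdorff distance: mean nearest-neighbor distance in both
    directions. Driven by local matches, hence weakly distributional."""
    D = pairwise_dists(A, B)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())

def energy_distance(A, B):
    """Energy distance: 2*E|a-b| - E|a-a'| - E|b-b'|. Compares the two
    distributions as wholes, hence strongly distributional."""
    return (2 * pairwise_dists(A, B).mean()
            - pairwise_dists(A, A).mean()
            - pairwise_dists(B, B).mean())

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 8))   # stand-in document embeddings
B = rng.normal(0.5, 1.0, size=(200, 8))   # corpus with a mean shift
C = rng.normal(0.0, 1.0, size=(200, 8))   # same distribution as A
print(avg_hausdorff(A, B), energy_distance(A, B))
print(avg_hausdorff(A, C), energy_distance(A, C))
```

Energy distance between same-distribution samples hovers near zero and grows with the shift, while nearest-neighbor-based distances react much less sharply, which is the contrast the paper's distributionality analysis formalizes.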

ALCUNA: Large Language Models Meet New Knowledge

  • paper_url: http://arxiv.org/abs/2310.14820
  • repo_url: https://github.com/arvid-pku/alcuna
  • paper_authors: Xunjian Yin, Baizhou Huang, Xiaojun Wan
  • for: This paper aims to address the lack of benchmarks for evaluating large-scale language models’ (LLMs) ability to handle new knowledge, an important aspect in the rapidly evolving world.
  • methods: The proposed approach, called KnowGen, generates new knowledge by altering existing entity attributes and relationships, resulting in artificial entities that are distinct from real-world entities. A new benchmark, ALCUNA, is introduced to assess LLMs’ abilities in knowledge understanding, differentiation, and association.
  • results: The authors benchmark several LLMs and find that their performance in face of new knowledge is not satisfactory, particularly in reasoning between new and internal knowledge. The impact of entity similarity on the model’s understanding of entity knowledge and the influence of contextual entities are also explored.
    Abstract With the rapid development of NLP, large-scale language models (LLMs) excel in various tasks across multiple domains now. However, existing benchmarks may not adequately measure these models' capabilities, especially when faced with new knowledge. In this paper, we address the lack of benchmarks to evaluate LLMs' ability to handle new knowledge, an important and challenging aspect in the rapidly evolving world. We propose an approach called KnowGen that generates new knowledge by altering existing entity attributes and relationships, resulting in artificial entities that are distinct from real-world entities. With KnowGen, we introduce a benchmark named ALCUNA to assess LLMs' abilities in knowledge understanding, differentiation, and association. We benchmark several LLMs, reveals that their performance in face of new knowledge is not satisfactory, particularly in reasoning between new and internal knowledge. We also explore the impact of entity similarity on the model's understanding of entity knowledge and the influence of contextual entities. We appeal to the need for caution when using LLMs in new scenarios or with new knowledge, and hope that our benchmarks can help drive the development of LLMs in face of new knowledge.
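A minimal sketch of KnowGen-style entity construction: copy a real entity and perturb a few attribute values, producing an artificial entity that is plausible in form but distinct from any real-world entity. The platypus example, attribute pool, and naming convention are hypothetical, not taken from ALCUNA.

```python
import random

def know_gen(entity, attribute_pool, rng, n_changes=2):
    """Create an artificial entity by swapping n_changes attribute values
    for alternatives drawn from a pool (each swap picks a value different
    from the original, so the result always conflicts with the real entity)."""
    new = dict(entity)
    mutable = [k for k in sorted(entity) if k != "name"]
    for key in rng.sample(mutable, k=n_changes):
        alternatives = [v for v in attribute_pool[key] if v != new[key]]
        new[key] = rng.choice(alternatives)
    new["name"] = entity["name"] + "-alcuna"   # mark as artificial
    return new

platypus = {"name": "platypus", "class": "Mammalia",
            "diet": "carnivore", "reproduction": "egg-laying"}
pool = {"class": ["Mammalia", "Aves", "Reptilia"],
        "diet": ["carnivore", "herbivore", "omnivore"],
        "reproduction": ["egg-laying", "live-bearing"]}

print(know_gen(platypus, pool, random.Random(1)))
```

Because the mutated entity exists nowhere in the training data, any question about it tests the model's handling of genuinely new knowledge rather than memorization.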

Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic

  • paper_url: http://arxiv.org/abs/2310.14819
  • repo_url: None
  • paper_authors: Sabri Boughorbel, Majd Hawasly
  • for: This paper evaluates the multi-turn instruction-following ability of open large language models in Arabic.
  • methods: Using a customized Arabic translation of the MT-Bench benchmark suite, the authors employ GPT-4 as a uniform evaluator for both English and Arabic queries to assess and compare model performance on various open-ended tasks.
  • results: Model responses vary across task categories (e.g., logic vs. literacy) depending on whether instructions are given in English or Arabic. Base models fine-tuned on multilingual and multi-turn datasets can be competitive with models trained from scratch on multilingual data. Finally, the authors hypothesize that an ensemble of small, open LLMs could perform competitively with proprietary LLMs on the benchmark.
    Abstract While significant progress has been made in benchmarking Large Language Models (LLMs) across various tasks, there is a lack of comprehensive evaluation of their abilities in responding to multi-turn instructions in less-commonly tested languages like Arabic. Our paper offers a detailed examination of the proficiency of open LLMs in such scenarios in Arabic. Utilizing a customized Arabic translation of the MT-Bench benchmark suite, we employ GPT-4 as a uniform evaluator for both English and Arabic queries to assess and compare the performance of the LLMs on various open-ended tasks. Our findings reveal variations in model responses on different task categories, e.g., logic vs. literacy, when instructed in English or Arabic. We find that fine-tuned base models using multilingual and multi-turn datasets could be competitive to models trained from scratch on multilingual data. Finally, we hypothesize that an ensemble of small, open LLMs could perform competitively to proprietary LLMs on the benchmark.

DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning

  • paper_url: http://arxiv.org/abs/2310.15205
  • repo_url: https://github.com/fudandisc/disc-finllm
  • paper_authors: Wei Chen, Qiushi Wang, Zefei Long, Xianyin Zhang, Zhongtian Lu, Bingxuan Li, Siyuan Wang, Jiarong Xu, Xiang Bai, Xuanjing Huang, Zhongyu Wei
  • for: To build a Chinese financial large language model (DISC-FinLLM) that endows a general LLM with multi-turn question answering, domain text processing, mathematical computation, and retrieval-enhanced generation abilities.
  • methods: The authors propose a Multiple Experts Fine-tuning Framework and construct a financial instruction-tuning dataset, DISC-FIN-SFT, containing instruction samples of four categories: consulting, NLP tasks, computing, and retrieval-augmented generation.
  • results: Evaluations on multiple benchmarks show that the model outperforms baseline models in various financial scenarios. Further resources are available at https://github.com/FudanDISC/DISC-FinLLM.
    Abstract We propose Multiple Experts Fine-tuning Framework to build a financial large language model (LLM), DISC-FinLLM. Our methodology improves general LLMs by endowing them with multi-turn question answering abilities, domain text processing capabilities, mathematical computation skills, and retrieval-enhanced generation capabilities. We build a financial instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of four categories (consulting, NLP tasks, computing and retrieval-augmented generation). Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios. Further resources can be found at https://github.com/FudanDISC/DISC-FinLLM.
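The multiple-experts idea can be caricatured as routing each query to a task-specific expert. The paper fine-tunes separate experts on the four DISC-FIN-SFT instruction categories; the keyword router and canned expert functions below are purely illustrative and not how the released system dispatches queries.

```python
def route(query, experts, keywords):
    """Toy dispatcher for a multi-expert setup: send the query to the
    expert whose keyword list matches it best (a real system would use a
    learned router or per-task adapters rather than keyword counts)."""
    scores = {name: sum(kw in query.lower() for kw in kws)
              for name, kws in keywords.items()}
    best = max(scores, key=scores.get)
    return experts[best](query)

experts = {
    "computing":  lambda q: "compute: 0.05 * 1000 = 50.0",
    "consulting": lambda q: "advice: diversify your portfolio",
}
keywords = {
    "computing":  ["interest", "rate", "compute", "%"],
    "consulting": ["should", "advice", "recommend"],
}
print(route("compute the interest on 1000 at a 5% rate", experts, keywords))
print(route("what should I invest in? any advice", experts, keywords))
```

Each expert only has to be good at its own instruction category, which is the division of labor the fine-tuning framework formalizes.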

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

  • paper_url: http://arxiv.org/abs/2310.14806
  • repo_url: None
  • paper_authors: Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li, Yashesh Gaur
  • for: This paper addresses streaming automatic speech recognition (ASR) and speech translation (ST), where traditional approaches rely on separate systems, wasting computational resources and complicating real-time synchronization.
  • methods: The authors propose a streaming Transformer-Transducer (T-T) model that jointly produces many-to-one and one-to-many transcription and translation with a single decoder, together with a novel timestamp-based method for joint token-level serialized output training that effectively produces ASR and ST outputs in the streaming setting.
  • results: Experiments on {it,es,de}->en demonstrate the effectiveness of the approach, enabling for the first time the generation of one-to-many joint outputs with a single decoder.
    Abstract The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, leading to inefficiencies in computational resources, and increased synchronization complexity in real time. In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. We introduce a novel method for joint token-level serialized output training based on timestamp information to effectively produce ASR and ST outputs in the streaming setting. Experiments on {it,es,de}->en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
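One way to picture token-level serialized output training: interleave transcription and translation tokens by source timestamp, tagging each token with its output channel, so that a single decoder can emit all outputs in streaming order. The timestamps, channel tags, and tokens below are invented, and the paper's serialization scheme differs in detail.

```python
def serialize_outputs(streams):
    """Merge several (timestamp, token) streams into one channel-tagged
    target sequence ordered by timestamp, a sketch of joint serialized
    training targets for a single streaming decoder."""
    tagged = []
    for channel, tokens in streams.items():
        for ts, tok in tokens:
            tagged.append((ts, channel, tok))
    tagged.sort(key=lambda item: item[0])
    return [f"<{channel}>{tok}" for _, channel, tok in tagged]

streams = {
    "asr":   [(0.4, "wie"), (0.9, "geht's")],               # German transcript
    "st-en": [(0.5, "how"), (1.0, "are"), (1.1, "you")],    # English translation
}
print(" ".join(serialize_outputs(streams)))
```

The merged sequence keeps each output temporally aligned with the audio, which is what lets one decoder serve both ASR and (possibly several) translation targets in real time.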

Cross-Modal Conceptualization in Bottleneck Models

  • paper_url: http://arxiv.org/abs/2310.14805
  • repo_url: https://github.com/danisalukaev/xcbs
  • paper_authors: Danis Alukaev, Semen Kiselev, Ilya Pershin, Bulat Ibragimov, Vladimir Ivanov, Alexey Kornaev, Ivan Titov
  • for: This work proposes a cross-modal approach that uses text descriptions accompanying images (e.g., radiology reports) to guide the induction of concepts in concept bottleneck models for medical image classification.
  • methods: Concepts are treated as discrete latent variables, and the approach promotes concepts that are predictive of the label and can be predicted reliably from both the image and the text.
  • results: Experiments on datasets ranging from synthetic images with generated descriptions to realistic medical imaging show that cross-modal learning encourages the induction of interpretable concepts, facilitates disentanglement, and increases robustness by suppressing reliance on shortcut features.
    Abstract Concept Bottleneck Models (CBMs) assume that training examples (e.g., x-ray images) are annotated with high-level concepts (e.g., types of abnormalities), and perform classification by first predicting the concepts, followed by predicting the label relying on these concepts. The main difficulty in using CBMs comes from having to choose concepts that are predictive of the label and then having to label training examples with these concepts. In our approach, we adopt a more moderate assumption and instead use text descriptions (e.g., radiology reports), accompanying the images in training, to guide the induction of concepts. Our cross-modal approach treats concepts as discrete latent variables and promotes concepts that (1) are predictive of the label, and (2) can be predicted reliably from both the image and text. Through experiments conducted on datasets ranging from synthetic datasets (e.g., synthetic images with generated descriptions) to realistic medical imaging datasets, we demonstrate that cross-modal learning encourages the induction of interpretable concepts while also facilitating disentanglement. Our results also suggest that this guidance leads to increased robustness by suppressing the reliance on shortcut features.
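The bottleneck structure the paper builds on is easy to sketch: the label head sees only the predicted concepts, never the raw input, which is what makes the prediction interpretable. The weights, features, and concept names below are toy values, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConceptBottleneck:
    """Minimal concept bottleneck model: input features -> concept
    probabilities -> label score. The label head takes the concepts as
    its only input, so every prediction is explainable in concept terms."""
    def __init__(self, w_concepts, w_label):
        self.w_concepts = w_concepts   # shape (n_concepts, n_features)
        self.w_label = w_label         # shape (n_concepts,)

    def predict(self, x):
        concepts = sigmoid(self.w_concepts @ x)    # e.g. "opacity", "nodule"
        label_score = sigmoid(self.w_label @ concepts)
        return concepts, label_score

w_c = np.array([[ 2.0, -1.0],
                [-1.0,  2.0]])
w_l = np.array([3.0, 3.0])
model = ConceptBottleneck(w_c, w_l)
concepts, score = model.predict(np.array([1.0, 1.0]))
print(concepts.round(3), round(float(score), 3))
```

The paper's contribution sits in how the concept layer is induced: instead of requiring hand-annotated concepts, the discrete concept variables are learned jointly from the image and the accompanying text.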

Geographical Erasure in Language Generation

  • paper_url: http://arxiv.org/abs/2310.14777
  • repo_url: https://github.com/amazon-science/geographical-erasure-in-language-generation
  • paper_authors: Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, Danish Pruthi
  • for: Studying how large language models (LLMs) disproportionately reflect the dominant groups in their training data, and how this bias in generated language can be corrected.
  • methods: The authors detect and quantify geographical erasure across a range of LLMs, analyze the models' training data, and mitigate the erasure by finetuning with a custom objective.
  • results: Many language models exhibit geographical erasure, mentioning certain countries at abnormally low rates; the erasure strongly correlates with low country-mention frequencies in the training corpus, and finetuning mitigates it.
    Abstract Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can propagate into generated language. In this work, we study and operationalise a form of geographical erasure, wherein language models underpredict certain countries. We demonstrate consistent instances of erasure across a range of LLMs. We discover that erasure strongly correlates with low frequencies of country mentions in the training corpus. Lastly, we mitigate erasure by finetuning using a custom objective.
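The notion of "underpredicting certain countries" can be made concrete with a toy score; the threshold rule and names below are assumptions for illustration, not the paper's exact operationalisation:

```python
def erased_countries(model_probs, reference_probs, factor=3.0):
    # Flag countries the model underpredicts by more than `factor`
    # relative to a reference distribution over country mentions.
    return sorted(
        c for c, ref in reference_probs.items()
        if model_probs.get(c, 0.0) * factor < ref
    )

reference = {"india": 0.30, "nigeria": 0.20, "uk": 0.10}
model = {"india": 0.25, "nigeria": 0.02, "uk": 0.12}
```

Here `erased_countries(model, reference)` flags only the country whose predicted probability falls far below its reference share.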

SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research

  • paper_url: http://arxiv.org/abs/2310.14757
  • repo_url: None
  • paper_authors: Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Leonardo Neves, Kiamehr Rezaee, Luis Espinosa-Anke, Jiaxin Pei, Jose Camacho-Collados
  • for: Improving the evaluation and comparability of NLP models on social media.
  • methods: Introduces SuperTweetEval, a unified benchmark with a heterogeneous set of tasks and datasets that were combined, adapted, or constructed from scratch, suitable for comparing different models and metrics.
  • results: Despite recent advances in language modelling, social media remains challenging, and model performance still has considerable room for improvement.
    Abstract Despite its relevance, the maturity of NLP for social media pales in comparison with general-purpose models, metrics and benchmarks. This fragmented landscape makes it hard for the community to know, for instance, given a task, which is the best performing model and how it compares with others. To alleviate this issue, we introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval, which includes a heterogeneous set of tasks and datasets combined, adapted and constructed from scratch. We benchmarked the performance of a wide range of models on SuperTweetEval and our results suggest that, despite the recent advances in language modelling, social media remains challenging.

MCC-KD: Multi-CoT Consistent Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2310.14747
  • repo_url: None
  • paper_authors: Hongzhan Chen, Siyue Wu, Xiaojun Quan, Rui Wang, Ming Yan, Ji Zhang
  • for: Transferring the complex reasoning abilities of large language models (LLMs) to smaller models.
  • methods: Proposes Multi-CoT Consistent Knowledge Distillation (MCC-KD), which generates multiple rationales for each question and enforces consistency among the corresponding predictions, improving both the diversity and the consistency of the distilled reasoning.
  • results: Experiments with different architectures (LLaMA/FlanT5) and scales (3B/7B/11B/13B) show that MCC-KD performs strongly on mathematical and commonsense reasoning benchmarks and generalizes robustly to out-of-distribution datasets.
    Abstract Large language models (LLMs) have showcased remarkable capabilities in complex reasoning through chain of thought (CoT) prompting. Recently, there has been a growing interest in transferring these reasoning abilities from LLMs to smaller models. However, achieving both the diversity and consistency in rationales presents a challenge. In this paper, we focus on enhancing these two aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple rationales for each question and enforce consistency among the corresponding predictions by minimizing the bidirectional KL-divergence between the answer distributions. We investigate the effectiveness of MCC-KD with different model architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both mathematical reasoning and commonsense reasoning benchmarks. The empirical results not only confirm MCC-KD's superior performance on in-distribution datasets but also highlight its robust generalization ability on out-of-distribution datasets.
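The consistency objective named in the abstract, the bidirectional KL-divergence between the answer distributions obtained from different rationales, is simple to write down; this is a sketch of the formula only, not the authors' training code:

```python
import math

def kl(p, q):
    # KL(P || Q) for discrete answer distributions over the same support
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_kl(p, q):
    # Consistency loss between the predictions from two rationales
    return kl(p, q) + kl(q, p)
```

Minimizing this term pulls the answer distributions of all rationales for a question toward each other, which is what enforces consistency during distillation.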

Once Upon a $\textit{Time}$ in $\textit{Graph}$: Relative-Time Pretraining for Complex Temporal Reasoning

  • paper_url: http://arxiv.org/abs/2310.14709
  • repo_url: https://github.com/damo-nlp-sg/rememo
  • paper_authors: Sen Yang, Xin Li, Lidong Bing, Wai Lam
  • for: Improving pretrained language models' understanding of, and reasoning over, temporal contexts in text.
  • methods: Exploits the underlying nature of time, in which all temporally-scoped sentences lie along a one-dimensional time axis, to build a graph over the relative placements of events; proposes RemeMo, which connects all temporally-scoped facts by modeling the time relation between any two sentences.
  • results: RemeMo outperforms the T5 baseline on multiple temporal question answering datasets under various settings, and is especially strong at modeling long-range complex temporal dependencies. Code and pretrained checkpoints are released at https://github.com/DAMO-NLP-SG/RemeMo.
    Abstract Our physical world is constantly evolving over time, rendering challenges for pre-trained language models to understand and reason over the temporal contexts of texts. Existing work focuses on strengthening the direct association between a piece of text and its time-stamp. However, the knowledge-time association is usually insufficient for the downstream tasks that require reasoning over temporal dependencies between knowledge. In this work, we make use of the underlying nature of time, all temporally-scoped sentences are strung together through a one-dimensional time axis, and suggest creating a graph structure based on the relative placements of events along the time axis. Inspired by the graph view, we propose RemeMo ($\underline{Re}$lative Ti$\underline{me}$ $\underline{Mo}$deling), which explicitly connects all temporally-scoped facts by modeling the time relations between any two sentences. Experimental results show that RemeMo outperforms the baseline T5 on multiple temporal question answering datasets under various settings. Further analysis suggests that RemeMo is especially good at modeling long-range complex temporal dependencies. We release our code and pre-trained checkpoints at $\href{https://github.com/DAMO-NLP-SG/RemeMo}{\text{this url}}$.
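The basic signal RemeMo models, the relative placement of two time-scoped facts on a one-dimensional time axis, reduces to a small set of relations; a toy sketch (the label set here is an assumption, not the paper's exact relation inventory):

```python
from datetime import date

def time_relation(t1, t2):
    # Relative placement of two time-stamps on the one-dimensional axis
    if t1 < t2:
        return "before"
    if t1 > t2:
        return "after"
    return "same"
```

Pretraining on such pairwise relations, rather than only on text-to-timestamp association, is what lets the model reason about dependencies between facts.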

Strong and Efficient Baselines for Open Domain Conversational Question Answering

  • paper_url: http://arxiv.org/abs/2310.14708
  • repo_url: None
  • paper_authors: Andrei C. Coman, Gianni Barlacchi, Adrià de Gispert
  • for: Re-evaluating the state-of-the-art (SotA) dense passage retrieval (DPR) retriever and Fusion-in-Decoder (FiD) reader pipeline for open-domain conversational question answering (ODConvQA), and proposing stronger, more efficient baselines.
  • methods: Introduces a fast reranking component between the retriever and the reader, together with targeted fine-tuning steps.
  • results: On TopiOCQA and OR-QuAC, the method improves on the SotA results while reducing the reader's latency by 60%, and provides valuable new insights for future, more intricate approaches, including those that leverage large language models (LLMs).
    Abstract Unlike the Open Domain Question Answering (ODQA) setting, the conversational (ODConvQA) domain has received limited attention when it comes to reevaluating baselines for both efficiency and effectiveness. In this paper, we study the State-of-the-Art (SotA) Dense Passage Retrieval (DPR) retriever and Fusion-in-Decoder (FiD) reader pipeline, and show that it significantly underperforms when applied to ODConvQA tasks due to various limitations. We then propose and evaluate strong yet simple and efficient baselines, by introducing a fast reranking component between the retriever and the reader, and by performing targeted finetuning steps. Experiments on two ODConvQA tasks, namely TopiOCQA and OR-QuAC, show that our method improves the SotA results, while reducing reader's latency by 60%. Finally, we provide new and valuable insights into the development of challenging baselines that serve as a reference for future, more intricate approaches, including those that leverage Large Language Models (LLMs).
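The reranking step described above amounts to scoring and reordering retrieved passages before the expensive reader runs; the word-overlap scorer below is a placeholder for illustration, not the paper's reranker:

```python
def rerank(passages, score, top_k=2):
    # Keep only the top_k passages under a fast relevance score, so the
    # FiD reader processes fewer (and better) candidates.
    return sorted(passages, key=score, reverse=True)[:top_k]

question_terms = {"who", "wrote", "hamlet"}
passages = ["shakespeare wrote hamlet", "the moon is far", "hamlet is a play"]
overlap = lambda p: len(question_terms & set(p.split()))
```

Cutting the candidate set before the reader is what drives the latency reduction: the reader's cost grows with the number of passages it must fuse.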

The continued usefulness of vocabulary tests for evaluating large language models

  • paper_url: http://arxiv.org/abs/2310.14703
  • repo_url: https://github.com/wordsgpt/llm_vocabulary_evaluation
  • paper_authors: Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert
  • for: Testing the quality of contemporary language models.
  • methods: Uses the Test of English as a Foreign Language (TOEFL) vocabulary test proposed by Landauer and Dumais (1997), plus a Yes/No test that requires distinguishing existing words from made-up nonwords.
  • results: None of the major contemporary models was perfect on the target-word test, and the models erred on divergent items. On the Yes/No test they performed significantly worse on the nonword items, in line with other observations that current models produce non-existent information. When the tests were generalized to Spanish, most models gave meanings/translations for the majority of random letter sequences; on the plus side, the best models performed quite well and even pointed to nonwords that were unknown to the test participants but can be found in dictionaries.
    Abstract In their seminal article on semantic vectors, Landauer and Dumais (1997) proposed testing the quality of AI language models with a challenging vocabulary test. We show that their Test of English as a Foreign Language (TOEFL) test remains informative for contemporary major language models, since none of the models was perfect and made errors on divergent items. The TOEFL test consists of target words with four alternatives to choose from. We further tested the models on a Yes/No test that requires distinguishing between existing words and made-up nonwords. The models performed significantly worse on the nonword items, in line with other observations that current major language models provide non-existent information. The situation was worse when we generalized the tests to Spanish. Here, most models gave meanings/translations for the majority of random letter sequences. On the plus side, the best models began to perform quite well, and they also pointed to nonwords that were unknown to the test participants but can be found in dictionaries.
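Yes/No vocabulary tests are commonly scored by penalising yes-responses to nonwords; a sketch of one standard correction (hit rate minus false-alarm rate), which is an assumption about scoring, not necessarily what the authors used:

```python
def yes_no_score(word_says_yes, nonword_says_yes):
    # Hit rate on real words minus false-alarm rate on made-up nonwords;
    # a model that says "yes" to everything scores 0, not 1.
    hits = sum(word_says_yes) / len(word_says_yes)
    false_alarms = sum(nonword_says_yes) / len(nonword_says_yes)
    return hits - false_alarms
```

This kind of correction is exactly what exposes the failure mode the abstract reports: models that confidently assign meanings to random letter sequences lose their apparent vocabulary advantage.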

Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14696
  • repo_url: https://github.com/gankim/tree-of-clarifications
  • paper_authors: Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, Jaewoo Kang
  • for: Handling ambiguous questions, i.e., open-domain questions that allow multiple interpretations.
  • methods: Proposes Tree of Clarifications (ToC), a framework that recursively constructs a tree of disambiguations for an ambiguous question via few-shot prompting with external knowledge, and uses the tree to generate a long-form answer.
  • results: ToC outperforms existing baselines on ASQA in a few-shot setup across the metrics, and surpasses fully-supervised baselines trained on the whole training set in terms of Disambig-F1 and Disambig-ROUGE.
    Abstract Questions in open-domain question answering are often ambiguous, allowing multiple interpretations. One approach to handling them is to identify all possible interpretations of the ambiguous question (AQ) and to generate a long-form answer addressing them all, as suggested by Stelmakh et al., (2022). While it provides a comprehensive response without bothering the user for clarification, considering multiple dimensions of ambiguity and gathering corresponding knowledge remains a challenge. To cope with the challenge, we propose a novel framework, Tree of Clarifications (ToC): It recursively constructs a tree of disambiguations for the AQ -- via few-shot prompting leveraging external knowledge -- and uses it to generate a long-form answer. ToC outperforms existing baselines on ASQA in a few-shot setup across the metrics, while surpassing fully-supervised baselines trained on the whole training set in terms of Disambig-F1 and Disambig-ROUGE. Code is available at https://github.com/gankim/tree-of-clarifications.
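The recursive construction is the heart of ToC; a minimal sketch with a stubbed `disambiguate` standing in for few-shot prompting an LLM with retrieved external knowledge:

```python
def tree_of_clarifications(question, disambiguate, max_depth=2):
    # Recursively expand an ambiguous question (AQ) into a tree of
    # clarified interpretations; leaves are unambiguous readings.
    node = {"question": question, "children": []}
    if max_depth > 0:
        for sub in disambiguate(question):
            node["children"].append(
                tree_of_clarifications(sub, disambiguate, max_depth - 1))
    return node

# Stub: one round of disambiguation, then no further splits.
stub = lambda q: [] if ":" in q else [f"reading A: {q}", f"reading B: {q}"]
tree = tree_of_clarifications("Who won the open?", stub)
```

A long-form answer is then generated over the leaves, so every interpretation of the original question is addressed.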

SpEL: Structured Prediction for Entity Linking

  • paper_url: http://arxiv.org/abs/2310.14684
  • repo_url: https://github.com/shavarani/spel
  • paper_authors: Hassan S. Shavarani, Anoop Sarkar
  • for: Entity linking, i.e., creating structured data by linking spans of text to an ontology or knowledge source.
  • methods: Applies structured prediction to entity linking, classifying each input token as an entity and aggregating the token-level predictions; the resulting system is called SpEL (Structured prediction for Entity Linking).
  • results: SpEL surpasses the state of the art on the commonly used AIDA benchmark for entity linking to Wikipedia, with a smaller model output vocabulary and fast inference.
    Abstract Entity linking is a prominent thread of research focused on structured data creation by linking spans of text to an ontology or knowledge source. We revisit the use of structured prediction for entity linking which classifies each individual input token as an entity, and aggregates the token predictions. Our system, called SpEL (Structured prediction for Entity Linking) is a state-of-the-art entity linking system that uses some new ideas to apply structured prediction to the task of entity linking including: two refined fine-tuning steps; a context sensitive prediction aggregation strategy; reduction of the size of the model's output vocabulary, and; we address a common problem in entity-linking systems where there is a training vs. inference tokenization mismatch. Our experiments show that we can outperform the state-of-the-art on the commonly used AIDA benchmark dataset for entity linking to Wikipedia. Our method is also very compute efficient in terms of number of parameters and speed of inference.
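The aggregation step, turning per-token entity predictions into spans, can be illustrated with a simple run-merging pass; SpEL's context-sensitive aggregation strategy is more involved, so treat this only as the basic idea:

```python
def aggregate_token_predictions(labels):
    # Merge runs of consecutive tokens sharing the same non-"O" entity
    # label into (entity, start, end) spans, end exclusive.
    spans, i = [], 0
    while i < len(labels):
        if labels[i] == "O":
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        spans.append((labels[i], i, j))
        i = j
    return spans
```

Predicting at the token level and aggregating afterwards is what lets the model share evidence across a span instead of committing to boundaries up front.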

Pre-Trained Language Models Augmented with Synthetic Scanpaths for Natural Language Understanding

  • paper_url: http://arxiv.org/abs/2310.14676
  • repo_url: None
  • paper_authors: Shuwen Deng, Paul Prasse, David R. Reich, Tobias Scheffer, Lena A. Jäger
  • for: Improving natural language understanding (NLU) by augmenting language models with synthetic human gaze data.
  • methods: Integrates a synthetic scanpath generator with a scanpath-augmented language model, generating human-like eye movements during reading and eliminating the need for real human gaze data.
  • results: The model not only outperforms the underlying language model but matches the performance of a model augmented with real human gaze data.
    Abstract Human gaze data offer cognitive information that reflects natural language comprehension. Indeed, augmenting language models with human scanpaths has proven beneficial for a range of NLP tasks, including language understanding. However, the applicability of this approach is hampered because the abundance of text corpora is contrasted by a scarcity of gaze data. Although models for the generation of human-like scanpaths during reading have been developed, the potential of synthetic gaze data across NLP tasks remains largely unexplored. We develop a model that integrates synthetic scanpath generation with a scanpath-augmented language model, eliminating the need for human gaze data. Since the model's error gradient can be propagated throughout all parts of the model, the scanpath generator can be fine-tuned to downstream tasks. We find that the proposed model not only outperforms the underlying language model, but achieves a performance that is comparable to a language model augmented with real human gaze data. Our code is publicly available.

DPP-TTS: Diversifying prosodic features of speech via determinantal point processes

  • paper_url: http://arxiv.org/abs/2310.14663
  • repo_url: None
  • paper_authors: Seongho Joo, Hyukhun Koh, Kyomin Jung
  • for: Proposing DPP-TTS, a text-to-speech (TTS) model based on Determinantal Point Processes (DPPs), to generate speech with greater prosodic diversity.
  • methods: Samples speech with DPPs and a prosody diversifying module that simultaneously considers perceptual diversity within each sample and among multiple samples.
  • results: In side-by-side comparisons, DPP-TTS generates samples with more diversified prosody than the baselines while preserving the naturalness of the speech.
    Abstract With the rapid advancement in deep generative models, recent neural Text-To-Speech(TTS) models have succeeded in synthesizing human-like speech. There have been some efforts to generate speech with various prosody beyond monotonous prosody patterns. However, previous works have several limitations. First, typical TTS models depend on the scaled sampling temperature for boosting the diversity of prosody. Speech samples generated at high sampling temperatures often lack perceptual prosodic diversity, which can adversely affect the naturalness of the speech. Second, the diversity among samples is neglected since the sampling procedure often focuses on a single speech sample rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech model based on Determinantal Point Processes (DPPs) with a prosody diversifying module. Our TTS model is capable of generating speech samples that simultaneously consider perceptual diversity in each sample and among multiple samples. We demonstrate that DPP-TTS generates speech samples with more diversified prosody than baselines in the side-by-side comparison test considering the naturalness of speech at the same time.
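Why a DPP promotes diversity: the probability of selecting a subset of candidates is proportional to the determinant of the similarity kernel restricted to that subset, and near-duplicate items make the determinant collapse toward zero. A toy two-item sketch of the arithmetic, not the DPP-TTS sampler itself:

```python
def det2(m):
    # Determinant of a 2x2 matrix
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def dpp_pair_score(K, i, j):
    # Under a DPP, P({i, j}) is proportional to det of the kernel
    # restricted to rows/columns {i, j}.
    return det2([[K[i][i], K[i][j]], [K[j][i], K[j][j]]])

# Similarity kernel over three prosody candidates: items 0 and 1 are
# nearly identical, item 2 is distinct.
K = [[1.00, 0.95, 0.10],
     [0.95, 1.00, 0.10],
     [0.10, 0.10, 1.00]]
```

The diverse pair {0, 2} scores far higher than the near-duplicate pair {0, 1}, which is exactly the pressure toward perceptually varied prosody across samples.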

SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras

  • paper_url: http://arxiv.org/abs/2310.14654
  • repo_url: None
  • paper_authors: Nithya R, Malavika S, Jordan F, Arjun Gangwar, Metilda N J, S Umesh, Rithik Sarab, Akhilesh Kumar Dubey, Govind Divakaran, Samudra Vijaya K, Suryakanth V Gangashetty
  • for: Encouraging the language technology community to build speech-based applications for Indian languages.
  • methods: Open-sources SPRING-INX, about 2,000 hours of legally sourced and manually transcribed speech data for ASR system building in ten Indian languages.
  • results: The paper describes the data collection and data cleaning process, along with data statistics.
    Abstract India is home to a multitude of languages of which 22 languages are recognised by the Indian Constitution as official. Building speech based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate. To encourage the language technology community to build speech based applications in Indian languages, we are open sourcing SPRING-INX data which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil. This endeavor is by SPRING Lab , Indian Institute of Technology Madras and is a part of National Language Translation Mission (NLTM), funded by the Indian Ministry of Electronics and Information Technology (MeitY), Government of India. We describe the data collection and data cleaning process along with the data statistics in this paper.

Multilingual k-Nearest-Neighbor Machine Translation

  • paper_url: http://arxiv.org/abs/2310.14644
  • repo_url: None
  • paper_authors: David Stap, Christof Monz
  • for: Improving machine translation quality, which for k-nearest-neighbor translation has so far been limited to high-resource language pairs.
  • methods: Combines representations from multiple languages into a single datastore.
  • results: Substantial gains for low-resource translation (up to +3.6 BLEU) and smaller gains for high-resource translation (up to +0.5 BLEU), with multilingual datastores a quarter of the size and a 5.3x speed improvement.
    Abstract k-nearest-neighbor machine translation has demonstrated remarkable improvements in machine translation quality by creating a datastore of cached examples. However, these improvements have been limited to high-resource language pairs, with large datastores, and remain a challenge for low-resource languages. In this paper, we address this issue by combining representations from multiple languages into a single datastore. Our results consistently demonstrate substantial improvements not only in low-resource translation quality (up to +3.6 BLEU), but also for high-resource translation quality (up to +0.5 BLEU). Our experiments show that it is possible to create multilingual datastores that are a quarter of the size, achieving a 5.3x speed improvement, by using linguistic similarities for datastore creation.
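The core move is pooling cached (hidden-state, target-token) entries from several languages into one datastore and retrieving neighbours from the pooled store; a toy sketch with made-up two-dimensional keys (real kNN-MT keys are decoder hidden states and the search uses an approximate index):

```python
def build_datastore(per_language_entries):
    # Merge (key-vector, target-token) pairs from all languages into
    # a single flat datastore.
    store = []
    for entries in per_language_entries.values():
        store.extend(entries)
    return store

def knn_tokens(store, query, k=2):
    # Return the target tokens of the k nearest keys to the query.
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return [tok for _, tok in sorted(store, key=lambda e: dist(e[0], query))[:k]]

store = build_datastore({
    "nl": [((0.0, 0.0), "hond")],
    "de": [((0.1, 0.0), "hund"), ((5.0, 5.0), "katze")],
})
```

Because linguistically similar languages place related translations near each other in representation space, a low-resource query can retrieve useful neighbours contributed by a related higher-resource language.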

Extending Input Contexts of Language Models through Training on Segmented Sequences

  • paper_url: http://arxiv.org/abs/2310.14633
  • repo_url: None
  • paper_authors: Petros Karypis, Julian McAuley, George Karypis
  • for: Effectively training language models on long inputs.
  • methods: Trains on segmented sequences, sub-sampling segments from long inputs while maintaining their original positions, and uses an interpolation-based method to extend absolute positional embeddings.
  • results: Extends the input context size of pretrained models by a factor of 4x, with no architectural changes and no additional memory cost, while improving perplexity on sequences longer than those seen in training; the method benefits models with both absolute and relative positional embeddings.
    Abstract Effectively training language models on long inputs poses many technical challenges. As a cost consideration, languages models are pretrained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences and an interpolation-based method for extending absolute positional embeddings. We develop a training procedure to extend the input context size of pretrained models with no architectural changes and no additional memory costs than training on the original input lengths. By sub-sampling segments from long inputs while maintaining their original position the model is able to learn new positional interactions. Our method benefits both models trained with absolute positional embeddings, by extending their input contexts, as well as popular relative positional embedding methods showing a reduced perplexity on sequences longer than they were trained on. We demonstrate our method can extend input contexts by a factor of 4x while improving perplexity.
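The key trick described above, sub-sampling segments of a long input while keeping each token's original position id, can be sketched directly:

```python
def subsample_segments(tokens, seg_len, keep):
    # Split into segments of seg_len tokens, keep only the segments
    # listed in `keep`, but pair every token with its ORIGINAL position
    # so the model trains on position ids from the full-length input.
    pairs = []
    for s in keep:
        start = s * seg_len
        for i in range(start, min(start + seg_len, len(tokens))):
            pairs.append((tokens[i], i))
    return pairs
```

Although the kept tokens fit in the original training length, their position ids span the long input, which is how the model learns new positional interactions without extra memory.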

Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts

  • paper_url: http://arxiv.org/abs/2310.14628
  • repo_url: https://github.com/tengxiaoliu/xot
  • paper_authors: Tengxiao Liu, Qipeng Guo, Yuqing Yang, Xiangkun Hu, Yue Zhang, Xipeng Qiu, Zheng Zhang
  • for: This paper proposes an integrated problem-solving framework for math reasoning tasks that effectively exploits the complementary strengths of different prompting methods.
  • methods: Uses diverse reasoning thoughts, including Chain of Thought and Program of Thought; for each question the framework selects the most suitable method, executes methods iteratively, checks the validity of each generated answer with feedback from external executors, and dynamically switches among prompting methods.
  • results: Extensive experiments on 10 popular math reasoning datasets demonstrate the effectiveness of the approach; the results further suggest that the framework is orthogonal to recent work on single reasoning methods and generalizes to the logical reasoning domain.
    Abstract As large language models (LLMs) have shown effectiveness with different prompting methods, such as Chain of Thought, Program of Thought, we find that these methods have formed a great complementarity to each other on math reasoning tasks. In this work, we propose XoT, an integrated problem solving framework by prompting LLMs with diverse reasoning thoughts. For each question, XoT always begins with selecting the most suitable method then executes each method iteratively. Within each iteration, XoT actively checks the validity of the generated answer and incorporates the feedback from external executors, allowing it to dynamically switch among different prompting methods. Through extensive experiments on 10 popular math reasoning datasets, we demonstrate the effectiveness of our proposed approach and thoroughly analyze the strengths of each module. Moreover, empirical results suggest that our framework is orthogonal to recent work that makes improvements on single reasoning methods and can further generalise to logical reasoning domain. By allowing method switching, XoT provides a fresh perspective on the collaborative integration of diverse reasoning thoughts in a unified framework.
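The plan-verify-switch loop can be sketched with stubbed solvers and an external verifier; the stubs below are illustrations, not the paper's actual prompting methods:

```python
def xot_solve(question, methods, verify):
    # Try each prompting method in turn; an external executor verifies
    # candidate answers, and we switch methods until one passes.
    for name, solve in methods:
        answer = solve(question)
        if verify(answer):
            return name, answer
    return None, None

methods = [("chain-of-thought", lambda q: 11),
           ("program-of-thought", lambda q: 12)]
```

The verification hook is what makes switching productive: a wrong answer from one method triggers a retry under a different reasoning style instead of being returned.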

CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster Tweet Classification

  • paper_url: http://arxiv.org/abs/2310.14627
  • repo_url: https://github.com/HenryPengZou/DeCrisisMB
  • paper_authors: Henry Peng Zou, Yue Zhou, Cornelia Caragea, Doina Caragea
  • for: Improving the monitoring of disaster events by classifying tweets with only a small amount of annotated data and large amounts of unlabeled data.
  • methods: A semi-supervised, few-shot learning model, CrisisMatch, that performs fine-grained classification from few labeled examples and abundant unlabeled data, integrating effective semi-supervised learning ideas with TextMixUp.
  • results: An average improvement of 11.2% over baselines on two disaster datasets, with further analyses of the influence of the amount of labeled data and of out-of-domain results.
    Abstract The shared real-time information about natural disasters on social media platforms like Twitter and Facebook plays a critical role in informing volunteers, emergency managers, and response organizations. However, supervised learning models for monitoring disaster events require large amounts of annotated data, making them unrealistic for real-time use in disaster events. To address this challenge, we present a fine-grained disaster tweet classification model under the semi-supervised, few-shot learning setting where only a small number of annotated data is required. Our model, CrisisMatch, effectively classifies tweets into fine-grained classes of interest using few labeled data and large amounts of unlabeled data, mimicking the early stage of a disaster. Through integrating effective semi-supervised learning ideas and incorporating TextMixUp, CrisisMatch achieves performance improvement on two disaster datasets of 11.2\% on average. Further analyses are also provided for the influence of the number of labeled data and out-of-domain results.
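TextMixUp, mentioned above, builds synthetic training points by interpolating pairs of examples; the sketch below shows only the interpolation arithmetic on toy feature vectors, not the paper's text-specific variant:

```python
def mixup(x1, x2, lam=0.7):
    # Convex combination lam * x1 + (1 - lam) * x2; the same mixing is
    # applied to the two examples' label vectors.
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]

mixed = mixup([1.0, 0.0], [0.0, 1.0], lam=0.7)
```

Interpolated examples densify the decision boundary between fine-grained classes, which is especially useful when only a handful of labeled tweets per class are available.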

Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue

  • paper_url: http://arxiv.org/abs/2310.14626
  • repo_url: None
  • paper_authors: Yuanxing Liu, Wei-Nan Zhang, Yifan Chen, Yuchi Zhang, Haopeng Bai, Fan Feng, Hengbin Cui, Yongbin Li, Wanxiang Che
  • for: Exploring how large language models (LLMs) and conversational recommender systems (CRSs) can cooperate in e-commerce pre-sales dialogue to improve the accuracy and usefulness of recommendations.
  • methods: Two collaboration methods, CRS assisting LLM and LLM assisting CRS, evaluated in extensive experiments on four e-commerce pre-sales dialogue tasks.
  • results: Collaboration between a CRS and an LLM can be very effective in some cases.
    Abstract E-commerce pre-sales dialogue aims to understand and elicit user needs and preferences for the items they are seeking so as to provide appropriate recommendations. Conversational recommender systems (CRSs) learn user representation and provide accurate recommendations based on dialogue context, but rely on external knowledge. Large language models (LLMs) generate responses that mimic pre-sales dialogues after fine-tuning, but lack domain-specific knowledge for accurate recommendations. Intuitively, the strengths of LLM and CRS in E-commerce pre-sales dialogues are complementary, yet no previous work has explored this. This paper investigates the effectiveness of combining LLM and CRS in E-commerce pre-sales dialogues, proposing two collaboration methods: CRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a real-world dataset of Ecommerce pre-sales dialogues. We analyze the impact of two collaborative approaches with two CRSs and two LLMs on four tasks of Ecommerce pre-sales dialogue. We find that collaborations between CRS and LLM can be very effective in some cases.

CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine Chain-of-Thought Prompting for Multi-domain NLU Tasks

  • paper_url: http://arxiv.org/abs/2310.14623
  • repo_url: None
  • paper_authors: Hoang H. Nguyen, Ye Liu, Chenwei Zhang, Tao Zhang, Philip S. Yu
  • for: This paper aims to improve the performance of large language models (LLMs) in natural language understanding (NLU) tasks by proposing a Coarse-to-Fine Chain-of-Thought (CoF-CoT) approach that breaks down NLU tasks into multiple reasoning steps.
  • methods: The proposed approach uses semantic-based Abstract Meaning Representation (AMR) structured knowledge as an intermediate step to capture the nuances and diverse structures of utterances, and to understand connections between their varying levels of granularity.
  • results: The proposed approach is demonstrated effective in assisting LLMs adapt to multi-grained NLU tasks under both zero-shot and few-shot multi-domain settings.
    Abstract While Chain-of-Thought prompting is popular in reasoning tasks, its application to Large Language Models (LLMs) in Natural Language Understanding (NLU) is under-explored. Motivated by multi-step reasoning of LLMs, we propose a Coarse-to-Fine Chain-of-Thought (CoF-CoT) approach that breaks down NLU tasks into multiple reasoning steps where LLMs can learn to acquire and leverage essential concepts to solve tasks from different granularities. Moreover, we propose leveraging semantic-based Abstract Meaning Representation (AMR) structured knowledge as an intermediate step to capture the nuances and diverse structures of utterances, and to understand connections between their varying levels of granularity. Our proposed approach is demonstrated to be effective in helping LLMs adapt to multi-grained NLU tasks under both zero-shot and few-shot multi-domain settings.
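The multi-step decomposition described above can be sketched as a chain of prompts whose outputs feed each other, with a semantic-structure (AMR-like) step in between. The step wordings and the stand-in `llm` function below are illustrative assumptions, not the paper's actual prompts; a real pipeline would call an LLM at each step.

```python
# Sketch of a coarse-to-fine prompt chain in the spirit of CoF-CoT. The steps
# and the fake model are assumptions for illustration only.

STEPS = [
    "Identify the coarse intent of the utterance: {x}",
    "Sketch the semantic structure (AMR-style) of: {x}",
    "Using the structure, extract fine-grained slots from: {x}",
]

def llm(prompt):
    """Stand-in model: tags the prompt so the chaining is visible."""
    return f"<out of: {prompt[:30]}...>"

def cof_cot(utterance):
    x = utterance
    trace = []
    for template in STEPS:
        x = llm(template.format(x=x))  # each step consumes the previous output
        trace.append(x)
    return trace

trace = cof_cot("book a flight to Hanoi tomorrow")
```

Each entry in `trace` is one reasoning step; swapping `llm` for a real model call turns the sketch into a runnable pipeline.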

Efficient Cross-Task Prompt Tuning for Few-Shot Conversational Emotion Recognition

  • paper_url: http://arxiv.org/abs/2310.14614
  • repo_url: None
  • paper_authors: Yige Xu, Zhiwei Zeng, Zhiqi Shen
  • for: This work aims to improve Emotion Recognition in Conversation (ERC) using pre-trained language models (PLMs).
  • methods: The authors propose a derivative-free optimization method, Cross-Task Prompt Tuning (CTPT), for few-shot learning; CTPT leverages sharable knowledge across tasks to improve learning efficiency.
  • results: Experiments on five different conversation datasets show that CTPT performs strongly in both few-shot scenarios and zero-shot transfer.
    Abstract Emotion Recognition in Conversation (ERC) has been widely studied due to its importance in developing emotion-aware empathetic machines. The rise of pre-trained language models (PLMs) has further pushed the limit of ERC performance. However, most recent works on ERC using PLMs are heavily data-driven and require fine-tuning of the entire PLM. To improve both sample and computational efficiency, we propose a derivative-free optimization method called Cross-Task Prompt Tuning (CTPT) for few-shot conversational emotion recognition. Unlike existing methods that learn independent knowledge from individual tasks, CTPT leverages sharable cross-task knowledge by exploiting external knowledge from other source tasks to improve learning performance under the few-shot setting. Moreover, CTPT only needs to optimize a vector under the low intrinsic dimensionality without gradient, which is highly parameter-efficient compared with existing approaches. Experiments on five different contextual conversation datasets demonstrate that our CTPT method has superior results on both few-shot scenarios and zero-shot transfers.
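The efficiency claim above, optimizing a single vector in a low intrinsic dimension without gradients, can be sketched as a derivative-free random search over a small vector projected into the full prompt-embedding space. Everything concrete here (dimensions, the fixed projection, the quadratic stand-in loss) is an illustrative assumption, not the paper's setup; in practice the loss would come from a frozen PLM evaluated on few-shot data.

```python
import numpy as np

rng = np.random.default_rng(0)

D_PROMPT = 512  # full prompt-embedding dimension (illustrative)
D_LOW = 8       # low intrinsic dimension actually optimized, gradient-free

A = rng.standard_normal((D_PROMPT, D_LOW)) / np.sqrt(D_LOW)  # fixed projection
target = rng.standard_normal(D_PROMPT)  # stand-in for a well-performing prompt

def loss(z):
    """Stand-in for the few-shot task loss of prompt embedding A @ z."""
    return float(np.linalg.norm(A @ z - target))

# (1+1)-style random search: only the low-dimensional vector z is updated,
# and no gradients are ever taken.
z = np.zeros(D_LOW)
best = loss(z)
for _ in range(500):
    cand = z + 0.1 * rng.standard_normal(D_LOW)
    c = loss(cand)
    if c < best:
        z, best = cand, c
```

The frozen model only ever sees `A @ z`, which is why only the 8 entries of `z` need to be stored and tuned per task.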

That was the last straw, we need more: Are Translation Systems Sensitive to Disambiguating Context?

  • paper_url: http://arxiv.org/abs/2310.14610
  • repo_url: None
  • paper_authors: Jaechan Lee, Alisa Liu, Orevaoghene Ahia, Hila Gonen, Noah A. Smith
  • for: This work studies the translation of English idiomatic expressions, whose literal and figurative readings can diverge across languages.
  • methods: The authors compare MT models and language models on ambiguous sentences, collecting 512 pairs of English sentences containing idioms with disambiguating context, one literal and one figurative, and testing translations into multiple target languages.
  • results: Current MT models tend to translate ambiguous idioms literally even when context suggests a figurative reading, whereas language models are far more context-aware, despite some disparities across target languages; this suggests language models could be a strong backbone for context-aware translation.
    Abstract The translation of ambiguous text presents a challenge for translation systems, as it requires using the surrounding context to disambiguate the intended meaning as much as possible. While prior work has studied ambiguities that result from different grammatical features of the source and target language, we study semantic ambiguities that exist in the source (English in this work) itself. In particular, we focus on idioms that are open to both literal and figurative interpretations (e.g., goose egg), and collect TIDE, a dataset of 512 pairs of English sentences containing idioms with disambiguating context such that one is literal (it laid a goose egg) and another is figurative (they scored a goose egg, as in a score of zero). In experiments, we compare MT-specific models and language models for (i) their preference when given an ambiguous subsentence, (ii) their sensitivity to disambiguating context, and (iii) the performance disparity between figurative and literal source sentences. We find that current MT models consistently translate English idioms literally, even when the context suggests a figurative interpretation. On the other hand, LMs are far more context-aware, although there remain disparities across target languages. Our findings underline the potential of LMs as a strong backbone for context-aware translation.

Long Short-Term Planning for Conversational Recommendation Systems

  • paper_url: http://arxiv.org/abs/2310.14609
  • repo_url: None
  • paper_authors: Xian Li, Hongguang Shi, Yunfei Wang, Yeqin Zhang, Xubin Li, Cam-Tu Nguyen
  • for: This work aims to improve conversational recommendation systems (CRS) so that the conversational agent can naturally ask for user preferences and provide suitable recommendations.
  • methods: The paper proposes a novel long short-term feedback architecture that connects the conversation model and the recommendation model, allowing these two essential CRS components to fully interact.
  • results: The study shows that the architecture better captures user preferences and provides suitable recommendations.
    Abstract In Conversational Recommendation Systems (CRS), the central question is how the conversational agent can naturally ask for user preferences and provide suitable recommendations. Existing works mainly follow the hierarchical architecture, where a higher policy decides whether to invoke the conversation module (to ask questions) or the recommendation module (to make recommendations). This architecture prevents these two components from fully interacting with each other. In contrast, this paper proposes a novel architecture, the long short-term feedback architecture, to connect these two essential components in CRS. Specifically, the recommendation predicts the long-term recommendation target based on the conversational context and the user history. Driven by the targeted recommendation, the conversational model predicts the next topic or attribute to verify if the user preference matches the target. The balance feedback loop continues until the short-term planner output matches the long-term planner output, that is when the system should make the recommendation.
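The balance feedback loop described above can be sketched as a simple control loop: a long-term planner fixes a recommendation target, and a short-term planner keeps choosing attributes to verify until its output matches that target. The planners below are hard-coded stand-ins (the attribute names and the target are invented for illustration), not the paper's learned models.

```python
def long_term_planner(history):
    """Stand-in: predict the long-term recommendation target from context."""
    return "laptop"

def short_term_planner(history):
    """Stand-in: pick the next topic/attribute to verify, or commit."""
    if "budget" not in history:
        return ("ask", "budget")
    if "brand" not in history:
        return ("ask", "brand")
    return ("recommend", "laptop")

history = []
target = long_term_planner(history)
turns = []
while True:
    action, value = short_term_planner(history)
    if action == "recommend" and value == target:
        # Short-term output matches the long-term target: recommend now.
        turns.append(f"recommend:{value}")
        break
    turns.append(f"ask:{value}")
    history.append(value)  # assume the user answers each question
```

The loop terminates exactly when the two planners agree, which is the system's signal to make the recommendation.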

Investigating the Fairness of Large Language Models for Predictions on Tabular Data

  • paper_url: http://arxiv.org/abs/2310.14607
  • repo_url: None
  • paper_authors: Yanchen Liu, Srishti Gautam, Jiaqi Ma, Himabindu Lakkaraju
  • for: This paper investigates whether large language models (LLMs) making predictions on tabular data inherit social biases, and what the implications are for fairness.
  • methods: The authors conduct a series of experiments probing whether LLMs inherit social biases in tabular prediction tasks and how these biases affect fairness.
  • results: The results show that LLMs inherit social biases from their training data, which harms fairness in tabular prediction; in-context learning and fine-tuning mitigate bias only moderately and do not eliminate it.
    Abstract Recent literature has suggested the potential of using large language models (LLMs) to make predictions for tabular tasks. However, LLMs have been shown to exhibit harmful social biases that reflect the stereotypes and inequalities present in the society. To this end, as well as the widespread use of tabular data in many high-stake applications, it is imperative to explore the following questions: what sources of information do LLMs draw upon when making predictions for tabular tasks; whether and to what extent are LLM predictions for tabular tasks influenced by social biases and stereotypes; and what are the consequential implications for fairness? Through a series of experiments, we delve into these questions and show that LLMs tend to inherit social biases from their training data which significantly impact their fairness in tabular prediction tasks. Furthermore, our investigations show that in the context of bias mitigation, though in-context learning and fine-tuning have a moderate effect, the fairness metric gap between different subgroups is still larger than that in traditional machine learning models, such as Random Forest and shallow Neural Networks. This observation emphasizes that the social biases are inherent within the LLMs themselves and inherited from their pre-training corpus, not only from the downstream task datasets. Besides, we demonstrate that label-flipping of in-context examples can significantly reduce biases, further highlighting the presence of inherent bias within LLMs.

M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal Aspect-based Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2310.14605
  • repo_url: https://github.com/grandchicken/m2df
  • paper_authors: Fei Zhao, Chunhui Li, Zhen Wu, Yawen Ouyang, Jianbing Zhang, Xinyu Dai
  • for: This work aims to improve Multimodal Aspect-based Sentiment Analysis (MABSA), in particular by reducing the negative impact of noisy images in the data without modifying the dataset.
  • methods: The authors propose M2DF, a Multi-grained Multi-curriculum Denoising Framework based on curriculum learning, which achieves denoising by adjusting the order of the training data.
  • results: The method consistently outperforms the state of the art on three MABSA sub-tasks.
    Abstract Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained Sentiment Analysis task, which has attracted growing research interests recently. Existing work mainly utilizes image information to improve the performance of MABSA task. However, most of the studies overestimate the importance of images since there are many noise images unrelated to the text in the dataset, which will have a negative impact on model learning. Although some work attempts to filter low-quality noise images by setting thresholds, relying on thresholds will inevitably filter out a lot of useful image information. Therefore, in this work, we focus on whether the negative impact of noisy images can be reduced without modifying the data. To achieve this goal, we borrow the idea of Curriculum Learning and propose a Multi-grained Multi-curriculum Denoising Framework (M2DF), which can achieve denoising by adjusting the order of training data. Extensive experimental results show that our framework consistently outperforms state-of-the-art work on three sub-tasks of MABSA.
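The core denoising move above, reordering training data rather than filtering it with a threshold, can be sketched with a single curriculum: schedule examples from most to least reliable according to a noise score. The image-text similarity scores below are invented for illustration, and the paper combines multiple granularities and curricula rather than this one sort.

```python
# Minimal sketch of curriculum-style denoising by reordering, in the spirit
# of M2DF (not the authors' code; scores are made up).

samples = [
    {"id": "a", "img_txt_sim": 0.91},  # image clearly related to the text
    {"id": "b", "img_txt_sim": 0.12},  # likely a noisy, unrelated image
    {"id": "c", "img_txt_sim": 0.55},
]

# One curriculum: present low-noise (high-similarity) examples first instead
# of discarding anything, so no useful image information is thrown away.
schedule = sorted(samples, key=lambda s: s["img_txt_sim"], reverse=True)
order = [s["id"] for s in schedule]
```

Unlike threshold filtering, every sample is eventually seen; noisy ones simply arrive later, when the model is more robust.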

Generative Pre-trained Transformer for Vietnamese Community-based COVID-19 Question Answering

  • paper_url: http://arxiv.org/abs/2310.14602
  • repo_url: None
  • paper_authors: Tam Minh Vo, Khiem Vinh Tran
  • for: This paper explores the potential of the Generative Pre-trained Transformer (GPT) for natural language processing in Vietnamese.
  • methods: The authors use GPT-2 as a decoder and conduct a comparative analysis against other Transformers and SOTA models on a community-based question answering task.
  • results: Experiments show that the GPT-2 models perform strongly on community-based COVID-19 question answering, outperforming other SOTA models as well as previous community-based question answering models developed for Vietnamese.
    Abstract Recent studies have provided empirical evidence of the wide-ranging potential of Generative Pre-trained Transformer (GPT), a pretrained language model, in the field of natural language processing. GPT has been effectively employed as a decoder within state-of-the-art (SOTA) question answering systems, yielding exceptional performance across various tasks. However, the current research landscape concerning GPT's application in Vietnamese remains limited. This paper aims to address this gap by presenting an implementation of GPT-2 for community-based question answering specifically focused on COVID-19 related queries in Vietnamese. We introduce a novel approach by conducting a comparative analysis of different Transformers vs SOTA models in the community-based COVID-19 question answering dataset. The experimental findings demonstrate that the GPT-2 models exhibit highly promising outcomes, outperforming other SOTA models as well as previous community-based COVID-19 question answering models developed for Vietnamese.

Large Search Model: Redefining Search Stack in the Era of LLMs

  • paper_url: http://arxiv.org/abs/2310.14587
  • repo_url: None
  • paper_authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
  • for: This paper proposes a new conceptual framework, the large search model, which redefines the search stack by casting all search tasks as natural language generation problems for a single large language model (LLM), so that tasks can be customized through natural language prompts.
  • methods: The framework capitalizes on the strong language understanding and reasoning capabilities of LLMs to improve search result quality while simplifying the existing cumbersome search stack.
  • results: The authors present a series of proof-of-concept experiments substantiating the feasibility of the framework, and discuss the challenges of implementing the approach in real-world search systems.
    Abstract Modern search engines are built on a stack of different components, including query understanding, retrieval, multi-stage ranking, and question answering, among others. These components are often optimized and deployed independently. In this paper, we introduce a novel conceptual framework called large search model, which redefines the conventional search stack by unifying search tasks with one large language model (LLM). All tasks are formulated as autoregressive text generation problems, allowing for the customization of tasks through the use of natural language prompts. This proposed framework capitalizes on the strong language understanding and reasoning capabilities of LLMs, offering the potential to enhance search result quality while simultaneously simplifying the existing cumbersome search stack. To substantiate the feasibility of this framework, we present a series of proof-of-concept experiments and discuss the potential challenges associated with implementing this approach within real-world search systems.
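The unification described above, every search-stack component becoming a text-generation task customized by a prompt, can be sketched with a few templates feeding one model. The template wordings and task names below are hypothetical illustrations, not prompts from the paper.

```python
# Illustrative only: casting query understanding, ranking, and QA as prompts
# to one generative model.

TEMPLATES = {
    "query_understanding": "Rewrite the query for retrieval: {query}",
    "ranking": "Query: {query}\nDocument: {doc}\nRelevant? Answer yes or no:",
    "qa": "Answer using the documents.\n{docs}\nQuestion: {query}\nAnswer:",
}

def build_prompt(task, **fields):
    """Every search task becomes autoregressive text generation; one large
    model would consume all of these prompts."""
    return TEMPLATES[task].format(**fields)

p = build_prompt("ranking", query="llm search", doc="A paper on LLMs.")
```

Adding a new search capability then means adding a template, not deploying and tuning a separate component.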

JointMatch: A Unified Approach for Diverse and Collaborative Pseudo-Labeling to Semi-Supervised Text Classification

  • paper_url: http://arxiv.org/abs/2310.14583
  • repo_url: https://github.com/HenryPengZou/JointMatch
  • paper_authors: Henry Peng Zou, Cornelia Caragea
  • for: This paper aims to improve semi-supervised text classification (SSTC) by addressing the problems of pseudo-label bias and error accumulation.
  • methods: The authors propose JointMatch, which unifies ideas from recent semi-supervised learning and learning with noise; it adaptively adjusts class-wise thresholds to reduce the model's bias toward currently easy classes, and uses two differently initialized networks that teach each other via cross-labeling.
  • results: On benchmark datasets, JointMatch achieves a significant average improvement of 5.13%; notably, in the extremely-scarce-label setting it reaches 86% accuracy on AG News with only 5 labels per class.
    Abstract Semi-supervised text classification (SSTC) has gained increasing attention due to its ability to leverage unlabeled data. However, existing approaches based on pseudo-labeling suffer from the issues of pseudo-label bias and error accumulation. In this paper, we propose JointMatch, a holistic approach for SSTC that addresses these challenges by unifying ideas from recent semi-supervised learning and the task of learning with noise. JointMatch adaptively adjusts classwise thresholds based on the learning status of different classes to mitigate model bias towards current easy classes. Additionally, JointMatch alleviates error accumulation by utilizing two differently initialized networks to teach each other in a cross-labeling manner. To maintain divergence between the two networks for mutual learning, we introduce a strategy that weighs more disagreement data while also allowing the utilization of high-quality agreement data for training. Experimental results on benchmark datasets demonstrate the superior performance of JointMatch, achieving a significant 5.13% improvement on average. Notably, JointMatch delivers impressive results even in the extremely-scarce-label setting, obtaining 86% accuracy on AG News with only 5 labels per class. We make our code available at https://github.com/HenryPengZou/JointMatch.
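The class-wise threshold adaptation above can be sketched as scaling a base confidence threshold by each class's learning status, so pseudo-labels from currently easy classes face a stricter bar. The scaling formula and the numbers are illustrative assumptions, not the paper's exact rule.

```python
# Hedged sketch of adaptive class-wise pseudo-label thresholds, in the spirit
# of JointMatch.

base_tau = 0.95  # base confidence threshold for accepting pseudo-labels

# Running average model confidence per class on unlabeled data (made up).
avg_conf = {"pos": 0.9, "neg": 0.3}

# Classes the model finds easy keep a high threshold; hard classes get a
# lower one, so their pseudo-labels are not starved out of training.
max_conf = max(avg_conf.values())
tau = {c: base_tau * (avg_conf[c] / max_conf) for c in avg_conf}

def accept(label, confidence):
    """Keep a pseudo-label only if it clears its class-specific threshold."""
    return confidence >= tau[label]
```

With these numbers, a "neg" prediction at 0.5 confidence is kept while a "pos" prediction at 0.9 is rejected, which is exactly the rebalancing effect the method is after.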

DeCrisisMB: Debiased Semi-Supervised Learning for Crisis Tweet Classification via Memory Bank

  • paper_url: http://arxiv.org/abs/2310.14577
  • repo_url: https://github.com/HenryPengZou/DeCrisisMB
  • paper_authors: Henry Peng Zou, Yue Zhou, Weizhi Zhang, Cornelia Caragea
  • for: This work aims to improve the accuracy and generalization ability of semi-supervised models for crisis event monitoring and rescue, addressing their susceptibility to class bias.
  • methods: The authors first study two recent debiasing methods, then propose DeCrisisMB, a simple yet effective debiasing method that uses a Memory Bank to store generated pseudo-labels and sample them equally from each class at every training iteration, removing class bias.
  • results: Extensive experiments comparing the performance and generalization ability of different debiasing methods show that the proposed method outperforms the others in both in-distribution and out-of-distribution settings.
    Abstract During crisis events, people often use social media platforms such as Twitter to disseminate information about the situation, warnings, advice, and support. Emergency relief organizations leverage such information to acquire timely crisis circumstances and expedite rescue operations. While existing works utilize such information to build models for crisis event analysis, fully-supervised approaches require annotating vast amounts of data and are impractical due to limited response time. On the other hand, semi-supervised models can be biased, performing moderately well for certain classes while performing extremely poorly for others, resulting in substantially negative effects on disaster monitoring and rescue. In this paper, we first study two recent debiasing methods on semi-supervised crisis tweet classification. Then we propose a simple but effective debiasing method, DeCrisisMB, that utilizes a Memory Bank to store and perform equal sampling for generated pseudo-labels from each class at each training iteration. Extensive experiments are conducted to compare different debiasing methods' performance and generalization ability in both in-distribution and out-of-distribution settings. The results demonstrate the superior performance of our proposed method. Our code is available at https://github.com/HenryPengZou/DeCrisisMB.
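The Memory Bank mechanism above can be sketched as per-class buffers of pseudo-labeled examples from which the same number is drawn for every class at each iteration, which is what removes the class bias. The class names and buffer sizes below are illustrative; this is a sketch of the mechanism, not the released code.

```python
import random
from collections import defaultdict, deque

class MemoryBank:
    """Per-class buffer of pseudo-labeled examples with equal sampling."""

    def __init__(self, capacity_per_class=100):
        self.bank = defaultdict(lambda: deque(maxlen=capacity_per_class))

    def add(self, example, pseudo_label):
        self.bank[pseudo_label].append(example)

    def equal_sample(self, per_class, rng):
        # Draw the same number from every class, so no class dominates
        # the pseudo-labeled training batch.
        batch = []
        for label in self.bank:
            items = list(self.bank[label])
            batch.extend(rng.sample(items, min(per_class, len(items))))
        return batch

rng = random.Random(0)
mb = MemoryBank()
for i in range(10):
    mb.add(f"tweet_{i}", "request")  # heavily generated pseudo-class
mb.add("tweet_x", "casualty")        # rarely generated pseudo-class
batch = mb.equal_sample(per_class=1, rng=rng)
```

Even though "request" pseudo-labels outnumber "casualty" ten to one, each class contributes one example per batch.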

Exploring the Boundaries of GPT-4 in Radiology

  • paper_url: http://arxiv.org/abs/2310.14573
  • repo_url: None
  • paper_authors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Maria Teodora Wetscherek, Robert Tinn, Harshita Sharma, Fernando Pérez-García, Anton Schwaighofer, Pranav Rajpurkar, Sameer Tajdin Khanna, Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya V. Nori, Matthew P. Lungren, Ozan Oktay, Javier Alvarez-Valle
  • for: This paper assesses the performance of GPT-4 on text-based applications for radiology reports, comparing it against state-of-the-art (SOTA) radiology-specific models.
  • methods: The paper explores various prompting strategies to evaluate GPT-4 on a diverse range of common radiology tasks, including zero-shot prompting and example-based prompting.
  • results: GPT-4 either outperforms or is on par with current SOTA radiology models in various tasks, including temporal sentence similarity classification and natural language inference. Additionally, GPT-4 outputs for findings summarization are found to be overall comparable with existing manually-written impressions.
    Abstract The recent success of general-domain large language models (LLMs) has significantly changed the natural language processing paradigm towards a unified foundation model across domains and applications. In this paper, we focus on assessing the performance of GPT-4, the most capable LLM so far, on the text-based applications for radiology reports, comparing against state-of-the-art (SOTA) radiology-specific models. Exploring various prompting strategies, we evaluated GPT-4 on a diverse range of common radiology tasks and we found GPT-4 either outperforms or is on par with current SOTA radiology models. With zero-shot prompting, GPT-4 already obtains substantial gains ($\approx$ 10% absolute improvement) over radiology models in temporal sentence similarity classification (accuracy) and natural language inference ($F_1$). For tasks that require learning dataset-specific style or schema (e.g. findings summarisation), GPT-4 improves with example-based prompting and matches supervised SOTA. Our extensive error analysis with a board-certified radiologist shows GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex context that require nuanced domain knowledge. For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.

A Boundary Offset Prediction Network for Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2310.18349
  • repo_url: https://github.com/mhtang1995/bopn
  • paper_authors: Minghao Tang, Yongquan He, Yongxiu Xu, Hongbo Xu, Wenyuan Zhang, Yang Lin
  • for: This work aims to improve named entity recognition (NER) by addressing the imbalanced sample space of span-based methods and their neglect of non-entity spans.
  • methods: The authors propose the Boundary Offset Prediction Network (BOPN), which predicts the boundary offsets between candidate spans and their nearest entity spans; guided by the semantics of these offsets, BOPN connects non-entity and entity spans so that non-entity spans can serve as additional positive samples for entity detection. The method also integrates entity type and span representations to generate type-aware boundary offsets instead of using entity types directly as detection targets.
  • results: Experiments on eight widely used NER datasets show that BOPN outperforms previous state-of-the-art methods.
    Abstract Named entity recognition (NER) is a fundamental task in natural language processing that aims to identify and classify named entities in text. However, span-based methods for NER typically assign entity types to text spans, resulting in an imbalanced sample space and neglecting the connections between non-entity and entity spans. To address these issues, we propose a novel approach for NER, named the Boundary Offset Prediction Network (BOPN), which predicts the boundary offsets between candidate spans and their nearest entity spans. By leveraging the guiding semantics of boundary offsets, BOPN establishes connections between non-entity and entity spans, enabling non-entity spans to function as additional positive samples for entity detection. Furthermore, our method integrates entity type and span representations to generate type-aware boundary offsets instead of using entity types as detection targets. We conduct experiments on eight widely-used NER datasets, and the results demonstrate that our proposed BOPN outperforms previous state-of-the-art methods.
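The training target BOPN introduces, the boundary offset from a candidate span to its nearest entity span, can be sketched with a toy distance. The nearest-span criterion below (sum of absolute start/end differences) is a simplification for illustration, not necessarily the paper's exact definition.

```python
entity_spans = [(2, 4), (7, 7)]  # gold (start, end) token spans

def nearest_offset(span):
    """Offset (d_start, d_end) from a candidate span to its closest entity.
    Non-entity spans get a non-zero offset pointing at an entity, so they
    carry a positive learning signal instead of being plain negatives."""
    s, e = span
    best = min(entity_spans, key=lambda g: abs(g[0] - s) + abs(g[1] - e))
    return (best[0] - s, best[1] - e)

offsets = {sp: nearest_offset(sp) for sp in [(2, 4), (1, 4), (6, 8)]}
```

An exact entity match gets offset (0, 0), while near-miss spans get small offsets that point the model toward the true boundaries.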

HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

  • paper_url: http://arxiv.org/abs/2310.14566
  • repo_url: https://github.com/tianyi-lab/hallusionbench
  • paper_authors: Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
  • for: This work studies the failure modes of vision-language models (VLMs) built by aligning vision models with large language models (LLMs) on image reasoning tasks.
  • methods: The authors analyze two types of VLM errors: language hallucination and visual illusion.
  • results: Using the newly built HallusionBench image-context reasoning benchmark, they find that strong language priors can lead VLMs to ignore image context and rely on (even contradictory) language priors, while the comparatively weak vision modules can produce misleading visual representations.
    Abstract Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvement in image reasoning tasks. This was shown by the recently released GPT-4V(ison), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and solely rely on the (even contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than LLMs and may result in misleading visual representations, which are then translated to confident mistakes by LLMs. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that is still challenging to even GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which sheds novel insights on the illusion or hallucination of VLMs and how to improve them in the future. The benchmark and codebase will be released at https://github.com/tianyi-lab/HallusionBench.

Language Models Hallucinate, but May Excel at Fact Verification

  • paper_url: http://arxiv.org/abs/2310.14564
  • repo_url: None
  • paper_authors: Jian Guan, Jesse Dodge, David Wadden, Minlie Huang, Hao Peng
  • for: To assess the factual reliability of large language model (LLM) outputs and their potential for fact verification.
  • methods: A carefully designed human evaluation of output factuality, together with an analysis of the models' reliance on high-quality evidence and their weaknesses in robustness and generalization.
  • results: Even state-of-the-art LLMs such as GPT-3.5 produce factual outputs less than 25% of the time, underscoring the need for more reliable generation; unexpectedly, FLAN-T5-11B, the least factual generator in the study, performs best as a fact verifier, even outperforming GPT-3.5 and ChatGPT.
    Abstract Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," resulting in non-factual outputs. Our carefully designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments, at least in the Wikipedia domain. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.

NormDial: A Comparable Bilingual Synthetic Dialog Dataset for Modeling Social Norm Adherence and Violation

  • paper_url: http://arxiv.org/abs/2310.14563
  • repo_url: https://github.com/aochong-li/normdial
  • paper_authors: Oliver Li, Mallika Subramanian, Arkadiy Saakyan, Sky CH-Wang, Smaranda Muresan
  • for: To study how social norms shape interpersonal communication.
  • methods: A human-in-the-loop pipeline that prompts large language models with a small collection of expert-annotated social norms to generate high-quality dyadic dialogues, annotated turn by turn for social norm adherence and violation.
  • results: Human evaluation confirms the generated dialogues are of high quality, while existing large language models perform poorly on the norm-observance detection task, pointing to new directions for understanding how social norms manifest in conversation.
    Abstract Social norms fundamentally shape interpersonal communication. We present NormDial, a high-quality dyadic dialogue dataset with turn-by-turn annotations of social norm adherences and violations for Chinese and American cultures. Introducing the task of social norm observance detection, our dataset is synthetically generated in both Chinese and English using a human-in-the-loop pipeline by prompting large language models with a small collection of expert-annotated social norms. We show that our generated dialogues are of high quality through human evaluation and further evaluate the performance of existing large language models on this task. Our findings point towards new directions for understanding the nuances of social norms as they manifest in conversational contexts that span across languages and cultures.

The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages

  • paper_url: http://arxiv.org/abs/2310.14557
  • repo_url: None
  • paper_authors: Chiyu Zhang, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed
  • for: To investigate the ability of instruction-tuned large language models (LLMs) to understand cross-lingual sociopragmatic meaning (SM).
  • methods: Evaluates multilingual pretrained language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT) on the SPARROW benchmark via fine-tuning, zero-shot, and few-shot learning.
  • results: Existing open-source instruction-tuned LLMs still struggle to understand SM across languages, in some cases performing close to a random baseline; although ChatGPT outperforms many LLMs, it still trails task-specific fine-tuned models by 12.19 SPARROW points.
    Abstract Instruction tuned large language models (LLMs), such as ChatGPT, demonstrate remarkable performance in a wide range of tasks. Despite numerous recent studies that examine the performance of instruction-tuned LLMs on various NLP benchmarks, there remains a lack of comprehensive investigation into their ability to understand cross-lingual sociopragmatic meaning (SM), i.e., meaning embedded within social and interactive contexts. This deficiency arises partly from SM not being adequately represented in any of the existing benchmarks. To address this gap, we present SPARROW, an extensive multilingual benchmark specifically designed for SM understanding. SPARROW comprises 169 datasets covering 13 task types across six primary categories (e.g., anti-social language detection, emotion recognition). SPARROW datasets encompass 64 different languages originating from 12 language families representing 16 writing scripts. We evaluate the performance of various multilingual pretrained language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT) on SPARROW through fine-tuning, zero-shot, and/or few-shot learning. Our comprehensive analysis reveals that existing open-source instruction tuned LLMs still struggle to understand SM across various languages, performing close to a random baseline in some cases. We also find that although ChatGPT outperforms many LLMs, it still falls behind task-specific finetuned models with a gap of 12.19 SPARROW score. Our benchmark is available at: https://github.com/UBC-NLP/SPARROW

Harnessing ChatGPT for thematic analysis: Are we ready?

  • paper_url: http://arxiv.org/abs/2310.14545
  • repo_url: None
  • paper_authors: V Vien Lee, Stephanie C. C. van der Lubbe, Lay Hoon Goh, Jose M. Valderas
  • for: This paper explores the use of ChatGPT in three core phases of thematic analysis within a medical context, including direct coding of transcripts, generating themes from a predefined list of codes, and preprocessing quotes for manuscript inclusion.
  • methods: The paper uses ChatGPT, an advanced natural language processing tool, to automate the thematic analysis process.
  • results: The authors assess the strengths and limitations of using ChatGPT in thematic analysis, highlighting areas where human intervention remains necessary, and argue that ChatGPT can function as a valuable tool during analysis, enhancing efficiency and offering additional insights into the qualitative data.
    Abstract ChatGPT is an advanced natural language processing tool with growing applications across various disciplines in medical research. Thematic analysis, a qualitative research method to identify and interpret patterns in data, is one application that stands to benefit from this technology. This viewpoint explores the utilization of ChatGPT in three core phases of thematic analysis within a medical context: 1) direct coding of transcripts, 2) generating themes from a predefined list of codes, and 3) preprocessing quotes for manuscript inclusion. Additionally, we explore the potential of ChatGPT to generate interview transcripts, which may be used for training purposes. We assess the strengths and limitations of using ChatGPT in these roles, highlighting areas where human intervention remains necessary. Overall, we argue that ChatGPT can function as a valuable tool during analysis, enhancing the efficiency of the thematic analysis and offering additional insights into the qualitative data.

Evaluating Large Language Models on Controlled Generation Tasks

  • paper_url: http://arxiv.org/abs/2310.14542
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Frederick Wieting, Nanyun Peng, Xuezhe Ma
  • for: To study the controllability of large language models on controlled generation tasks.
  • methods: Compares large language models against state-of-the-art fine-tuned smaller models on various benchmarks, including a sentence-planning benchmark with different granularities.
  • results: Large language models struggle to meet fine-grained hard constraints, while on other tasks they are comparable to or exceed the ability of smaller models.
    Abstract While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual tasks, etc., there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing large language models falling behind, being comparable to, or exceeding the ability of smaller models. We conclude that **large language models struggle at meeting fine-grained hard constraints**.

Continual Named Entity Recognition without Catastrophic Forgetting

  • paper_url: http://arxiv.org/abs/2310.14541
  • repo_url: https://github.com/bladedancer957/cpfd
  • paper_authors: Duzhen Zhang, Wei Cong, Jiahua Dong, Yahan Yu, Xiuyi Chen, Yonggang Zhang, Zhen Fang
  • for: To mitigate catastrophic forgetting when updating models in continual named entity recognition (CNER).
  • methods: A pooled feature distillation loss that balances retaining knowledge of old entity types against acquiring new ones; confidence-based pseudo-labeling that uses the old model to predict the non-entity type, handling its semantic shift; and an adaptive re-weighting, type-balanced learning strategy for the biased type distribution.
  • results: Comprehensive experiments on ten CNER settings across three datasets show the method significantly outperforms prior state-of-the-art approaches, with average improvements of 6.3% Micro-F1 and 8.0% Macro-F1.
    Abstract Continual Named Entity Recognition (CNER) is a burgeoning area, which involves updating an existing model by incorporating new entity types sequentially. Nevertheless, continual learning approaches are often severely afflicted by catastrophic forgetting. This issue is intensified in CNER due to the consolidation of old entity types from previous steps into the non-entity type at each step, leading to what is known as the semantic shift problem of the non-entity type. In this paper, we introduce a pooled feature distillation loss that skillfully navigates the trade-off between retaining knowledge of old entity types and acquiring new ones, thereby more effectively mitigating the problem of catastrophic forgetting. Additionally, we develop a confidence-based pseudo-labeling for the non-entity type, \emph{i.e.,} predicting entity types using the old model to handle the semantic shift of the non-entity type. Following the pseudo-labeling process, we suggest an adaptive re-weighting type-balanced learning strategy to handle the issue of biased type distribution. We carried out comprehensive experiments on ten CNER settings using three different datasets. The results illustrate that our method significantly outperforms prior state-of-the-art approaches, registering an average improvement of $6.3$\% and $8.0$\% in Micro and Macro F1 scores, respectively.
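The pooled feature distillation idea can be illustrated with a minimal sketch. All names, the mean-pooling choice, and the toy feature values below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of a pooled feature distillation loss for CNER
# (names and the pooling choice are assumptions, not the paper's code).

def mean_pool(features):
    """Pool a [seq_len x dim] feature matrix into a single [dim] vector."""
    seq_len, dim = len(features), len(features[0])
    return [sum(tok[d] for tok in features) / seq_len for d in range(dim)]

def pooled_distillation_loss(old_features, new_features):
    """Mean-squared error between pooled features of the old (frozen)
    model and the new model, encouraging the new model to retain old
    entity-type knowledge while leaving room to learn new types."""
    p_old = mean_pool(old_features)
    p_new = mean_pool(new_features)
    return sum((a - b) ** 2 for a, b in zip(p_old, p_new)) / len(p_old)

# Toy example: two token features of dimension 2 from each model.
old = [[1.0, 0.0], [0.0, 1.0]]   # pooled -> [0.5, 0.5]
new = [[1.0, 0.0], [1.0, 0.0]]   # pooled -> [1.0, 0.0]
print(round(pooled_distillation_loss(old, new), 4))  # 0.25
```

Because pooling averages over the sequence before comparing, the new model is penalized for drifting in aggregate rather than forced to match the old model token by token.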

Improving Seq2Seq Grammatical Error Correction via Decoding Interventions

  • paper_url: http://arxiv.org/abs/2310.14534
  • repo_url: https://github.com/Jacob-Zhou/gecdi
  • paper_authors: Houquan Zhou, Yumeng Liu, Zhenghua Li, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, Fei Huang
  • for: To improve sequence-to-sequence (Seq2Seq) grammatical error correction (GEC).
  • methods: A unified decoding intervention framework in which an external critic incrementally assesses the appropriateness of the token being generated and dynamically influences the choice of the next token; two critics are explored: a pre-trained left-to-right language model and an incremental target-side grammatical error detector.
  • results: Extensive experiments on English and Chinese datasets show the framework consistently outperforms strong baselines and achieves results competitive with state-of-the-art methods.
    Abstract The sequence-to-sequence (Seq2Seq) approach has recently been widely used in grammatical error correction (GEC) and shows promising performance. However, the Seq2Seq GEC approach still suffers from two issues. First, a Seq2Seq GEC model can only be trained on parallel data, which, in GEC task, is often noisy and limited in quantity. Second, the decoder of a Seq2Seq GEC model lacks an explicit awareness of the correctness of the token being generated. In this paper, we propose a unified decoding intervention framework that employs an external critic to assess the appropriateness of the token to be generated incrementally, and then dynamically influence the choice of the next token. We discover and investigate two types of critics: a pre-trained left-to-right language model critic and an incremental target-side grammatical error detector critic. Through extensive experiments on English and Chinese datasets, our framework consistently outperforms strong baselines and achieves results competitive with state-of-the-art methods.
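A decoding intervention of this kind can be sketched as rescoring the generator's candidate next tokens with a critic. The combination rule, the weight, and the toy scores below are assumptions for illustration, not the paper's exact formulation:

```python
import math

# Hedged sketch of a decoding intervention: an external critic rescores
# the generator's candidate next tokens at each step. The critic here is
# a stand-in dictionary; the paper investigates a pre-trained LM critic
# and an incremental grammatical error detector critic.

def intervened_next_token(gen_probs, critic_scores, lam=0.5):
    """Pick the next token by combining the generator's log-probability
    with a critic score, weighted by lam."""
    best_tok, best_score = None, -math.inf
    for tok, p in gen_probs.items():
        score = math.log(p) + lam * critic_scores.get(tok, 0.0)
        if score > best_score:
            best_tok, best_score = tok, score
    return best_tok

# Toy step: the generator slightly prefers an ungrammatical continuation,
# but the critic penalizes it, flipping the choice.
gen_probs = {"go": 0.55, "goes": 0.45}
critic = {"go": -1.0, "goes": 0.5}   # critic flags "go" as an error here
print(intervened_next_token(gen_probs, critic))  # "goes"
```

With `lam=0.0` the intervention vanishes and decoding falls back to the generator's own preference, which is why the critic weight controls how strongly the external signal can override the Seq2Seq model.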

Dual-Feedback Knowledge Retrieval for Task-Oriented Dialogue Systems

  • paper_url: http://arxiv.org/abs/2310.14528
  • repo_url: None
  • paper_authors: Tianyuan Shi, Liangzhi Li, Zijian Lin, Tao Yang, Xiaojun Quan, Qifan Wang
  • for: To improve end-to-end task-oriented dialogue systems by efficiently retrieving the knowledge needed to fulfill user requests.
  • methods: A retriever-generator architecture in which a retriever fetches relevant knowledge and a generator produces the system response; since the retriever has no training labels, the generator's feedback serves as pseudo-labels, via a dual-feedback mechanism that derives both positive and negative signals from the generator's output.
  • results: Experiments on three benchmark datasets demonstrate superior performance on task-oriented dialogue tasks.
    Abstract Efficient knowledge retrieval plays a pivotal role in ensuring the success of end-to-end task-oriented dialogue systems by facilitating the selection of relevant information necessary to fulfill user requests. However, current approaches generally integrate knowledge retrieval and response generation, which poses scalability challenges when dealing with extensive knowledge bases. Taking inspiration from open-domain question answering, we propose a retriever-generator architecture that harnesses a retriever to retrieve pertinent knowledge and a generator to generate system responses. Due to the lack of retriever training labels, we propose relying on feedback from the generator as pseudo-labels to train the retriever. To achieve this, we introduce a dual-feedback mechanism that generates both positive and negative feedback based on the output of the generator. Our method demonstrates superior performance in task-oriented dialogue tasks, as evidenced by experimental results on three benchmark datasets.
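The pseudo-labeling step can be sketched as follows: score each retrieved passage by how well the generator does when conditioned on it, then mint the best-scoring passages as positives and the worst as negatives. The scoring numbers, names, and top/bottom-k selection rule are illustrative assumptions:

```python
# Sketch of the dual-feedback idea: lacking retriever labels, use the
# generator's score for the gold response, conditioned on each retrieved
# passage, to mint positive and negative pseudo-labels for retriever
# training. Scores here are toy numbers standing in for generator
# log-likelihoods.

def dual_feedback_labels(passage_scores, top_k=1, bottom_k=1):
    """Return (positives, negatives): passages whose conditioning yields
    the highest generator scores become positive pseudo-labels; the
    lowest-scoring passages become negatives."""
    ranked = sorted(passage_scores, key=passage_scores.get, reverse=True)
    return ranked[:top_k], ranked[-bottom_k:]

# Toy turn: the user asked about opening hours, so conditioning on the
# "hours" passage gives the gold response the highest likelihood.
scores = {"menu": -1.2, "hours": -0.3, "weather": -4.0}
pos, neg = dual_feedback_labels(scores)
print(pos, neg)  # ['hours'] ['weather']
```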

PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

  • paper_url: http://arxiv.org/abs/2310.18347
  • repo_url: None
  • paper_authors: Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, Jing Xiao
  • for: To improve Retrieval Question Answering (ReQA) while treating large language models (LLMs), which are often too large to fine-tune or accessible only via APIs, as black-box generators.
  • methods: A trainable Pluggable Reward-Driven Contextual Adapter (PRCA) positioned between the retriever and the generator, which refines the retrieved information in a token-autoregressive manner by maximizing rewards during a reinforcement learning phase.
  • results: Experiments show PRCA improves ReQA performance on three datasets by up to 20%, demonstrating its considerable potential for fitting black-box LLMs into existing frameworks.
    Abstract The Retrieval Question Answering (ReQA) task employs the retrieval-augmented framework, composed of a retriever and generator. The generator formulates the answer based on the documents retrieved by the retriever. Incorporating Large Language Models (LLMs) as generators is beneficial due to their advanced QA capabilities, but they are typically too large to be fine-tuned with budget constraints while some of them are only accessible via APIs. To tackle this issue and further improve ReQA performance, we propose a trainable Pluggable Reward-Driven Contextual Adapter (PRCA), keeping the generator as a black box. Positioned between the retriever and generator in a Pluggable manner, PRCA refines the retrieved information by operating in a token-autoregressive strategy via maximizing rewards of the reinforcement learning phase. Our experiments validate PRCA's effectiveness in enhancing ReQA performance on three datasets by up to 20% improvement to fit black-box LLMs into existing frameworks, demonstrating its considerable potential in the LLMs era.

Rethinking Word-Level Auto-Completion in Computer-Aided Translation

  • paper_url: http://arxiv.org/abs/2310.14523
  • repo_url: https://github.com/galaxychen/wlac-joint-training
  • paper_authors: Xingyu Chen, Lemao Liu, Guoping Huang, Zhirui Zhang, Mingming Yang, Shuming Shi, Rui Wang
  • for: To improve Word-Level Auto-Completion (WLAC) in computer-aided translation.
  • methods: Introduces a measurable criterion for what makes a good auto-completion and proposes a general approach, applicable to various encoder-based architectures, that promotes adherence to this criterion.
  • results: Experiments show the proposed approach outperforms the top-performing system submitted to the WMT2022 WLAC shared tasks while using significantly smaller models.
    Abstract Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted Translation. It aims at providing word-level auto-completion suggestions for human translators. While previous studies have primarily focused on designing complex model architectures, this paper takes a different perspective by rethinking the fundamental question: what kind of words are good auto-completions? We introduce a measurable criterion to answer this question and discover that existing WLAC models often fail to meet this criterion. Building upon this observation, we propose an effective approach to enhance WLAC performance by promoting adherence to the criterion. Notably, the proposed approach is general and can be applied to various encoder-based architectures. Through extensive experiments, we demonstrate that our approach outperforms the top-performing system submitted to the WLAC shared tasks in WMT2022, while utilizing significantly smaller model sizes.

QUDEVAL: The Evaluation of Questions Under Discussion Discourse Parsing

  • paper_url: http://arxiv.org/abs/2310.14520
  • repo_url: https://github.com/lingchensanwen/qudeval
  • paper_authors: Yating Wu, Ritika Mangla, Greg Durrett, Junyi Jessy Li
  • for: To provide the first framework for automatically evaluating Questions Under Discussion (QUD) discourse parsing.
  • methods: Instantiates the theoretical constraints of QUD in a concrete protocol and introduces QUDeval, a dataset of fine-grained evaluations of 2,190 QUD questions generated by both fine-tuned systems and LLMs.
  • results: Satisfying all QUD constraints remains challenging for modern LLMs, and existing evaluation metrics poorly approximate parser quality; human-authored QUDs, however, are scored highly by human evaluators, suggesting headroom for further progress in language modeling.
    Abstract Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses as continuously asking questions and answering them. Automatic parsing of a discourse to produce a QUD structure thus entails a complex question generation task: given a document and an answer sentence, generate a question that satisfies linguistic constraints of QUD and can be grounded in an anchor sentence in prior context. These questions are known to be curiosity-driven and open-ended. This work introduces the first framework for the automatic evaluation of QUD parsing, instantiating the theoretical constraints of QUD in a concrete protocol. We present QUDeval, a dataset of fine-grained evaluation of 2,190 QUD questions generated from both fine-tuned systems and LLMs. Using QUDeval, we show that satisfying all constraints of QUD is still challenging for modern LLMs, and that existing evaluation metrics poorly approximate parser quality. Encouragingly, human-authored QUDs are scored highly by our human evaluators, suggesting that there is headroom for further progress on language modeling to improve both QUD parsing and QUD evaluation.

Turn-Level Active Learning for Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2310.14513
  • repo_url: None
  • paper_authors: Zihan Zhang, Meng Fang, Fanghua Ye, Ling Chen, Mohammad-Reza Namazi-Rad
  • for: To reduce the cost of collecting turn-by-turn annotated data for dialogue state tracking (DST) in task-oriented dialogue systems.
  • methods: A novel turn-level active learning framework that actively selects which turns in a dialogue to annotate.
  • results: Under a limited labelling budget, selective turn annotation achieves DST performance comparable to traditional training approaches with significantly less annotated data, offering a more efficient way to annotate new dialogue data.
    Abstract Dialogue state tracking (DST) plays an important role in task-oriented dialogue systems. However, collecting a large amount of turn-by-turn annotated dialogue data is costly and inefficient. In this paper, we propose a novel turn-level active learning framework for DST to actively select turns in dialogues to annotate. Given the limited labelling budget, experimental results demonstrate the effectiveness of selective annotation of dialogue turns. Additionally, our approach can effectively achieve comparable DST performance to traditional training approaches with significantly less annotated data, which provides a more efficient way to annotate new dialogue data.
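Turn selection under a budget can be sketched with a standard active-learning heuristic: rank turns by the model's predictive uncertainty and annotate only the top few. Entropy as the uncertainty measure and the toy distributions below are assumptions; the paper's framework may use different selection criteria:

```python
import math

# Sketch of turn-level active learning for DST: rank dialogue turns by
# the model's predictive entropy over slot values and annotate only the
# most uncertain turns within the labelling budget.

def entropy(dist):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_turns(turn_distributions, budget):
    """turn_distributions: {turn_id: predicted slot-value distribution}.
    Returns the `budget` turn ids with the highest entropy, sorted."""
    ranked = sorted(turn_distributions,
                    key=lambda t: entropy(turn_distributions[t]),
                    reverse=True)
    return sorted(ranked[:budget])

turns = {
    0: [0.97, 0.03],        # confident prediction -> skip
    1: [0.5, 0.5],          # maximally uncertain  -> annotate
    2: [0.4, 0.3, 0.3],     # uncertain            -> annotate
}
print(select_turns(turns, budget=2))  # [1, 2]
```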

CITB: A Benchmark for Continual Instruction Tuning

  • paper_url: http://arxiv.org/abs/2310.14510
  • repo_url: None
  • paper_authors: Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad
  • for: To study how instruction tuning behaves in continual learning (CL), a problem formulated as Continual Instruction Tuning (CIT).
  • methods: Establishes a CIT benchmark with learning and evaluation protocols, curating two long dialogue task streams of different types, InstrDialog and InstrDialog++, to study various CL methods systematically.
  • results: Existing CL methods do not effectively leverage rich natural language instructions, and sequentially fine-tuning an instruction-tuned model can yield similar or better results.
    Abstract Continual learning (CL) is a paradigm that aims to replicate the human ability to learn and accumulate knowledge continually without forgetting previous knowledge and transferring it to new tasks. Recent instruction tuning (IT) involves fine-tuning models to make them more adaptable to solving NLP tasks in general. However, it is still uncertain how instruction tuning works in the context of CL tasks. This challenging yet practical problem is formulated as Continual Instruction Tuning (CIT). In this work, we establish a CIT benchmark consisting of learning and evaluation protocols. We curate two long dialogue task streams of different types, InstrDialog and InstrDialog++, to study various CL methods systematically. Our experiments show that existing CL methods do not effectively leverage the rich natural language instructions, and fine-tuning an instruction-tuned model sequentially can yield similar or better results. We further explore different aspects that might affect the learning of CIT. We hope this benchmark will facilitate more research in this direction.

EXPLAIN, EDIT, GENERATE: Rationale-Sensitive Counterfactual Data Augmentation for Multi-hop Fact Verification

  • paper_url: http://arxiv.org/abs/2310.14508
  • repo_url: https://github.com/aaandy-zhu/race
  • paper_authors: Yingjie Zhu, Jiasheng Si, Yibo Zhao, Haiyang Zhu, Deyu Zhou, Yulan He
  • for: To improve multi-hop fact verification, particularly on out-of-domain data, via counterfactual data augmentation.
  • methods: An Explain-Edit-Generate architecture that produces linguistically diverse, label-flipping counterfactuals while preserving logical relationships, together with checking and filtering modules that regularize the counterfactual data.
  • results: The proposed method outperforms state-of-the-art baselines and generates linguistically diverse counterfactual data without disrupting its logical relationships.
    Abstract Automatic multi-hop fact verification task has gained significant attention in recent years. Despite impressive results, these well-designed models perform poorly on out-of-domain data. One possible solution is to augment the training data with counterfactuals, which are generated by minimally altering the causal features of the original data. However, current counterfactual data augmentation techniques fail to handle multi-hop fact verification due to their incapability to preserve the complex logical relationships within multiple correlated texts. In this paper, we overcome this limitation by developing a rationale-sensitive method to generate linguistically diverse and label-flipping counterfactuals while preserving logical relationships. In specific, the diverse and fluent counterfactuals are generated via an Explain-Edit-Generate architecture. Moreover, the checking and filtering modules are proposed to regularize the counterfactual data with logical relations and flipped labels. Experimental results show that the proposed approach outperforms the SOTA baselines and can generate linguistically diverse counterfactual data without disrupting their logical relationships.

Sentiment analysis with adaptive multi-head attention in Transformer

  • paper_url: http://arxiv.org/abs/2310.14505
  • repo_url: None
  • paper_authors: Fanfei Meng, David Demeter
  • for: To identify the sentiment of movie review documents using an attention-based framework.
  • methods: An adaptive multi-head attention architecture (AdaptAttn) that varies the number of attention heads with sentence length: documents are binned as small, medium, or large and processed with two, four, or eight heads per layer, respectively.
  • results: On the Stanford large movie review dataset, the model's F1 score is on par with the baseline model.
    Abstract We propose a novel framework based on the attention mechanism to identify the sentiment of a movie review document. Previous efforts on deep neural networks with attention mechanisms focus on encoder and decoder with fixed numbers of multi-head attention. Therefore, we need a mechanism to stop the attention process automatically if no more useful information can be read from the memory. In this paper, we propose an adaptive multi-head attention architecture (AdaptAttn) which varies the number of attention heads based on length of sentences. AdaptAttn has a data preprocessing step where each document is classified into any one of the three bins small, medium or large based on length of the sentence. The document classified as small goes through two heads in each layer, the medium group passes four heads and the large group is processed by eight heads. We examine the merit of our model on the Stanford large movie review dataset. The experimental results show that the F1 score from our model is on par with the baseline model.
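The head-scheduling step described in the abstract can be sketched directly: bin each document by length, then route it through 2, 4, or 8 heads per layer. The bin thresholds below are illustrative assumptions; the actual cutoffs belong to the paper's preprocessing step:

```python
# Sketch of AdaptAttn's head scheduling: documents are binned by length
# and each bin is routed through a different number of attention heads
# per layer (2/4/8, as described in the abstract). Thresholds are
# assumed for illustration.

SMALL_MAX = 16    # assumed "small" cutoff (tokens)
MEDIUM_MAX = 48   # assumed "medium" cutoff (tokens)

def heads_for_document(num_tokens):
    """Map a document's token count to its per-layer attention-head count."""
    if num_tokens <= SMALL_MAX:
        return 2          # "small" bin
    if num_tokens <= MEDIUM_MAX:
        return 4          # "medium" bin
    return 8              # "large" bin

for n in (10, 30, 200):
    print(n, "tokens ->", heads_for_document(n), "heads per layer")
```

Since the head count is fixed per bin before training, this acts as a coarse, data-driven alternative to stopping attention dynamically within a layer.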

Diversify Question Generation with Retrieval-Augmented Style Transfer

  • paper_url: http://arxiv.org/abs/2310.14503
  • repo_url: https://github.com/gouqi666/rast
  • paper_authors: Qi Gou, Zehua Xia, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li, Nguyen Cam-Tu
  • for: To diversify question generation so that generated questions better reflect the variety of natural human expression.
  • methods: RAST, a Retrieval-Augmented Style Transfer framework that exploits the styles of diverse retrieved templates, trained with a reinforcement-learning approach that maximizes a weighted combination of a diversity reward and a consistency reward.
  • results: RAST outperforms previous diversity-driven baselines on diversity while remaining comparable in consistency.
    Abstract Given a textual passage and an answer, humans are able to ask questions with various expressions, but this ability is still challenging for most question generation (QG) systems. Existing solutions mainly focus on the internal knowledge within the given passage or the semantic word space for diverse content planning. These methods, however, have not considered the potential of external knowledge for expression diversity. To bridge this gap, we propose RAST, a framework for Retrieval-Augmented Style Transfer, where the objective is to utilize the style of diverse templates for question generation. For training RAST, we develop a novel Reinforcement Learning (RL) based approach that maximizes a weighted combination of diversity reward and consistency reward. Here, the consistency reward is computed by a Question-Answering (QA) model, whereas the diversity reward measures how much the final output mimics the retrieved template. Experimental results show that our method outperforms previous diversity-driven baselines on diversity while being comparable in terms of consistency scores. Our code is available at https://github.com/gouqi666/RAST.
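The RL objective can be sketched as a simple weighted sum of the two rewards the abstract names. The convex weighting, the `alpha` value, and the toy reward numbers are assumptions; in RAST the consistency reward comes from a QA model and the diversity reward from similarity to the retrieved template:

```python
# Sketch of RAST's training signal: a weighted combination of a
# diversity reward (how much the output follows the retrieved template's
# style) and a consistency reward (whether a QA model can still recover
# the original answer from the question). Both component rewards are
# stand-in numbers in [0, 1].

def rast_reward(diversity, consistency, alpha=0.7):
    """Weighted combination of the two rewards."""
    return alpha * diversity + (1 - alpha) * consistency

# A stylistically novel but still answerable question scores well...
print(round(rast_reward(0.9, 0.8), 3))   # 0.87
# ...while mimicking the template at the cost of answerability does not.
print(round(rast_reward(0.95, 0.1), 3))  # 0.695
```

Tuning `alpha` trades expression diversity against faithfulness to the answer, which mirrors the diversity/consistency trade-off reported in the experiments.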

Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models

  • paper_url: http://arxiv.org/abs/2310.14491
  • repo_url: https://github.com/yifan-h/mechanisticprobe
  • paper_authors: Yifan Hou, Jiaoda Li, Yu Fei, Alessandro Stolfo, Wangchunshu Zhou, Guangtao Zeng, Antoine Bosselut, Mrinmaya Sachan
  • for: This work aims to explain how language models (LMs) carry out multi-step reasoning.
  • methods: Introduces a new probing approach, MechanisticProbe, that recovers the reasoning tree from the model's attention patterns.
  • results: MechanisticProbe recovers reasoning-tree information from the attention patterns for most examples, suggesting that in many cases the LM indeed performs a multi-step reasoning process.
    Abstract Recent work has shown that language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities. However, it is unclear whether LMs perform these tasks by cheating with answers memorized from pretraining corpus, or, via a multi-step reasoning mechanism. In this paper, we try to answer this question by exploring a mechanistic interpretation of LMs for multi-step reasoning tasks. Concretely, we hypothesize that the LM implicitly embeds a reasoning tree resembling the correct reasoning process within it. We test this hypothesis by introducing a new probing approach (called MechanisticProbe) that recovers the reasoning tree from the model's attention patterns. We use our probe to analyze two LMs: GPT-2 on a synthetic task (k-th smallest element), and LLaMA on two simple language-based reasoning tasks (ProofWriter & AI2 Reasoning Challenge). We show that MechanisticProbe is able to detect the information of the reasoning tree from the model's attentions for most examples, suggesting that the LM indeed is going through a process of multi-step reasoning within its architecture in many cases.
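The probing idea — recovering which statements sit on the reasoning tree from attention patterns — can be illustrated with a toy sketch. This is not the paper's actual probe (which is learned from the model's attentions); here we simply rank candidate statements by the attention mass the final token position assigns to their token spans, an illustrative stand-in.

```python
def probe_reasoning_nodes(attention, statement_spans, k):
    """Toy probe: given a token-by-token attention matrix and a mapping
    from statement names to token spans, keep the top-k statements by
    the attention mass the last token assigns to each span. These are
    the predicted nodes of the reasoning tree."""
    final_row = attention[-1]  # attention weights from the last token
    scores = {name: sum(final_row[span]) for name, span in statement_spans.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])
```

A real probe would aggregate over heads and layers; this sketch uses a single row to show the shape of the computation.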

Text Fact Transfer

  • paper_url: http://arxiv.org/abs/2310.14486
  • repo_url: https://github.com/nbalepur/text-fact-transfer
  • paper_authors: Nishant Balepur, Jie Huang, Kevin Chen-Chuan Chang
  • for: Extends text modification beyond style control to applications such as updating past news for current events and repurposing educational materials.
  • methods: Proposes the task of text fact transfer: transferring the factual content of a source text across topics while leaving its style unchanged.
  • results: The proposed ModQGA framework accurately transfers factual content without altering the style of the source text.
    Abstract Text style transfer is a prominent task that aims to control the style of text without inherently changing its factual content. To cover more text modification applications, such as adapting past news for current events and repurposing educational materials, we propose the task of text fact transfer, which seeks to transfer the factual content of a source text between topics without modifying its style. We find that existing language models struggle with text fact transfer, due to their inability to preserve the specificity and phrasing of the source text, and tendency to hallucinate errors. To address these issues, we design ModQGA, a framework that minimally modifies a source text with a novel combination of end-to-end question generation and specificity-aware question answering. Through experiments on four existing datasets adapted for text fact transfer, we show that ModQGA can accurately transfer factual content without sacrificing the style of the source text.
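The QG-then-QA recipe behind ModQGA can be sketched in miniature: for each factual span of the source, ask a question about it, answer the question against the target topic, and splice the new fact back in so the surrounding phrasing is preserved. Everything below (the function name, the string-replacement splice, the `target_answer_fn` callback) is an illustrative simplification, not the paper's models.

```python
def text_fact_transfer(source, source_facts, target_answer_fn):
    """Minimally modify `source`: for each (factual span, question) pair,
    answer the question in the target topic and substitute the new fact,
    keeping the source text's specificity and phrasing intact."""
    result = source
    for span, question in source_facts:
        new_fact = target_answer_fn(question)  # specificity-aware QA stand-in
        result = result.replace(span, new_fact)
    return result
```

For example, swapping the mission name and year in a news-style sentence leaves every other word untouched.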

“Why Should I Review This Paper?” Unifying Semantic, Topic, and Citation Factors for Paper-Reviewer Matching

  • paper_url: http://arxiv.org/abs/2310.14483
  • repo_url: https://github.com/plubplub1/bountyfarm
  • paper_authors: Yu Zhang, Yanzhen Shen, Xiusi Chen, Bowen Jin, Jiawei Han
  • for: This paper proposes a unified model for paper-reviewer matching that jointly captures semantic, topic, and citation factors to improve the accuracy of matching reviewers with papers.
  • methods: The proposed UniPR model uses a contextualized language model backbone to learn common knowledge and introduces instruction tuning to characterize the uniqueness of each factor by producing factor-aware paper embeddings.
  • results: Experiments on four datasets across different fields consistently validate the effectiveness of the UniPR model in comparison with state-of-the-art paper-reviewer matching methods and scientific pre-trained language models.
    Abstract As many academic conferences are overwhelmed by a rapidly increasing number of paper submissions, automatically finding appropriate reviewers for each submission becomes a more urgent need than ever. Various factors have been considered by previous attempts on this task to measure the expertise relevance between a paper and a reviewer, including whether the paper is semantically close to, shares topics with, and cites previous papers of the reviewer. However, the majority of previous studies take only one of these factors into account, leading to an incomprehensive evaluation of paper-reviewer relevance. To bridge this gap, in this paper, we propose a unified model for paper-reviewer matching that jointly captures semantic, topic, and citation factors. In the unified model, a contextualized language model backbone is shared by all factors to learn common knowledge, while instruction tuning is introduced to characterize the uniqueness of each factor by producing factor-aware paper embeddings. Experiments on four datasets (one of which is newly contributed by us) across different fields, including machine learning, computer vision, information retrieval, and data mining, consistently validate the effectiveness of our proposed UniPR model in comparison with state-of-the-art paper-reviewer matching methods and scientific pre-trained language models.
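The instruction-tuning idea — one shared encoder producing factor-aware paper embeddings by prepending a per-factor instruction — can be sketched as below. The instruction strings, the cosine scoring, and the mean aggregation over factors are assumptions for illustration; UniPR's actual prompts and training procedure are described in the paper.

```python
import math

FACTOR_INSTRUCTIONS = {  # hypothetical instruction strings, one per factor
    "semantic": "Represent the paper for semantic similarity: ",
    "topic": "Represent the paper by its research topics: ",
    "citation": "Represent the paper by what it would cite: ",
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def match_score(paper_text, reviewer_profiles, encode):
    """Score a paper-reviewer pair by averaging per-factor similarities
    between factor-aware paper embeddings (shared encoder + factor
    instruction) and the reviewer's per-factor profile embeddings."""
    sims = []
    for factor, instruction in FACTOR_INSTRUCTIONS.items():
        paper_emb = encode(instruction + paper_text)
        sims.append(cosine(paper_emb, reviewer_profiles[factor]))
    return sum(sims) / len(sims)
```

The design point is that the backbone `encode` is shared across factors, so only the prepended instruction differentiates the embeddings.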

DetectGPT-SC: Improving Detection of Text Generated by Large Language Models through Self-Consistency with Masked Predictions

  • paper_url: http://arxiv.org/abs/2310.14479
  • repo_url: None
  • paper_authors: Rongsheng Wang, Qi Li, Sihong Xie
  • for: Detecting whether a text was generated by an AI model or written by a human.
  • methods: Uses self-consistency with masked predictions to test whether a text behaves like AI-generated text.
  • results: DetectGPT-SC outperforms the current state of the art across different tasks.
    Abstract General large language models (LLMs) such as ChatGPT have shown remarkable success, but they have also raised concerns about the misuse of AI-generated texts. An important question is therefore how to detect whether a text was generated by ChatGPT or by a human. Existing detectors are built on the assumption that there is a distribution gap between human-generated and AI-generated texts, and these gaps are typically identified using statistical information or classifiers. In contrast to prior methods, we find that large language models such as ChatGPT exhibit strong self-consistency in text generation and continuation. Self-consistency capitalizes on the intuition that AI-generated texts can still be reasoned about by large language models using the same logical reasoning when portions of the texts are masked, which differs from human-generated texts. Using this observation, we propose a new method for detecting AI-generated texts based on self-consistency with masked predictions, which we call DetectGPT-SC. We conducted a series of experiments to evaluate the performance of DetectGPT-SC, employing various masking schemes, zero-shot settings, and simple prompts for completing masked texts and making self-consistency predictions. The results indicate that DetectGPT-SC outperforms the current state of the art across different tasks.
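The masked self-consistency test can be sketched as: mask a fraction of the tokens, ask a language model to fill them back in, and measure agreement with the original text; high agreement is taken as evidence of machine generation. The `fill_fn` stand-in, the mask ratio, and the decision threshold below are illustrative assumptions, not the paper's configuration.

```python
import random

def self_consistency_score(tokens, fill_fn, mask_ratio=0.3, seed=0):
    """Mask a fraction of tokens, let `fill_fn` (an LLM stand-in that maps
    a masked token list to a completed one) reconstruct them, and return
    the fraction of masked positions recovered exactly."""
    rng = random.Random(seed)
    n_mask = max(1, int(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    mset = set(masked_idx)
    masked = [("[MASK]" if i in mset else t) for i, t in enumerate(tokens)]
    filled = fill_fn(masked)
    hits = sum(1 for i in masked_idx if filled[i] == tokens[i])
    return hits / len(masked_idx)

def is_ai_generated(tokens, fill_fn, threshold=0.8):
    """Decision rule; the threshold is illustrative and would be set
    empirically on held-out data."""
    return self_consistency_score(tokens, fill_fn) >= threshold
```

An oracle filler that reproduces the original tokens yields a score of 1.0 (flagged as machine-generated); a filler that never matches yields 0.0.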

GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding

  • paper_url: http://arxiv.org/abs/2310.14478
  • repo_url: https://github.com/knowledge-computing/geolm
  • paper_authors: Zekun Li, Wenxuan Zhou, Yao-Yi Chiang, Muhao Chen
  • for: This paper aims to bridge natural language processing and the geospatial sciences by exploiting large, widely available geographical databases such as OpenStreetMap.
  • methods: Proposes GeoLM, a geospatially grounded language model that improves the understanding of geo-entities. GeoLM uses geo-entity mentions as anchors to connect linguistic information in text corpora with geospatial information from geographical databases, linking the two types of context through contrastive learning and masked language modeling. It also incorporates a spatial coordinate embedding mechanism that encodes distance and direction relations to capture geospatial context.
  • results: Experiments show that GeoLM effectively supports toponym recognition, toponym linking, relation extraction, and geo-entity typing, bridging the gap between natural language processing and the geospatial sciences. The code is available at https://github.com/knowledge-computing/geolm.
    Abstract Humans subconsciously engage in geospatial reasoning when reading articles. We recognize place names and their spatial relations in text and mentally associate them with their physical locations on Earth. Although pretrained language models can mimic this cognitive process using linguistic context, they do not utilize valuable geospatial information in large, widely available geographical databases, e.g., OpenStreetMap. This paper introduces GeoLM, a geospatially grounded language model that enhances the understanding of geo-entities in natural language. GeoLM leverages geo-entity mentions as anchors to connect linguistic information in text corpora with geospatial information extracted from geographical databases. GeoLM connects the two types of context through contrastive learning and masked language modeling. It also incorporates a spatial coordinate embedding mechanism to encode distance and direction relations to capture geospatial context. In the experiment, we demonstrate that GeoLM exhibits promising capabilities in supporting toponym recognition, toponym linking, relation extraction, and geo-entity typing, which bridge the gap between natural language processing and geospatial sciences. The code is publicly available at https://github.com/knowledge-computing/geolm.
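GeoLM's spatial coordinate embedding encodes distance and direction relations between geo-entities. As a rough illustration of the raw geometric signals such a mechanism starts from (the model learns the embedding itself; this is not its implementation), one can compute great-circle distance and a unit-vector bearing from coordinate pairs:

```python
import math

def spatial_features(lon1, lat1, lon2, lat2):
    """Toy distance-and-direction features between two geo-entities:
    haversine great-circle distance in km, plus the initial bearing
    encoded as a (sin, cos) unit vector so direction is continuous."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    dist = 2 * r * math.asin(math.sqrt(a))
    theta = math.atan2(
        math.sin(dlon) * math.cos(p2),
        math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon),
    )
    return dist, (math.sin(theta), math.cos(theta))
```

Encoding bearing as (sin, cos) rather than a raw angle avoids the discontinuity at 360°/0°, a common trick when feeding directions to neural models.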