cs.CL - 2023-10-26

TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the Automatic Ordering of Events in News Articles

  • paper_url: http://arxiv.org/abs/2310.17802
  • repo_url: https://github.com/alsayyahi/timeline
  • paper_authors: Sarah Alsayyahi, Riza Batista-Navarro
  • for: This work aims to improve temporal relation extraction models, which have been hindered by problems in existing news datasets, including: (1) low inter-annotator agreement caused by annotation guidelines that do not precisely define what counts as a temporal relation; (2) the exclusion of long-distance relations within a document (i.e., relations spanning different paragraphs); and (3) the exclusion of events that are not centred on verbs.
  • methods: The paper presents a new annotation scheme that (1) clearly defines the criteria for annotating temporal relations; (2) includes events that are not expressed as verbs; and (3) uses an automated annotation method for all temporal relations, including long-distance ones, reducing annotators' time and effort.
  • results: The resulting TIMELINE corpus achieves improved inter-annotator agreement compared with previously reported temporal relation datasets; baseline temporal relation extraction models are trained and evaluated on it and compared with results on MATRES.
    Abstract Temporal relation extraction models have thus far been hindered by a number of issues in existing temporal relation-annotated news datasets, including: (1) low inter-annotator agreement due to the lack of specificity of their annotation guidelines in terms of what counts as a temporal relation; (2) the exclusion of long-distance relations within a given document (those spanning across different paragraphs); and (3) the exclusion of events that are not centred on verbs. This paper aims to alleviate these issues by presenting a new annotation scheme that clearly defines the criteria based on which temporal relations should be annotated. Additionally, the scheme includes events even if they are not expressed as verbs (e.g., nominalised events). Furthermore, we propose a method for annotating all temporal relations -- including long-distance ones -- which automates the process, hence reducing time and manual effort on the part of annotators. The result is a new dataset, the TIMELINE corpus, in which improved inter-annotator agreement was obtained, in comparison with previously reported temporal relation datasets. We report the results of training and evaluating baseline temporal relation extraction models on the new corpus, and compare them with results obtained on the widely used MATRES corpus.
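
The paper's automated annotation of long-distance relations is only summarized above; one common building block for deriving relations that span paragraphs is the transitive closure of locally annotated BEFORE links. A minimal sketch under that assumption (event names are hypothetical):

```python
from itertools import product

def close_before(relations):
    """Transitively close a set of (earlier, later) BEFORE links so that
    long-distance pairs implied by local annotations are derived."""
    closed = set(relations)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

# e1 BEFORE e2 (same paragraph), e2 BEFORE e3 (a later paragraph)
local = {("e1", "e2"), ("e2", "e3")}
print(close_before(local))  # also derives the long-distance pair ("e1", "e3")
```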

Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?

  • paper_url: http://arxiv.org/abs/2310.17774
  • repo_url: None
  • paper_authors: Sathvik Nair, Philip Resnik
  • for: This paper tests the assumption that LLMs (large language models) can be used effectively on psycholinguistic data without considering morphological information.
  • methods: The paper compares surprisal estimates using orthographic, morphological, and BPE (byte pair encoding) tokenization against reading time data to determine the impact of tokenization method on LLM predictions.
  • results: The results show that BPE-based tokenization does not result in significantly worse predictions compared to morphological and orthographic segmentation, but a finer-grained analysis reveals potential issues with relying on BPE and suggests the use of morphologically-aware surprisal estimates as an alternative method for evaluating morphological prediction.
    Abstract An important assumption that comes with using LLMs on psycholinguistic data has gone unverified. LLM-based predictions are based on subword tokenization, not decomposition of words into morphemes. Does that matter? We carefully test this by comparing surprisal estimates using orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that in the aggregate, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, as well as providing promising results involving morphologically-aware surprisal estimates and suggesting a new method for evaluating morphological prediction.
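
The comparison rests on computing a word's surprisal by summing the surprisals of its subword pieces under a given tokenization. A minimal sketch of BPE-based word surprisal with a causal LM (GPT-2 is a stand-in here, not necessarily the paper's model):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(text):
    """Word-level surprisal (in bits) under BPE: sum the surprisals of
    each word's subword pieces."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        logits = model(input_ids=ids.unsqueeze(0)).logits[0]
    logprobs = torch.log_softmax(logits[:-1], dim=-1)   # predictions for tokens 1..n-1
    nats = -logprobs[torch.arange(len(ids) - 1), ids[1:]]
    bits = nats / torch.log(torch.tensor(2.0))
    words = {}
    for pos, wid in enumerate(enc.word_ids()):          # map subwords back to words
        if wid is not None and pos > 0:                 # token 0 has no left context
            words[wid] = words.get(wid, 0.0) + bits[pos - 1].item()
    return words

print(word_surprisals("The children went outside to play"))
```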

A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

  • paper_url: http://arxiv.org/abs/2310.17750
  • repo_url: None
  • paper_authors: Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, Mei Chen
  • for: This paper proposes a framework for the automated measurement of responsible AI (RAI) metrics, used to evaluate large language models (LLMs) and associated products and services.
  • methods: The framework builds on existing technical and sociotechnical expertise and leverages state-of-the-art LLMs, such as GPT-4, to automatically measure the harms that LLMs may produce.
  • results: The framework is used to run several case studies investigating how different LLMs may violate a range of RAI-related principles. It can also be combined with domain-specific sociotechnical expertise to create measurements for new harm areas in the future.
    Abstract We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles. The framework may be employed alongside domain-specific sociotechnical expertise to create measurements for new harm areas in the future. By implementing this framework, we aim to enable more advanced harm measurement efforts and further the responsible use of LLMs.
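
The framework's own prompts and harm taxonomy are not shown here; the general pattern it builds on, using a strong LLM such as GPT-4 as an automated harm annotator, might be sketched as follows (the rubric, severity scale, and category are illustrative assumptions, not the paper's):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are an annotator measuring responsible-AI harms. "
    "Rate the RESPONSE for the harm category '{category}' on a 0-7 severity "
    "scale and answer with the number only."
)

def measure_harm(prompt, response, category="violence"):
    """Hypothetical LLM-as-annotator call: score one model output for one harm area."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC.format(category=category)},
            {"role": "user", "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}"},
        ],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```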

StyleBART: Decorate Pretrained Model with Style Adapters for Unsupervised Stylistic Headline Generation

  • paper_url: http://arxiv.org/abs/2310.17743
  • repo_url: None
  • paper_authors: Hanqing Wang, Yajing Luo, Boya Xiong, Guanhua Chen, Yun Chen
  • for: This paper addresses unsupervised stylistic headline generation: producing a headline that not only summarizes the content of an article but also reflects a style desired by users.
  • methods: The paper proposes StyleBART, an unsupervised approach that decorates a pretrained BART model with adapters responsible for different styles, so that headlines with diverse styles can be generated by simply switching adapters.
  • results: Automatic and human evaluations show that StyleBART achieves new state-of-the-art performance, producing high-quality headlines with the desired style.
    Abstract Stylistic headline generation is the task to generate a headline that not only summarizes the content of an article, but also reflects a desired style that attracts users. As style-specific article-headline pairs are scarce, previous researches focus on unsupervised approaches with a standard headline generation dataset and mono-style corpora. In this work, we follow this line and propose StyleBART, an unsupervised approach for stylistic headline generation. Our method decorates the pretrained BART model with adapters that are responsible for different styles and allows the generation of headlines with diverse styles by simply switching the adapters. Different from previous works, StyleBART separates the task of style learning and headline generation, making it possible to freely combine the base model and the style adapters during inference. We further propose an inverse paraphrasing task to enhance the style adapters. Extensive automatic and human evaluations show that StyleBART achieves new state-of-the-art performance in the unsupervised stylistic headline generation task, producing high-quality headlines with the desired style.

ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages

  • paper_url: http://arxiv.org/abs/2310.17737
  • repo_url: None
  • paper_authors: Mohammad Akbari, Saeed Ranjbar Alvar, Behnam Kamranian, Amin Banitalebi-Dehkordi, Yong Zhang
  • for: This paper proposes ArchBERT, a bi-modal language model for joint learning and understanding of neural network architectures and natural languages.
  • methods: A pre-training strategy named Masked Architecture Modeling (MAM) is introduced for more generalized joint learning, and two new bi-modal datasets are released for training and validation.
  • results: Numerical experiments show that ArchBERT performs well on downstream tasks such as architecture-oriented reasoning, question answering, and captioning (summarization).
    Abstract Building multi-modal language models has been a trend in the recent years, where additional modalities such as image, video, speech, etc. are jointly learned along with natural languages (i.e., textual information). Despite the success of these multi-modal language models with different modalities, there is no existing solution for neural network architectures and natural languages. Providing neural architectural information as a new modality allows us to provide fast architecture-2-text and text-2-architecture retrieval/generation services on the cloud with a single inference. Such solution is valuable in terms of helping beginner and intermediate ML users to come up with better neural architectures or AutoML approaches with a simple text query. In this paper, we propose ArchBERT, a bi-modal model for joint learning and understanding of neural architectures and natural languages, which opens up new avenues for research in this area. We also introduce a pre-training strategy named Masked Architecture Modeling (MAM) for a more generalized joint learning. Moreover, we introduce and publicly release two new bi-modal datasets for training and validating our methods. The ArchBERT's performance is verified through a set of numerical experiments on different downstream tasks such as architecture-oriented reasoning, question answering, and captioning (summarization). Datasets, codes, and demos are available supplementary materials.

Investigating Multilingual Coreference Resolution by Universal Annotations

  • paper_url: http://arxiv.org/abs/2310.17734
  • repo_url: https://github.com/haixiachai/multi-coref
  • paper_authors: Haixia Chai, Michael Strube
  • for: This work investigates multilingual coreference resolution (MCR) using the newly proposed multilingual coreference dataset CorefUD.
  • methods: The study first examines the ground-truth data at different linguistic levels (mention, entity and document) and across genres to gain insights into the characteristics of coreference across languages; it then performs an error analysis of the most challenging cases that the SotA system from the CRAC 2022 shared task fails to resolve; finally, based on this analysis, it extracts features from universal morphosyntactic annotations and integrates them into a baseline system to assess their potential benefits.
  • results: The best feature configuration improves the baseline system by 0.9% F1 score.
    Abstract Multilingual coreference resolution (MCR) has been a long-standing and challenging task. With the newly proposed multilingual coreference dataset, CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by using its harmonized universal morphosyntactic and coreference annotations. First, we study coreference by examining the ground truth data at different linguistic levels, namely mention, entity and document levels, and across different genres, to gain insights into the characteristics of coreference across multiple languages. Second, we perform an error analysis of the most challenging cases that the SotA system fails to resolve in the CRAC 2022 shared task using the universal annotations. Last, based on this analysis, we extract features from universal morphosyntactic annotations and integrate these features into a baseline system to assess their potential benefits for the MCR task. Our results show that our best configuration of features improves the baseline by 0.9% F1 score.

ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

  • paper_url: http://arxiv.org/abs/2310.17723
  • repo_url: None
  • paper_authors: Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He
  • for: This paper proposes a novel, fully hardware-enhanced, robust, optimized post-training W8A8 quantization framework to reduce the memory and computational demands of deep neural network inference.
  • methods: The framework extends ZeroQuant's dynamic quantization by explicitly handling both memory-bandwidth-bound and compute-intensive operators for optimal hardware performance; it also allows specific INT8 modules to switch to FP16/BF16 mode to improve accuracy.
  • results: The framework achieves better hardware performance and robustness than ZeroQuant's dynamic quantization and can adapt to different hardware environments.
    Abstract Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and compute-intensive operators, aiming for optimal hardware performance. Additionally, it offers flexibility by allowing specific INT8 modules to switch to FP16/BF16 mode, enhancing accuracy.
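
ZeroQuant-HERO's kernels are hardware-specific, but the W8A8 arithmetic they implement, symmetric INT8 quantization of weights and activations with INT32 accumulation, can be illustrated in plain PyTorch (a numerical sketch, not the framework's implementation):

```python
import torch

def quantize_sym_int8(x):
    """Symmetric per-tensor INT8 quantization: x ≈ q * scale, q in [-127, 127]."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x, w):
    """W8A8 matmul: quantize activations and weights to INT8, accumulate in INT32,
    then dequantize with the combined scale."""
    qx, sx = quantize_sym_int8(x)
    qw, sw = quantize_sym_int8(w)
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)   # INT8 x INT8 -> INT32 accumulate
    return acc.to(torch.float32) * (sx * sw)            # dequantize

x, w = torch.randn(4, 64), torch.randn(32, 64)
err = (w8a8_linear(x, w) - x @ w.t()).abs().max()
print(f"max abs error vs FP32: {err.item():.4f}")
```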

Nearest Neighbor Search over Vectorized Lexico-Syntactic Patterns for Relation Extraction from Financial Documents

  • paper_url: http://arxiv.org/abs/2310.17714
  • repo_url: https://github.com/pawan2411/pan-dl_refind
  • paper_authors: Pawan Kumar Rajpoot, Ankur Parikh
  • for: To improve the handling of implicit expressions and long-tail relation classes, and to make relation extraction (RE) accessible to users who lack direct access to large language models (LLMs) or infrastructure for supervised training and fine-tuning.
  • methods: At test time, the approach consults training relations through a nearest-neighbor search over dense vectors of lexico-syntactic patterns, a simple yet effective means of tackling language complexity and data sparsity.
  • results: Evaluated on REFinD, the method achieves state-of-the-art performance and provides a good starting point for human-in-the-loop setups.
    Abstract Relation extraction (RE) has achieved remarkable progress with the help of pre-trained language models. However, existing RE models are usually incapable of handling two situations: implicit expressions and long-tail relation classes, caused by language complexity and data sparsity. Further, these approaches and models are largely inaccessible to users who don't have direct access to large language models (LLMs) and/or infrastructure for supervised training or fine-tuning. Rule-based systems also struggle with implicit expressions. Apart from this, Real world financial documents such as various 10-X reports (including 10-K, 10-Q, etc.) of publicly traded companies pose another challenge to rule-based systems in terms of longer and complex sentences. In this paper, we introduce a simple approach that consults training relations at test time through a nearest-neighbor search over dense vectors of lexico-syntactic patterns and provides a simple yet effective means to tackle the above issues. We evaluate our approach on REFinD and show that our method achieves state-of-the-art performance. We further show that it can provide a good start for human in the loop setup when a small number of annotations are available and it is also beneficial when domain experts can provide high quality patterns.
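
The core mechanism, indexing dense vectors of lexico-syntactic patterns from the training set and labeling a test instance by its nearest training relations, can be sketched as follows (the encoder choice and pattern format are illustrative assumptions):

```python
import numpy as np
from collections import Counter
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in pattern encoder

# Training lexico-syntactic patterns paired with relation labels (toy examples)
train_patterns = ["ORG acquired ORG for MONEY", "PER was appointed CEO of ORG"]
train_labels = ["acquired_by", "employee_of"]
index = encoder.encode(train_patterns, normalize_embeddings=True)

def predict(test_pattern, k=1):
    """Label a test pattern by majority vote over its k nearest training patterns."""
    q = encoder.encode([test_pattern], normalize_embeddings=True)[0]
    sims = index @ q                       # cosine similarity (vectors are normalized)
    top = np.argsort(-sims)[:k]
    return Counter(train_labels[i] for i in top).most_common(1)[0][0]

print(predict("ORG bought ORG in a MONEY deal"))  # -> "acquired_by"
```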

Is Explanation the Cure? Misinformation Mitigation in the Short Term and Long Term

  • paper_url: http://arxiv.org/abs/2310.17711
  • repo_url: None
  • paper_authors: Yi-Li Hsu, Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku
  • for: This study tests whether automatically generated explanations help people combat fake news, compared with warning labels.
  • methods: The study uses counterfactual explanations generated by GPT-4 to debunk misinformation.
  • results: Both interventions significantly and equivalently decreased participants' self-reported belief in fake claims, in both the short term and the long term.
    Abstract With advancements in natural language processing (NLP) models, automatic explanation generation has been proposed to mitigate misinformation on social media platforms in addition to adding warning labels to identified fake news. While many researchers have focused on generating good explanations, how these explanations can really help humans combat fake news is under-explored. In this study, we compare the effectiveness of a warning label and the state-of-the-art counterfactual explanations generated by GPT-4 in debunking misinformation. In a two-wave, online human-subject study, participants (N = 215) were randomly assigned to a control group in which false contents are shown without any intervention, a warning tag group in which the false claims were labeled, or an explanation group in which the false contents were accompanied by GPT-4 generated explanations. Our results show that both interventions significantly decrease participants' self-reported belief in fake claims in an equivalent manner for the short-term and long-term. We discuss the implications of our findings and directions for future NLP-based misinformation debunking strategies.

The impact of using an AI chatbot to respond to patient messages

  • paper_url: http://arxiv.org/abs/2310.17703
  • repo_url: https://github.com/aim-harvard/oncqa
  • paper_authors: Shan Chen, Marco Guevara, Shalini Moningi, Frank Hoebers, Hesham Elhalawani, Benjamin H. Kann, Fallon E. Chipidza, Jonathan Leeman, Hugo J. W. L. Aerts, Timothy Miller, Guergana K. Savova, Raymond H. Mak, Maryam Lustberg, Majid Afshar, Danielle S. Bitterman
  • for: This study tests whether an AI chatbot (ChatGPT) can reduce clinician documentation burden, and examines the utility and impact of such systems on clinical decision-making.
  • methods: In a two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages reflecting common medical situations, first manually and then with AI assistance.
  • results: AI-assisted responses were longer and less readable than manual ones, but provided acceptable drafts without edits 58% of the time. AI assistance improved efficiency 77% of the time, with low harm risk (82% safe); however, 7.7% of unedited AI responses could severely harm. In 31% of cases, physicians thought the AI drafts were human-written. AI-assisted responses contained more patient education recommendations and fewer clinical actions than manual responses, suggesting AI can improve clinician efficiency and patient care if used judiciously.
    Abstract Documentation burden is a major contributor to clinician burnout, which is rising nationally and is an urgent threat to our ability to care for patients. Artificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician burden by assisting with documentation. Although many hospitals are actively integrating such systems into electronic medical record systems, AI chatbots utility and impact on clinical decision-making have not been studied for this intended use. We are the first to examine the utility of large language models in assisting clinicians draft responses to patient questions. In our two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages developed to reflect common medical situations, first manually, then with AI assistance. We find AI-assisted responses were longer, less readable, but provided acceptable drafts without edits 58% of time. AI assistance improved efficiency 77% of time, with low harm risk (82% safe). However, 7.7% unedited AI responses could severely harm. In 31% cases, physicians thought AI drafts were human-written. AI assistance led to more patient education recommendations, fewer clinical actions than manual responses. Results show promise for AI to improve clinician efficiency and patient care through assisting documentation, if used judiciously. Monitoring model outputs and human-AI interaction remains crucial for safe implementation.

Non-contrastive sentence representations via self-supervision

  • paper_url: http://arxiv.org/abs/2310.17690
  • repo_url: None
  • paper_authors: Marco Farina, Duccio Pappadopulo
  • for: Unsupervised learning of text and sentence embeddings.
  • methods: Self-supervised embeddings are trained with dimension contrastive objectives and compared against the standard sample contrastive baseline, SimCSE.
  • results: Without auxiliary loss functions, dimension contrastive embeddings can outperform SimCSE on downstream tasks.
    Abstract Sample contrastive methods, typically referred to simply as contrastive are the foundation of most unsupervised methods to learn text and sentence embeddings. On the other hand, a different class of self-supervised loss functions and methods have been considered in the computer vision community and referred to as dimension contrastive. In this paper, we thoroughly compare this class of methods with the standard baseline for contrastive sentence embeddings, SimCSE. We find that self-supervised embeddings trained using dimension contrastive objectives can outperform SimCSE on downstream tasks without needing auxiliary loss functions.
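
The dimension contrastive objectives in question (e.g., Barlow-Twins- or VICReg-style losses from computer vision) decorrelate embedding dimensions rather than contrasting samples. A minimal Barlow-Twins-style sketch over two views of a batch of sentence embeddings (hyperparameters are illustrative):

```python
import torch

def dimension_contrastive_loss(z1, z2, lam=0.005):
    """Barlow-Twins-style loss: push the cross-correlation matrix of the two
    views toward the identity -- no negative pairs needed."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / n                                # d x d cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()       # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag

z1, z2 = torch.randn(128, 256), torch.randn(128, 256)   # two augmented views
print(dimension_contrastive_loss(z1, z2))
```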

torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free Deep Learning Studies: A Case Study on NLP

  • paper_url: http://arxiv.org/abs/2310.17644
  • repo_url: https://github.com/yoshitomo-matsubara/torchdistill
  • paper_authors: Yoshitomo Matsubara
  • for: This work aims to improve reproducibility in scientific research by upgrading a deep learning framework to keep pace with rapidly developing research domains, supporting more tasks and third-party libraries.
  • methods: The module-driven, coding-free deep learning framework torchdistill is significantly upgraded to support more tasks. To demonstrate this, a script based on the upgraded torchdistill, harmonizing with various Hugging Face libraries, reproduces the GLUE benchmark results of BERT models.
  • results: All 27 fine-tuned BERT models and the configurations to reproduce the results are published at Hugging Face. Popular small-sized models and new knowledge distillation methods are also reimplemented, with additional experiments on computer vision tasks.
    Abstract Reproducibility in scientific work has been becoming increasingly important in research communities such as machine learning, natural language processing, and computer vision communities due to the rapid development of the research domains supported by recent advances in deep learning. In this work, we present a significantly upgraded version of torchdistill, a modular-driven coding-free deep learning framework significantly upgraded from the initial release, which supports only image classification and object detection tasks for reproducible knowledge distillation experiments. To demonstrate that the upgraded framework can support more tasks with third-party libraries, we reproduce the GLUE benchmark results of BERT models using a script based on the upgraded torchdistill, harmonizing with various Hugging Face libraries. All the 27 fine-tuned BERT models and configurations to reproduce the results are published at Hugging Face, and the model weights have already been widely used in research communities. We also reimplement popular small-sized models and new knowledge distillation methods and perform additional experiments for computer vision tasks.
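
torchdistill drives experiments from declarative configuration files rather than code; for reference, the standard knowledge distillation loss that such configurations typically wire up looks like the following (a generic sketch, not torchdistill's internals):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: soft KL term against the teacher plus the
    usual hard-label cross-entropy, mixed by alpha."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                      # rescale gradients (Hinton et al., 2015)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```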

InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators

  • paper_url: http://arxiv.org/abs/2310.17630
  • repo_url: https://github.com/yangheng95/instoptima
  • paper_authors: Heng Yang, Ke Li
  • for: Improving instruction engineering efficiency to advance the field of instruction studies
  • methods: Using an evolutionary multi-objective optimization approach with a large language model to simulate instruction operators, and introducing an objective-guided mechanism to enhance the quality of generated instructions
  • results: Experimental results show improved fine-tuning performance and the generation of a diverse set of high-quality instructions.
    Abstract Instruction-based language modeling has received significant attention in pretrained language models. However, the efficiency of instruction engineering remains low and hinders the development of instruction studies. Recent studies have focused on automating instruction generation, but they primarily aim to improve performance without considering other crucial objectives that impact instruction quality, such as instruction length and perplexity. Therefore, we propose a novel approach (i.e., InstOptima) that treats instruction generation as an evolutionary multi-objective optimization problem. In contrast to text edition-based methods, our approach utilizes a large language model (LLM) to simulate instruction operators, including mutation and crossover. Furthermore, we introduce an objective-guided mechanism for these operators, allowing the LLM to comprehend the objectives and enhance the quality of the generated instructions. Experimental results demonstrate improved fine-tuning performance and the generation of a diverse set of high-quality instructions.
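
InstOptima's mutation and crossover operators are implemented by prompting an LLM, which is not reproduced here; the multi-objective selection step they feed can be sketched independently. Below, a simple non-dominated (Pareto) filter over candidate instructions scored on illustrative objectives, all minimized (the scores and texts are made up):

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better
    on at least one. Objectives here: (task loss, length, perplexity)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population):
    """Keep candidates not dominated by any other candidate."""
    return [
        (inst, scores) for inst, scores in population
        if not any(dominates(other, scores) for _, other in population)
    ]

candidates = [
    ("Classify the sentiment of the review.",           (0.31, 7, 12.0)),
    ("Please carefully classify sentiment, thank you.", (0.33, 9, 15.5)),
    ("Sentiment?",                                      (0.45, 1,  9.8)),
]
for inst, scores in pareto_front(candidates):
    print(scores, inst)  # the dominated second candidate is filtered out
```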

Proving Test Set Contamination in Black Box Language Models

  • paper_url: http://arxiv.org/abs/2310.17623
  • repo_url: None
  • paper_authors: Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto
  • for: This paper aims to provide provable guarantees of whether a language model has been contaminated by benchmark test data.
  • methods: The method rests on exchangeability: without contamination, all orderings of an exchangeable benchmark are equally likely, so the likelihood of the canonically ordered dataset is compared with its likelihood after shuffling the examples.
  • results: Experiments show the procedure reliably detects test set contamination, even for models as small as 1.4 billion parameters and test sets of only 1000 examples.
    Abstract Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination.
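
The test itself reduces to a permutation test: compare the model's log-likelihood of the benchmark in canonical order with its log-likelihood under random shufflings of the examples. A minimal sketch (the `model.score` log-probability function is a stand-in, not the paper's exact statistic):

```python
import random

def logprob(model, examples):
    """Stand-in: total log-probability the model assigns to the examples
    concatenated in the given order (e.g., summed token log-probs)."""
    return model.score("\n\n".join(examples))

def contamination_p_value(model, examples, n_perm=99, seed=0):
    """If the model never saw the benchmark, the canonical ordering of an
    exchangeable dataset should not be more likely than shuffled orderings."""
    rng = random.Random(seed)
    canonical = logprob(model, examples)
    hits = 0
    for _ in range(n_perm):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if logprob(model, shuffled) >= canonical:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # small p => canonical order suspiciously likely
```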

Uncovering Meanings of Embeddings via Partial Orthogonality

  • paper_url: http://arxiv.org/abs/2310.17611
  • repo_url: None
  • paper_authors: Yibo Jiang, Bryon Aragam, Victor Veitch
  • for: This paper studies how the semantic structure of language is encoded in the algebraic structure of numerical embeddings.
  • methods: The authors use the notion of partial orthogonality to capture semantic independence.
  • results: They show that partial orthogonality does capture semantic independence, and introduce independence-preserving embeddings, proving that such embeddings and approximations to them exist.
    Abstract Machine learning tools often rely on embedding text as vectors of real numbers. In this paper, we study how the semantic structure of language is encoded in the algebraic structure of such embeddings. Specifically, we look at a notion of ``semantic independence'' capturing the idea that, e.g., ``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such examples are intuitive, it is difficult to formalize such a notion of semantic independence. The key observation here is that any sensible formalization should obey a set of so-called independence axioms, and thus any algebraic encoding of this structure should also obey these axioms. This leads us naturally to use partial orthogonality as the relevant algebraic structure. We develop theory and methods that allow us to demonstrate that partial orthogonality does indeed capture semantic independence. Complementary to this, we also introduce the concept of independence preserving embeddings where embeddings preserve the conditional independence structures of a distribution, and we prove the existence of such embeddings and approximations to them.
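
Partial orthogonality has a direct numerical reading: two embedding vectors are partially orthogonal given a conditioning set if, after projecting out the span of the conditioning vectors, their residuals are orthogonal. A toy sketch of that check (the vectors are illustrative, not the paper's construction):

```python
import numpy as np

def residual(v, Z):
    """Residual of v after projecting onto the span of the columns of Z."""
    proj, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ proj

def partially_orthogonal(u, v, Z, tol=1e-8):
    """u ⟂ v given Z: the residuals of u and v w.r.t. span(Z) are orthogonal."""
    return abs(residual(u, Z) @ residual(v, Z)) < tol

# Two vectors that overlap only through a shared 'context' direction z
z = np.array([1.0, 0.0, 0.0])
u = z + np.array([0.0, 1.0, 0.0])   # "eggplant" = vegetable + own direction
v = z + np.array([0.0, 0.0, 1.0])   # "tomato"   = vegetable + own direction
print(partially_orthogonal(u, v, z.reshape(-1, 1)))  # True
```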

LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset

  • paper_url: http://arxiv.org/abs/2310.17609
  • repo_url: None
  • paper_authors: Haitao Li, Yunqiu Shao, Yueyue Wu, Qingyao Ai, Yixiao Ma, Yiqun Liu
  • for: LeCaRDv2 is a large-scale Legal Case Retrieval Dataset (version 2) that aims to alleviate the limitations of existing datasets in the Chinese legal system, such as limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies.
  • methods: LeCaRDv2 consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. It enriches the existing relevance criteria by considering three key aspects: characterization, penalty, and procedure. Additionally, a two-level candidate set pooling strategy is proposed to effectively identify potential candidates for each query case.
  • results: The dataset has been annotated by multiple legal experts specializing in criminal law, ensuring the accuracy and reliability of the annotations. An evaluation of several state-of-the-art retrieval models on LeCaRDv2 demonstrates that there is still significant room for improvement in legal case retrieval.
    Abstract As an important component of intelligent legal systems, legal case retrieval plays a critical role in ensuring judicial justice and fairness. However, the development of legal case retrieval technologies in the Chinese legal system is restricted by three problems in existing datasets: limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies used in data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. To the best of our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval datasets, providing extensive coverage of criminal charges. Additionally, we enrich the existing relevance criteria by considering three key aspects: characterization, penalty, procedure. This comprehensive criteria enriches the dataset and may provides a more holistic perspective. Furthermore, we propose a two-level candidate set pooling strategy that effectively identify potential candidates for each query case. It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law. Their expertise ensures the accuracy and reliability of the annotations. We evaluate several state-of-the-art retrieval models at LeCaRDv2, demonstrating that there is still significant room for improvement in legal case retrieval. The details of LeCaRDv2 can be found at the anonymous website https://github.com/anonymous1113243/LeCaRDv2.

Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways

  • paper_url: http://arxiv.org/abs/2310.17591
  • repo_url: https://github.com/venkatasg/lil-bevo
  • paper_authors: Venkata S Govindarajan, Juan Diego Rodriguez, Kaj Bostrom, Kyle Mahowald
  • for: This work explores strategies for pretraining language models on small amounts of data in more humanlike ways.
  • methods: Three ingredients are used: initial pretraining with music data, training on shorter sequences before longer ones, and masking specific tokens to target some of the BLiMP subtasks.
  • results: The resulting models perform above chance but far below larger LLMs trained on more data. Training on short sequences performed better than training on longer ones, and pretraining on music may help performance marginally, though any such effect appears small.
    Abstract We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences.Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance more than the modest gains here. Our code is available at https://github.com/venkatasg/Lil-Bevo and out models at https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a

PAC-tuning:Fine-tuning Pretrained Language Models with PAC-driven Perturbed Gradient Descent

  • paper_url: http://arxiv.org/abs/2310.17588
  • repo_url: None
  • paper_authors: Guangliang Liu, Zhiyu Xue, Xitong Zhang, Kristen Marie Johnson, Rongrong Wang
  • for: To improve the generalization of pretrained language models (PLMs) on downstream tasks, particularly in few-shot learning settings.
  • methods: A two-stage fine-tuning method, PAC-tuning, is proposed. First, based on PAC-Bayes training, it directly minimizes the PAC-Bayes generalization bound to learn a proper parameter distribution. Second, it modifies the gradient by injecting noise, with the variance learned in the first stage, into the model parameters during training, yielding a variant of perturbed gradient descent (PGD).
  • results: Experiments on 5 GLUE benchmark tasks show that PAC-tuning successfully handles the challenges of fine-tuning and outperforms strong baselines by a visible margin, confirming the potential of applying PAC training wherever the Adam optimizer is currently used.
    Abstract Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale optimization problem, in which the choice of the training algorithm critically determines how well the trained model can generalize to unseen test data, especially in the context of few-shot learning. To achieve good generalization performance and avoid overfitting, techniques such as data augmentation and pruning are often applied. However, adding these regularizations necessitates heavy tuning of the hyperparameters of optimization algorithms, such as the popular Adam optimizer. In this paper, we propose a two-stage fine-tuning method, PAC-tuning, to address this optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly minimizes the PAC-Bayes generalization bound to learn proper parameter distribution. Second, PAC-tuning modifies the gradient by injecting noise with the variance learned in the first stage into the model parameters during training, resulting in a variant of perturbed gradient descent (PGD). In the past, the few-shot scenario posed difficulties for PAC-Bayes training because the PAC-Bayes bound, when applied to large models with limited training data, might not be stringent. Our experimental results across 5 GLUE benchmark tasks demonstrate that PAC-tuning successfully handles the challenges of fine-tuning tasks and outperforms strong baseline methods by a visible margin, further confirming the potential to apply PAC training for any other settings where the Adam optimizer is currently used for training.
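
The second stage described above amounts to perturbed gradient descent: Gaussian noise, with the per-parameter variance learned in stage one, is injected into the weights before each gradient evaluation. A simplified PyTorch sketch (the variance-learning stage is assumed to have produced `sigma`):

```python
import torch

def pgd_step(model, sigma, loss_fn, batch, optimizer):
    """One perturbed-gradient-descent step: evaluate the loss at
    noise-perturbed weights, then update the clean weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    noises = [torch.randn_like(p) * s for p, s in zip(params, sigma)]
    with torch.no_grad():
        for p, n in zip(params, noises):
            p.add_(n)                      # perturb: w + eps, eps ~ N(0, sigma^2)
    loss = loss_fn(model, batch)
    optimizer.zero_grad()
    loss.backward()                        # gradient at the perturbed point
    with torch.no_grad():
        for p, n in zip(params, noises):
            p.sub_(n)                      # restore the clean weights
    optimizer.step()                       # apply the perturbed gradient
    return loss.item()
```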

Global Voices, Local Biases: Socio-Cultural Prejudices across Languages

  • paper_url: http://arxiv.org/abs/2310.17586
  • repo_url: https://github.com/iamshnoo/weathub
  • paper_authors: Anjishnu Mukherjee, Chahat Raj, Ziwei Zhu, Antonios Anastasopoulos
  • for: This paper investigates how language models (LMs) reflect and amplify socio-cultural biases.
  • methods: The authors scale the Word Embedding Association Test (WEAT) to 24 languages and enhance the data with culturally relevant information for each language, capturing local contexts on a global scale.
  • results: LMs exhibit widespread social biases across languages and cultures, including new bias dimensions such as toxicity and ableism. A comprehensive regional bias analysis across six prevalent Indian languages, together with an extensive comparison of embedding methods, underscores the need for more equitable language models.
    Abstract Human biases are ubiquitous but not uniform: disparities exist across linguistic, cultural, and societal borders. As large amounts of recent literature suggest, language models (LMs) trained on human data can reflect and often amplify the effects of these social biases. However, the vast majority of existing studies on bias are heavily skewed towards Western and European languages. In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies and yielding interesting findings about LM bias. We additionally enhance this data with culturally relevant information for each language, capturing local contexts on a global scale. Further, to encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more. Moreover, we delve deeper into the Indian linguistic landscape, conducting a comprehensive regional bias analysis across six prevalent Indian languages. Finally, we highlight the significance of these social biases and the new dimensions through an extensive comparison of embedding methods, reinforcing the need to address them in pursuit of more equitable language models. All code, data and results are available here: https://github.com/iamshnoo/weathub.
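
The WEAT statistic that the study scales to 24 languages is the standardized difference in mean association of two target word sets (X, Y) with two attribute sets (A, B). A compact NumPy sketch, with `emb` a word-to-vector lookup supplied by whichever model is under test:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def assoc(w, A, B, emb):
    """s(w, A, B): mean cosine with attributes A minus mean cosine with B."""
    return (np.mean([cos(emb[w], emb[a]) for a in A])
            - np.mean([cos(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Cohen's-d-style WEAT effect size (Caliskan et al., 2017)."""
    sx = [assoc(x, A, B, emb) for x in X]
    sy = [assoc(y, A, B, emb) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)
```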

1D-Touch: NLP-Assisted Coarse Text Selection via a Semi-Direct Gesture

  • paper_url: http://arxiv.org/abs/2310.17576
  • repo_url: None
  • paper_authors: Peiling Jiang, Li Feng, Fuling Sun, Parakrant Sarkar, Haijun Xia, Can Liu
  • for: To improve the precision and efficiency of text selection on touchscreens, particularly at the word and phrase levels.
  • methods: The paper introduces 1D-Touch, which uses a simple vertical slide gesture to expand and contract a selection area from a word up to semantic chunks ranging from sub-phrases to sentences.
  • results: On coarse-grained text selection tasks, 1D-Touch improves selection accuracy by 20% over the default word-snapping method on Android.
    Abstract Existing text selection techniques on touchscreen focus on improving the control for moving the carets. Coarse-grained text selection on word and phrase levels has not received much support beyond word-snapping and entity recognition. We introduce 1D-Touch, a novel text selection method that complements the carets-based sub-word selection by facilitating the selection of semantic units of words and above. This method employs a simple vertical slide gesture to expand and contract a selection area from a word. The expansion can be by words or by semantic chunks ranging from sub-phrases to sentences. This technique shifts the concept of text selection, from defining a range by locating the first and last words, towards a dynamic process of expanding and contracting a textual semantic entity. To understand the effects of our approach, we prototyped and tested two variants: WordTouch, which offers a straightforward word-by-word expansion, and ChunkTouch, which leverages NLP to chunk text into syntactic units, allowing the selection to grow by semantically meaningful units in response to the sliding gesture. Our evaluation, focused on the coarse-grained selection tasks handled by 1D-Touch, shows a 20% improvement over the default word-snapping selection method on Android.

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

  • paper_url: http://arxiv.org/abs/2310.17570
  • repo_url: None
  • paper_authors: Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu
  • for: This paper focuses on how to efficiently and effectively apply diffusion generative models to speech generation, in particular textless direct speech-to-speech translation.
  • methods: A novel diffusion model is proposed that applies the diffusion forward process in the continuous speech representation space while applying the diffusion backward process in the discrete speech unit space, preserving the semantic structure of the continuous space and integrating continuous and discrete diffusion models.
  • results: The proposed method achieves results comparable to computationally intensive auto-regressive baselines (500 decoding steps on average) with significantly fewer decoding steps (50).
    Abstract While Diffusion Generative Models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation especially translation tasks remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, it is not optimal to brutally apply discrete diffusion on the speech unit sequence while disregarding the continuous space structure, which will degrade the generation performance significantly. In this paper, we propose a novel diffusion model by applying the diffusion forward process in the \textit{continuous} speech representation space, while employing the diffusion backward process in the \textit{discrete} speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models. We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves comparable results to the computationally intensive auto-regressive baselines (500 steps on average) with significantly fewer decoding steps (50 steps).
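
The forward half of the proposed model is the standard continuous diffusion corruption applied to speech representations, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the paper's contribution is pairing it with a backward process over discrete units, which is not sketched here. A minimal sketch of the forward step (schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (illustrative)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta_t)

def q_sample(x0, t):
    """Forward diffusion in the continuous representation space:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise

x0 = torch.randn(1, 80, 100)    # e.g., a continuous speech representation
x_mid = q_sample(x0, t=500)     # partially noised sample
```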

  • paper_url: http://arxiv.org/abs/2310.17568
  • repo_url: None
  • paper_authors: Stephanie M. Lukin, Kimberly A. Pollard, Claire Bonial, Taylor Hudson, Ron Arstein, Clare Voss, David Traum
  • for: This study explores how humans and a robotic partner jointly investigate a remote location, and how the use of different modalities relates to exploration success.
  • methods: Participants issued spoken natural language instructions and received from the robot text-based feedback, continuous 2D LIDAR mapping, and upon-request static photographs.
  • results: Participants adopted different strategies in their use of the modalities, which may be correlated with success at several exploration sub-tasks. Requesting photos may have improved the identification and counting of some key entities (doorways in particular) without hindering overall area exploration.
    Abstract Human-guided robotic exploration is a useful approach to gathering information at remote locations, especially those that might be too risky, inhospitable, or inaccessible for humans. Maintaining common ground between the remotely-located partners is a challenge, one that can be facilitated by multi-modal communication. In this paper, we explore how participants utilized multiple modalities to investigate a remote location with the help of a robotic partner. Participants issued spoken natural language instructions and received from the robot: text-based feedback, continuous 2D LIDAR mapping, and upon-request static photographs. We noticed that different strategies were adopted in terms of use of the modalities, and hypothesize that these differences may be correlated with success at several exploration sub-tasks. We found that requesting photos may have improved the identification and counting of some key entities (doorways in particular) and that this strategy did not hinder the amount of overall area exploration. Future work with larger samples may reveal the effects of more nuanced photo and dialogue strategies, which can inform the training of robotic agents. Additionally, we announce the release of our unique multi-modal corpus of human-robot communication in an exploration context: SCOUT, the Situated Corpus on Understanding Transactions.

Towards Matching Phones and Speech Representations

  • paper_url: http://arxiv.org/abs/2310.17558
  • repo_url: None
  • paper_authors: Gene-Ping Yang, Hao Tang
  • for: This work revisits the problem of learning phone types from phone instances in the context of self-supervised learning.
  • methods: The problem is posed as matching cluster centroids of self-supervised representations to phone embeddings.
  • results: Experiments show that the matching result captures the relationship among phones. Training the proposed loss function jointly with regular self-supervised losses, such as APC and CPC, significantly improves downstream phone classification.
    Abstract Learning phone types from phone instances has been a long-standing problem, while still being open. In this work, we revisit this problem in the context of self-supervised learning, and pose it as the problem of matching cluster centroids to phone embeddings. We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variability of phone instances and respect the relationship among phones. We then use the matching result to produce pseudo-labels and introduce a new loss function for improving self-supervised representations. Our experiments show that the matching result captures the relationship among phones. Training the new loss function jointly with the regular self-supervised losses, such as APC and CPC, significantly improves the downstream phone classification.
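
Matching cluster centroids to phone embeddings is a one-to-one assignment problem; one standard way to solve it (not necessarily the paper's exact procedure) is the Hungarian algorithm over a pairwise distance matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_centroids_to_phones(centroids, phone_embeddings):
    """One-to-one matching that minimizes total centroid-to-phone distance.
    centroids: (k, d) array; phone_embeddings: (k, d) array."""
    diff = centroids[:, None, :] - phone_embeddings[None, :, :]
    cost = np.linalg.norm(diff, axis=-1)          # (k, k) pairwise distances
    rows, cols = linear_sum_assignment(cost)      # Hungarian algorithm
    return dict(zip(rows, cols))                  # cluster id -> phone id

rng = np.random.default_rng(0)
phones = rng.normal(size=(5, 16))
centroids = phones[[2, 0, 4, 1, 3]] + 0.01 * rng.normal(size=(5, 16))
print(match_centroids_to_phones(centroids, phones))  # recovers the permutation
```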

Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models

  • paper_url: http://arxiv.org/abs/2310.17530
  • repo_url: https://github.com/coastalcph/gender-neutral-vl
  • paper_authors: Laura Cabello, Emanuele Bugliarello, Stephanie Brandl, Desmond Elliott
  • for: This study investigates how gender bias embedded in pretrained vision-and-language models affects model performance, and how different pretraining and fine-tuning choices can mitigate it.
  • methods: Bias amplification is quantified in pretraining and after fine-tuning across three families of vision-and-language models.
  • results: Bias amplification in pretraining and after fine-tuning are found to be independent, and continued pretraining on gender-neutral data reduces group disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without significantly compromising task performance.
    Abstract Pretrained machine learning models are known to perpetuate and even amplify existing biases in data, which can result in unfair outcomes that ultimately impact user experience. Therefore, it is crucial to understand the mechanisms behind those prejudicial biases to ensure that model performance does not result in discriminatory behaviour toward certain groups or populations. In this work, we define gender bias as our case study. We quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models. We investigate the connection, if any, between the two learning stages, and evaluate how bias amplification reflects on model performance. Overall, we find that bias amplification in pretraining and after fine-tuning are independent. We then examine the effect of continued pretraining on gender-neutral data, finding that this reduces group disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without significantly compromising task performance.

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks

  • paper_url: http://arxiv.org/abs/2310.17514
  • repo_url: https://github.com/facebookresearch/compositionalityvalidity
  • paper_authors: Kaiser Sun, Adina Williams, Dieuwke Hupkes
  • for: This study examines whether evaluation datasets designed to measure compositional generalization actually measure model capabilities consistently.
  • methods: Six modeling approaches are evaluated across 4 datasets split according to 8 compositional splitting strategies, ranking models by 18 compositional generalization splits in total.
  • results: The findings show that: (i) different datasets rank modeling approaches differently; (ii) human-generated datasets align better with each other than with synthetic datasets, or than synthetic datasets among themselves; (iii) whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and (iv) which lexical items are used in the data can strongly impact conclusions. Overall, more rigorous standards for establishing the validity of evaluation sets would benefit the field.
    Abstract NLP models have progressed drastically in recent years, according to numerous datasets proposed to evaluate performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across 4 datasets, split according to 8 compositional splitting strategies, ranking models by 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) datasets generated by humans align better with each other than they with synthetic datasets, or than synthetic datasets among themselves; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) which lexical items are used in the data can strongly impact conclusions. Overall, our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure, and suggest that elucidating more rigorous standards for establishing the validity of evaluation sets could benefit the field.
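Concurrence between benchmarks, i.e. whether two splits rank the same models the same way, is commonly measured with a rank correlation such as Kendall's tau. A self-contained sketch (the choice of correlation statistic is assumed, not quoted from the paper):

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score dicts over the same
    models (tied pairs are ignored in this simple variant)."""
    models = list(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        da = scores_a[m1] - scores_a[m2]
        db = scores_b[m1] - scores_b[m2]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) / 2
    return (concordant - discordant) / n_pairs

# accuracies of the same models on two compositional splits
split_x = {"m1": 0.81, "m2": 0.74, "m3": 0.62}
split_y = {"m1": 0.55, "m2": 0.60, "m3": 0.41}
print(kendall_tau(split_x, split_y))  # ~0.33: the splits partly disagree
```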

The IMS Toucan System for the Blizzard Challenge 2023

  • paper_url: http://arxiv.org/abs/2310.17499
  • repo_url: https://github.com/digitalphonetics/ims-toucan
  • paper_authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu
  • for: This paper describes a French speech synthesis system built for the Blizzard Challenge 2023.
  • methods: A rule-based text-to-phoneme processing system, including rule-based disambiguation of French homographs, converts text to phonemes; a fast, non-autoregressive synthesis architecture based on Conformer and Glow turns the phonemes into spectrograms as intermediate representations; and a GAN-based neural vocoder converts the spectrograms into the final waveform.
  • results: The system submitted to the Blizzard Challenge 2023 improves on the authors' submission to the Blizzard Challenge 2021.
    Abstract For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.

LightLM: A Lightweight Deep and Narrow Language Model for Generative Recommendation

  • paper_url: http://arxiv.org/abs/2310.17488
  • repo_url: https://github.com/dongyuanjushi/lightlm
  • paper_authors: Kai Mei, Yongfeng Zhang
  • for: This paper proposes a lightweight Transformer-based model for generative recommendation.
  • methods: A deep and narrow Transformer architecture tailored to recommendation tasks, combined with Spectral Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI) for user and item IDs, and a constrained generation process to curb hallucinated items.
  • results: Experiments on real-world datasets show that LightLM outperforms competitive baselines in both recommendation accuracy and efficiency.
    Abstract This paper presents LightLM, a lightweight Transformer-based language model for generative recommendation. While Transformer-based generative modeling has gained importance in various AI sub-fields such as NLP and vision, generative recommendation is still in its infancy due to its unique demand on personalized generative modeling. Existing works on generative recommendation often use NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are heavy-weight and are not specifically designed for recommendation tasks. LightLM tackles the issue by introducing a light-weight deep and narrow Transformer architecture, which is specifically tailored for direct generation of recommendation items. This structure is especially apt for straightforward generative recommendation and stems from the observation that language model does not have to be too wide for this task, as the input predominantly consists of short tokens that are well-suited for the model's capacity. We also show that our devised user and item ID indexing methods, i.e., Spectral Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables the deep and narrow Transformer architecture to outperform large-scale language models for recommendation. Besides, to address the hallucination problem of generating items as output, we propose the constrained generation process for generative recommenders. Experiments on real-world datasets show that LightLM outperforms various competitive baselines in terms of both recommendation accuracy and efficiency. The code can be found at https://github.com/dongyuanjushi/LightLM.
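The abstract describes the constrained generation process only at a high level. A common way to realize such a constraint (assumed here, not necessarily LightLM's implementation) is to restrict decoding at each step to tokens that can still complete a valid item ID, for example with a prefix trie:

```python
def build_trie(item_token_ids):
    """Prefix trie over the token sequences of all valid item IDs."""
    trie = {}
    for seq in item_token_ids:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens that keep the generated prefix on a path to a real item."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []          # prefix already invalid
        node = node[tok]
    return list(node)

# toy vocabulary: items are short token-ID sequences
items = [(5, 2, 9), (5, 2, 7), (5, 4, 1)]
trie = build_trie(items)
print(allowed_next_tokens(trie, (5,)))     # [2, 4]
print(allowed_next_tokens(trie, (5, 2)))   # [9, 7]
```

At each decoding step, logits for tokens outside the allowed set would be masked out, so the model can only ever emit an existing item.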

Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge

  • paper_url: http://arxiv.org/abs/2310.17448
  • repo_url: None
  • paper_authors: Tanel Alumäe, Jiaming Kong, Daniil Robnikov
  • for: This paper describes the Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge, which targets automatic speech recognition of dialect-rich Indian languages with limited training audio and text data.
  • methods: The approach departs from conventional fine-tuning of pretrained wav2vec2.0 models in two key respects: aligned data augmentation to increase the linguistic diversity of the training data, and deep prefix tuning for dialect adaptation of the models.
  • results: In both participating tracks, the approach achieved lower word error rates than the provided baselines, the lowest among all participating teams.
    Abstract This paper describes Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge. The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data. TalTech participated in two tracks of the challenge: Track 1 that allowed using only the provided training data and Track 3 which allowed using additional audio data. In both tracks, we relied on wav2vec2.0 models. Our methodology diverges from the traditional procedure of finetuning pretrained wav2vec2.0 models in two key points: firstly, through the implementation of the aligned data augmentation technique to enhance the linguistic diversity of the training data, and secondly, via the application of deep prefix tuning for dialect adaptation of wav2vec2.0 models. In both tracks, our approach yielded significant improvements over the provided baselines, achieving the lowest word error rates across all participating teams.
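Deep prefix tuning prepends trainable prefix vectors at every layer while the pretrained backbone stays frozen. A minimal PyTorch sketch of the "virtual token" variant of the idea, with illustrative names and dimensions rather than TalTech's actual implementation:

```python
import torch
import torch.nn as nn

class PrefixTunedAttention(nn.Module):
    """Sketch of prefix tuning for one attention layer: trainable prefix
    vectors are prepended to the key/value inputs while the pretrained
    layer itself stays frozen. Applied at every layer, this is "deep"
    prefix tuning."""

    def __init__(self, attn: nn.MultiheadAttention, prefix_len: int, d_model: int):
        super().__init__()
        self.attn = attn
        for p in self.attn.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        batch = x.size(1)
        pk = self.prefix_k.unsqueeze(1).expand(-1, batch, -1)
        pv = self.prefix_v.unsqueeze(1).expand(-1, batch, -1)
        out, _ = self.attn(x, torch.cat([pk, x], 0), torch.cat([pv, x], 0))
        return out

layer = PrefixTunedAttention(nn.MultiheadAttention(embed_dim=16, num_heads=4),
                             prefix_len=8, d_model=16)
print(layer(torch.randn(10, 2, 16)).shape)  # torch.Size([10, 2, 16])
```

Only the prefix parameters receive gradients, which is what makes this family of methods parameter-efficient for dialect adaptation.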

“Fifty Shades of Bias”: Normative Ratings of Gender Bias in GPT Generated English Text

  • paper_url: http://arxiv.org/abs/2310.17428
  • repo_url: None
  • paper_authors: Rishav Hada, Agrima Seth, Harshita Diddee, Kalika Bali
  • for: This paper aims to investigate the gender bias in language models and its impact on the generation of text.
  • methods: The authors use a dataset of GPT-generated English text with normative ratings of gender bias, and employ Best–Worst Scaling to obtain the ratings. They also analyze the variation of themes of gender biases in the observed ranking.
  • results: The authors show that identity-attack is most closely related to gender bias, and evaluate the performance of existing automated models trained on related concepts on their dataset.
    Abstract Language serves as a powerful tool for the manifestation of societal belief systems. In doing so, it also perpetuates the prevalent biases in our society. Gender bias is one of the most pervasive biases in our society and is seen in online and offline discourses. With LLMs increasingly gaining human-like fluency in text generation, gaining a nuanced understanding of the biases these systems can generate is imperative. Prior work often treats gender bias as a binary classification task. However, acknowledging that bias must be perceived at a relative scale, we investigate the generation and consequent receptivity of manual annotators to bias of varying degrees. Specifically, we create the first dataset of GPT-generated English text with normative ratings of gender bias. Ratings were obtained using Best–Worst Scaling, an efficient comparative annotation framework. Next, we systematically analyze the variation of themes of gender biases in the observed ranking and show that identity-attack is most closely related to gender bias. Finally, we show the performance of existing automated models trained on related concepts on our dataset.
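In Best–Worst Scaling, annotators see small tuples of items and mark the most and least biased; per-item scores are conventionally the fraction of times an item is chosen best minus the fraction chosen worst. A sketch of that standard counting procedure (standard BWS practice, assumed rather than taken from the paper):

```python
from collections import defaultdict

def bws_scores(annotations):
    """annotations: list of (tuple_of_items, best_item, worst_item).
    Returns item -> (%best - %worst), a score in [-1, 1]."""
    best = defaultdict(int)
    worst = defaultdict(int)
    appearances = defaultdict(int)
    for items, b, w in annotations:
        for it in items:
            appearances[it] += 1
        best[b] += 1
        worst[w] += 1
    return {it: (best[it] - worst[it]) / appearances[it] for it in appearances}

anns = [
    (("s1", "s2", "s3", "s4"), "s1", "s4"),
    (("s1", "s2", "s3", "s4"), "s1", "s3"),
    (("s1", "s2", "s3", "s4"), "s2", "s4"),
]
print(bws_scores(anns))  # s1 ~0.67, s2 ~0.33, s3 ~-0.33, s4 ~-0.67
```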

  • paper_url: http://arxiv.org/abs/2310.17413
  • repo_url: None
  • paper_authors: Anas Belfathi, Nicolas Hernandez, Laura Monceaux
  • for: This work presents a comprehensive study of one-stage elicitation techniques for querying a large pretrained generative transformer (GPT-3.5-turbo) on the rhetorical role prediction task for legal cases, a task known to require textual context.
  • methods: Strategies explored include zero- and few-shot prompting, task specification with definitions and clarification of annotation ambiguities, and the use of textual context and reasoning via general prompts and specific questions.
  • results: The number of examples, the definition of labels, the presentation of the textual context, and specific questions about that context all improve model performance. Across test set configurations, prompting with a few labelled examples from direct context reaches a weighted F1 score of 72%, but a gap remains to the best systems (86%), which require dedicated resources, architectures, and training.
    Abstract We propose a comprehensive study of one-stage elicitation techniques for querying a large pre-trained generative transformer (GPT-3.5-turbo) in the rhetorical role prediction task of legal cases. This task is known to require textual context to be addressed. Our study explores strategies such as zero-shot and few-shot prompting, task specification with definitions and clarification of annotation ambiguities, textual context and reasoning with general prompts and specific questions. We show that the number of examples, the definition of labels, the presentation of the (labelled) textual context and specific questions about this context have a positive influence on the performance of the model. Given non-equivalent test set configurations, we observed that prompting with a few labelled examples from direct context can lead the model to a better performance than a supervised fine-tuned multi-class classifier based on the BERT encoder (weighted F1 score of 72%). But there is still a gap to reach the performance of the best systems (86%) in the LegalEval 2023 task which, on the other hand, require dedicated resources, architectures and training.
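A sketch of what a one-stage elicitation prompt with label definitions, few-shot examples, and surrounding context could look like; the wording and label set below are illustrative, not the paper's actual prompts:

```python
def build_prompt(definitions, examples, context, target_sentence):
    """Assemble a single prompt: task definitions, few-shot examples,
    the surrounding context, and a specific question."""
    parts = ["Classify the rhetorical role of a sentence in a legal case.",
             "Labels:"]
    parts += [f"- {label}: {defn}" for label, defn in definitions.items()]
    parts += [f'Sentence: "{s}" -> {label}' for s, label in examples]
    parts.append("Context:\n" + "\n".join(context))
    parts.append(f'What is the rhetorical role of: "{target_sentence}"?')
    return "\n".join(parts)

prompt = build_prompt(
    definitions={"FACTS": "events that led to the case",
                 "RULING": "the court's final decision"},
    examples=[("The appellant filed the complaint in 2019.", "FACTS")],
    context=["The court heard both parties.", "Costs were awarded."],
    target_sentence="The appeal is dismissed.",
)
print(prompt)
```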

Tackling the Matrix Multiplication Micro-kernel Generation with Exo

  • paper_url: http://arxiv.org/abs/2310.17408
  • repo_url: https://github.com/adcastel/exo_ukr_generator
  • paper_authors: Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, Héctor Martínez
  • for: This work aims to improve the optimization of matrix multiplication (GEMM), the flagship operation of current linear algebra libraries such as BLIS, OpenBLAS, and Intel OneAPI.
  • methods: Micro-kernel code is generated with the Exo compiler, with the hardware target fully specified by a concise library-based description of its instructions, which also improves code portability.
  • results: The generated micro-kernels perform close to (or even better than) manually developed micro-kernels written with intrinsic functions or assembly language, while improving the portability of the code.
    Abstract The optimization of the matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the Exo compiler that performs close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
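The GotoBLAS structure the abstract refers to tiles the operands with outer loops while an inner micro-kernel updates a small MR x NR block of C. A pure-Python sketch of that loop structure (real micro-kernels are vectorized, hardware-specific code; this only illustrates the tiling, assuming dimensions divisible by the tile sizes):

```python
def micro_kernel(A, B, C, i0, j0, k0, MR, NR, KC):
    """Update the MR x NR block of C at (i0, j0) with a KC-long rank update.
    In a real library this is the hand-written (or Exo-generated) kernel."""
    for i in range(i0, i0 + MR):
        for j in range(j0, j0 + NR):
            acc = C[i][j]
            for k in range(k0, k0 + KC):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc

def gemm(A, B, M, N, K, MR=2, NR=2, KC=2):
    """C = A @ B via tiled loops around a micro-kernel."""
    C = [[0.0] * N for _ in range(M)]
    for k0 in range(0, K, KC):
        for j0 in range(0, N, NR):
            for i0 in range(0, M, MR):
                micro_kernel(A, B, C, i0, j0, k0, MR, NR, KC)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm(A, B, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```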

Meaning and understanding in large language models

  • paper_url: http://arxiv.org/abs/2310.17407
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Vladimír Havlík
  • for: This article surveys the development of modern generative language models and critically evaluates traditional philosophical assumptions about machine understanding of language.
  • methods: Generative large language models are examined to assess whether state-of-the-art models can be said to genuinely understand language.
  • results: The article argues that state-of-the-art language models can exhibit deep language understanding, not merely superficial syntactic processing.
    Abstract Can a machine understand the meanings of natural language? Recent developments in the generative large language models (LLMs) of artificial intelligence have led to the belief that traditional philosophical assumptions about machine understanding of language need to be revised. This article critically evaluates the prevailing tendency to regard machine language performance as mere syntactic manipulation and the simulation of understanding, which is only partial and very shallow, without sufficient referential grounding in the world. The aim is to highlight the conditions crucial to attributing natural language understanding to state-of-the-art LLMs, where it can be legitimately argued that LLMs not only use syntax but also semantics, their understanding not being simulated but duplicated; and determine how they ground the meanings of linguistic expressions.

Language and Mental Health: Measures of Emotion Dynamics from Text as Linguistic Biosocial Markers

  • paper_url: http://arxiv.org/abs/2310.17369
  • repo_url: None
  • paper_authors: Daniela Teodorescu, Tiffany Cheng, Alona Fyshe, Saif M. Mohammad
  • for: This paper investigates the relationship between tweet emotion dynamics and mental health disorders.
  • methods: The authors use a dataset of tweets and employ recent approaches to determining emotion dynamics from everyday utterances.
  • results: Each of the emotion dynamics metrics studied varies by the user's self-disclosed diagnosis, and linguistic cues pertaining to emotion dynamics can play a crucial role as biosocial markers for mental illnesses.
    Abstract Research in psychopathology has shown that, at an aggregate level, the patterns of emotional change over time (emotion dynamics) are indicators of one's mental health. One's patterns of emotion change have traditionally been determined through self-reports of emotions; however, there are known issues with accuracy, bias, and convenience. Recent approaches to determining emotion dynamics from one's everyday utterances address many of these concerns, but it is not yet known whether these measures of utterance emotion dynamics (UED) correlate with mental health diagnoses. Here, for the first time, we study the relationship between tweet emotion dynamics and mental health disorders. We find that each of the UED metrics studied varied by the user's self-disclosed diagnosis. For example: average valence was significantly higher (i.e., more positive text) in the control group compared to users with ADHD, MDD, and PTSD. Valence variability was significantly lower in the control group compared to ADHD, depression, bipolar disorder, MDD, PTSD, and OCD but not PPD. Rise and recovery rates of valence also exhibited significant differences from the control. This work provides important early evidence for how linguistic cues pertaining to emotion dynamics can play a crucial role as biosocial markers for mental illnesses and aid in the understanding, diagnosis, and management of mental health disorders.
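A sketch of utterance emotion dynamics (UED) style metrics over a chronological per-tweet valence series: average valence, valence variability, and rates of movement away from and back toward the user's home base. The definitions below are simplified stand-ins for the metrics used in the literature:

```python
from statistics import mean, pstdev

def ued_metrics(valence):
    """valence: chronological per-utterance valence scores in [-1, 1]."""
    home = mean(valence)                      # the user's "home base"
    deltas = [b - a for a, b in zip(valence, valence[1:])]
    rises, recoveries = [], []
    for prev, step in zip(valence, deltas):
        # moving away: the step pushes valence further from the home base
        moving_away = (prev - home) * step > 0 or (prev == home and step != 0)
        (rises if moving_away else recoveries).append(abs(step))
    return {
        "avg_valence": home,
        "valence_variability": pstdev(valence),
        "rise_rate": mean(rises) if rises else 0.0,
        "recovery_rate": mean(recoveries) if recoveries else 0.0,
    }

series = [0.1, 0.4, 0.6, 0.3, 0.0, 0.2]
print(ued_metrics(series))
```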

ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought

  • paper_url: http://arxiv.org/abs/2310.17342
  • repo_url: https://github.com/x-lance/text2sql-gpt
  • paper_authors: Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, Kai Yu
  • for: Improving the reasoning ability of large language models (LLMs) on the text-to-SQL task.
  • methods: Chain-of-thought (CoT) prompting designed in the spirit of schema linking, with ACT-SQL automatically generating auto-CoT exemplars so that no manual labeling is needed.
  • results: LLM performance improves, the approach extends well to the multi-turn text-to-SQL task, and it reaches state-of-the-art results among in-context learning approaches on the Spider dev set.
    Abstract Recently Large Language Models (LLMs) have been proven to have strong abilities in various domains and tasks. We study the problem of prompt designing in the text-to-SQL task and attempt to improve the LLMs' reasoning ability when generating SQL queries. Besides the trivial few-shot in-context learning setting, we design our chain-of-thought (CoT) prompt with a similar method to schema linking. We provide a method named ACT-SQL to automatically generate auto-CoT exemplars and thus the whole process doesn't need manual labeling. Our approach is cost-saving since we only use the LLMs' API call once when generating one SQL query. Furthermore, we extend our in-context learning method to the multi-turn text-to-SQL task. The experiment results show that the LLMs' performance can benefit from our ACT-SQL approach. Our approach achieves SOTA performance on the Spider dev set among existing in-context learning approaches.
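The schema-linking idea behind auto-CoT exemplars can be sketched as matching question tokens against table and column names and emitting the matches as reasoning steps before the SQL. The exemplar format below is illustrative; ACT-SQL's actual templates may differ:

```python
def schema_link(question, schema):
    """schema: {table: [columns]}. Returns (question word, table.column) links."""
    links = []
    q_tokens = question.lower().replace("?", "").split()
    for table, columns in schema.items():
        for col in columns:
            if col.lower() in q_tokens:
                links.append((col, f"{table}.{col}"))
    return links

def auto_cot_exemplar(question, schema, gold_sql):
    """Turn a (question, SQL) pair into a chain-of-thought exemplar."""
    steps = [f'The word "{w}" refers to column {ref}.'
             for w, ref in schema_link(question, schema)]
    cot = " ".join(steps) or "No direct column mentions found."
    return f"Question: {question}\nReasoning: {cot}\nSQL: {gold_sql}"

schema = {"singer": ["name", "age", "country"]}
print(auto_cot_exemplar("What is the average age of all singers?",
                        schema, "SELECT avg(age) FROM singer"))
```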

Arabic Fine-Grained Entity Recognition

  • paper_url: http://arxiv.org/abs/2310.17333
  • repo_url: None
  • paper_authors: Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad AbdulMageed
  • for: This paper aims to advance Arabic Named Entity Recognition (NER) with fine-grained entities, extending the Wojood corpus with 31 subtypes for four main entity types (GPE, LOC, ORG, and FAC).
  • methods: The authors first revised Wojood's annotations to be compatible with the LDC's ACE guidelines, then manually annotated all mentions of GPE, LOC, ORG, and FAC (~44K) with the LDC's ACE sub-types. They also fine-tuned three pre-trained Arabic BERT encoders in three settings to compute the baselines of WojoodFine.
  • results: Inter-annotator agreement (IAA) reached 0.9861 and 0.9889 using Cohen's Kappa and F1 score, respectively; fine-tuning the pre-trained Arabic BERT encoders in three settings achieved F1 scores of 0.920, 0.866, and 0.885.
    Abstract Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE sub-types. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of WojoodFine, we fine-tune three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER and nested NER with subtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
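Cohen's kappa, used here for inter-annotator agreement, corrects raw agreement for the agreement expected by chance; a self-contained sketch of the computation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["GPE", "LOC", "ORG", "GPE", "FAC", "GPE"]
b = ["GPE", "LOC", "ORG", "LOC", "FAC", "GPE"]
print(round(cohens_kappa(a, b), 3))  # 0.769
```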

Nabra: Syrian Arabic Dialects with Morphological Annotations

  • paper_url: http://arxiv.org/abs/2310.17315
  • repo_url: None
  • paper_authors: Amal Nayouf, Tymaa Hammouda, Mustafa Jarrar, Fadi Zaraket, Mohamad-Bassam Kurdy
  • for: To provide a corpus of Syrian Arabic dialects with morphological annotations for linguistic research and applications.
  • methods: More than 6K sentences (about 60K words) were collected from sources including social media posts, scripts of movies and series, song lyrics, and local proverbs, and fully morphologically annotated.
  • results: Annotation by nine native annotators yielded high quality, with F1 and kappa agreement scores ranging between 74% and 98% across features; the corpus is openly available through the Currasat portal: https://sina.birzeit.edu/currasat.
    Abstract This paper presents Nabra, a corpora of Syrian Arabic dialects with morphological annotations. A team of Syrian natives collected more than 6K sentences containing about 60K words from several sources including social media posts, scripts of movies and series, lyrics of songs and local proverbs to build Nabra. Nabra covers several local Syrian dialects including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda. A team of nine annotators annotated the 60K tokens with full morphological annotations across sentence contexts. We trained the annotators to follow methodological annotation guidelines to ensure unique morpheme annotations, and normalized the annotations. F1 and kappa agreement scores ranged between 74% and 98% across features, showing the excellent quality of Nabra annotations. Our corpora are open-source and publicly available as part of the Currasat portal https://sina.birzeit.edu/currasat.

An Ensemble Method Based on the Combination of Transformers with Convolutional Neural Networks to Detect Artificially Generated Text

  • paper_url: http://arxiv.org/abs/2310.17312
  • repo_url: None
  • paper_authors: Vijini Liyanage, Davide Buscaldi
  • for: This work investigates whether text produced by natural language generation (NLG) systems can be automatically classified as artificially generated or human-written.
  • methods: Ensembles of transformer models, including Sci-BERT, DeBERTa, and XLNet, combined with convolutional neural networks (CNNs).
  • results: The ensemble architectures surpass the performance of the individual transformer models, and the SciBERT-CNN ensemble reaches an F1-score of 98.36% on the ALTA shared task 2023 data.
    Abstract Thanks to the state-of-the-art Large Language Models (LLMs), language generation has reached outstanding levels. These models are capable of generating high quality content, thus making it a challenging task to detect generated text from human-written content. Despite the advantages provided by Natural Language Generation, the inability to distinguish automatically generated text can raise ethical concerns in terms of authenticity. Consequently, it is important to design and develop methodologies to detect artificial content. In our work, we present some classification models constructed by ensembling transformer models such as Sci-BERT, DeBERTa and XLNet, with Convolutional Neural Networks (CNNs). Our experiments demonstrate that the considered ensemble architectures surpass the performance of the individual transformer models for classification. Furthermore, the proposed SciBERT-CNN ensemble model produced an F1-score of 98.36% on the ALTA shared task 2023 data.
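One simple way to combine the transformer and CNN classifiers is soft voting over each model's class probabilities; the combination rule here is an assumption, as the paper may use a different ensembling scheme:

```python
def soft_vote(prob_lists, weights=None):
    """prob_lists: one probability vector per model, e.g. [P(human), P(AI)].
    Returns the averaged distribution and the winning class index."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_lists[0])
    avg = [sum(w * probs[c] for w, probs in zip(weights, prob_lists))
           for c in range(n_classes)]
    return avg, max(range(n_classes), key=avg.__getitem__)

# per-model P(human-written), P(AI-generated) for one document
scibert_cnn = [0.10, 0.90]
deberta = [0.30, 0.70]
xlnet = [0.45, 0.55]
avg, label = soft_vote([scibert_cnn, deberta, xlnet])
print(avg, "-> class", label)  # [~0.28, ~0.72] -> class 1 (AI-generated)
```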

Learning to Abstract with Nonparametric Variational Information Bottleneck

  • paper_url: http://arxiv.org/abs/2310.17284
  • repo_url: None
  • paper_authors: Melika Behjati, Fabio Fehr, James Henderson
  • for: Improving the abstraction and robustness of language representation models.
  • methods: Nonparametric Variational Information Bottleneck (NVIB) is applied to stacked Transformer self-attention layers in the encoder, encouraging information-theoretic compression to different levels of abstraction at different layers of the same model.
  • results: Layers within the model correspond to increasing levels of abstraction, their representations are more linguistically informed, and the compressed model is more robust to adversarial perturbations.
    Abstract Learned representations at the level of characters, sub-words, words and sentences, have each contributed to advances in understanding different NLP tasks and linguistic phenomena. However, learning textual embeddings is costly as they are tokenization specific and require different models to be trained for each level of abstraction. We introduce a novel language representation model which can learn to compress to different levels of abstraction at different layers of the same model. We apply Nonparametric Variational Information Bottleneck (NVIB) to stacked Transformer self-attention layers in the encoder, which encourages an information-theoretic compression of the representations through the model. We find that the layers within the model correspond to increasing levels of abstraction and that their representations are more linguistically informed. Finally, we show that NVIB compression results in a model which is more robust to adversarial perturbations.

Automatic Logical Forms improve fidelity in Table-to-Text generation

  • paper_url: http://arxiv.org/abs/2310.17279
  • repo_url: https://github.com/alonsoapp/tlt
  • paper_authors: Iñigo Alonso, Eneko Agirre
  • for: This paper addresses generating natural language statements from tables.
  • methods: Automatically generated logical forms (LFs), which represent the selected content and the semantics of the target text, are produced before generating the textual statement.
  • results: Automatic LFs improve factual fidelity by 30 points over a comparable system without LFs; the main remaining challenges for high factual correctness are, in order, automatic content selection, Logic-to-Text generation, and, to a lesser extent, Table-to-Logic parsing.
    Abstract Table-to-text systems generate natural language statements from structured data like tables. While end-to-end techniques suffer from low factual correctness (fidelity), a previous study reported gains when using manual logical forms (LF) that represent the selected content and the semantics of the target text. Given the manual step, it was not clear whether automatic LFs would be effective, or whether the improvement came from content selection alone. We present TlT which, given a table and a selection of the content, first produces LFs and then the textual statement. We show for the first time that automatic LFs improve quality, with an increase in fidelity of 30 points over a comparable system not using LFs. Our experiments allow to quantify the remaining challenges for high factual correctness, with automatic selection of content coming first, followed by better Logic-to-Text generation and, to a lesser extent, better Table-to-Logic parsing.

Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

  • paper_url: http://arxiv.org/abs/2310.17271
  • repo_url: None
  • paper_authors: Ahmed Alajrami, Katerina Margatina, Nikolaos Aletras
  • for: This work investigates how pretrained language models (PLMs) learn about language, specifically how information loss in input token characters affects performance.
  • methods: Language models are pretrained using small subsets of characters from individual tokens.
  • results: Surprisingly, even under extreme settings (using only one character of each token), performance retention on standard NLU benchmarks and probing tasks is high relative to full-token models; a model pretrained only on the single first character of each token retains approximately 90% and 77% of the full-token model's performance on SuperGLUE and GLUE, respectively.
    Abstract Understanding how and what pre-trained language models (PLMs) learn about language is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how information loss in input token characters affects the performance of PLMs. In this study, we address this gap by pre-training language models using small subsets of characters from individual tokens. Surprisingly, we find that pre-training even under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks compared to full-token models is high. For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately 90% and 77% of the full-token model in SuperGLUE and GLUE tasks, respectively.
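The extreme setting in the abstract, pretraining on only the first character of each token, amounts to a simple corpus transformation applied before pretraining. A sketch of that preprocessing step (the authors' exact pipeline is assumed, not reproduced):

```python
def reduce_tokens(text, keep=1, position="first"):
    """Keep only `keep` characters of each whitespace token.
    position="first" keeps the leading characters, "last" the trailing ones."""
    out = []
    for token in text.split():
        out.append(token[:keep] if position == "first" else token[-keep:])
    return " ".join(out)

sentence = "Understanding how language models learn is an open challenge"
print(reduce_tokens(sentence))           # "U h l m l i a o c"
print(reduce_tokens(sentence, keep=2))   # "Un ho la mo le is an op ch"
```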

EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning

  • paper_url: http://arxiv.org/abs/2310.17233
  • repo_url: None
  • paper_authors: Ping Guo, Xiangpeng Wei, Yue Hu, Baosong Yang, Dayiheng Liu, Fei Huang, Jun Xie
  • for: This work studies how to learn shared cross-lingual representations that benefit machine translation and other language-processing tasks.
  • methods: EMMA-X, an EM-like multilingual pre-training algorithm, learns cross-lingual universals from large amounts of multilingual non-parallel data by unifying the cross-lingual representation learning task with an extra semantic relation prediction task inside an EM framework.
  • results: EMMA-X achieves state-of-the-art performance on the newly introduced XRETE benchmark, and geometric analysis of the learned representation space against three requirements demonstrates its superiority.
    Abstract Expressing universal semantics common to all languages is helpful in understanding the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic "universals" for any two languages. In this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm, to learn (X)Cross-lingual universals with the aid of excessive multilingual non-parallel data. EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each other until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of EMMA-X over advanced models.

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17230
  • repo_url: https://github.com/taufeeque9/codebook-features
  • paper_authors: Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
  • for: The method aims to make the behaviour and internal states of neural networks easier to understand.
  • methods: A vector quantization bottleneck compresses each layer's latent features, yielding a network whose hidden states are sparse, discrete codebook features.
  • results: The approach causes only modest performance degradation and offers a handle for controlling network behaviour: codebook Transformers trained on several datasets can be steered toward a state or topic by activating the corresponding codes.
    Abstract Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
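The core mechanism is a vector-quantization bottleneck: each hidden state is replaced by the sum of its k nearest codebook vectors. A minimal PyTorch sketch of the forward pass (training details such as straight-through gradient estimation are omitted):

```python
import torch
import torch.nn as nn

class CodebookBottleneck(nn.Module):
    """Sketch of a codebook layer: each hidden vector is replaced by the
    sum of its k nearest codes from a learned codebook."""

    def __init__(self, codebook_size: int, dim: int, k: int = 1):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(codebook_size, dim))
        self.k = k

    def forward(self, h):
        flat = h.reshape(-1, h.size(-1))
        dists = torch.cdist(flat, self.codes)            # distance to every code
        idx = dists.topk(self.k, largest=False).indices  # k nearest codes
        quantized = self.codes[idx].sum(dim=1)           # sparse, discrete state
        return quantized.reshape(h.shape), idx

bottleneck = CodebookBottleneck(codebook_size=512, dim=16, k=2)
out, codes = bottleneck(torch.randn(4, 10, 16))          # (batch, seq, dim)
print(out.shape, codes.shape)  # torch.Size([4, 10, 16]) torch.Size([40, 2])
```

The discrete code indices are what make the representation inspectable, and steering then amounts to activating a chosen code's vector at inference time.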

X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity

  • paper_url: http://arxiv.org/abs/2310.17166
  • repo_url: None
  • paper_authors: Taejun Yun, Jinhyeon Kim, Deokyeong Kang, Seong Hoon Lim, Jihoon Kim, Taeuk Kim
  • for: This work aims to predict the compatibility of language pairs for cross-lingual transfer (XLT), so that suitable source languages can be selected to improve model performance.
  • methods: Sub-network similarity between two languages is used as a proxy for XLT compatibility; the approach is model-oriented, reflecting the inner workings of foundation models, and requires only a moderate amount of raw text from candidate languages, unlike most previous methods.
  • results: The method is more effective than baselines across diverse tasks, improving zero-shot XLT candidate ranking by 4.6% on average in terms of NDCG@3; detailed analyses confirm the utility of sub-network similarity for XLT prediction.
    Abstract Cross-lingual transfer (XLT) is an emergent ability of multilingual language models that preserves their performance on a task to a significant extent when evaluated in languages that were not included in the fine-tuning process. While English, due to its widespread usage, is typically regarded as the primary language for model adaption in various tasks, recent studies have revealed that the efficacy of XLT can be amplified by selecting the most appropriate source languages based on specific conditions. In this work, we propose the utilization of sub-network similarity between two languages as a proxy for predicting the compatibility of the languages in the context of XLT. Our approach is model-oriented, better reflecting the inner workings of foundation models. In addition, it requires only a moderate amount of raw text from candidate languages, distinguishing it from the majority of previous methods that rely on external resources. In experiments, we demonstrate that our method is more effective than baselines across diverse tasks. Specifically, it shows proficiency in ranking candidates for zero-shot XLT, achieving an improvement of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that confirm the utility of sub-networks for XLT prediction.
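One plausible reading of sub-network similarity (the abstract does not spell out the extraction procedure): score each parameter's importance per language, keep the top fraction as that language's sub-network, and compare the resulting masks, for example with Jaccard similarity:

```python
def subnetwork(importance, top_frac=0.1):
    """importance: {param_name: score}. Returns the set of top-scoring params."""
    k = max(1, int(len(importance) * top_frac))
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

# toy per-parameter importance scores (e.g. Fisher information) per language
de = {"w1": 0.9, "w2": 0.7, "w3": 0.1, "w4": 0.05, "w5": 0.4,
      "w6": 0.3, "w7": 0.2, "w8": 0.6, "w9": 0.15, "w10": 0.02}
nl = {"w1": 0.8, "w2": 0.1, "w3": 0.2, "w4": 0.7, "w5": 0.5,
      "w6": 0.3, "w7": 0.1, "w8": 0.65, "w9": 0.1, "w10": 0.05}
sub_de, sub_nl = subnetwork(de, 0.3), subnetwork(nl, 0.3)
print(jaccard(sub_de, sub_nl))  # 0.5; higher -> better XLT source candidate
```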

Supercharging academic writing with generative AI: framework, techniques, and caveats

  • paper_url: http://arxiv.org/abs/2310.17143
  • repo_url: None
  • paper_authors: Zhicheng Lin
  • for: This paper aims to improve the quality and efficiency of academic writing by leveraging generative artificial intelligence (AI) and large language models (LLMs).
  • methods: The authors propose a human-AI collaborative framework for writing that delineates the rationale, process, and nature of AI engagement, describe effective prompting techniques for incorporating AI into the writing routine, and outline strategies for maintaining rigorous scholarship.
  • results: The authors argue that the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.
    Abstract Academic writing is an indispensable yet laborious part of the research enterprise. This Perspective maps out principles and methods for using generative artificial intelligence (AI), specifically large language models (LLMs), to elevate the quality and efficiency of academic writing. We introduce a human-AI collaborative framework that delineates the rationale (why), process (how), and nature (what) of AI engagement in writing. The framework pinpoints both short-term and long-term reasons for engagement and their underlying mechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals the role of AI throughout the writing process, conceptualized through a two-stage model for human-AI collaborative writing, and the nature of AI assistance in writing, represented through a model of writing-assistance types and levels. Building on this framework, we describe effective prompting techniques for incorporating AI into the writing routine (outlining, drafting, and editing) as well as strategies for maintaining rigorous scholarship, adhering to varied journal policies, and avoiding overreliance on AI. Ultimately, the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.

M2C: Towards Automatic Multimodal Manga Complement

  • paper_url: http://arxiv.org/abs/2310.17130
  • repo_url: https://github.com/hc-guo/m2c
  • paper_authors: Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, Zhoujun Li
  • for: Enhancing manga understanding by combining visual and textual features.
  • methods: Large language models are used to mine event knowledge in comics (MCoT), and fine-grained visual prompts are used to support manga complement.
  • results: The FVP-M$^{2}$ method is shown to be effective for the Multimodal Manga Complement task.
    Abstract Multimodal manga analysis focuses on enhancing manga understanding with visual and textual features, which has attracted considerable attention from both natural language processing and computer vision communities. Currently, most comics are hand-drawn and prone to problems such as missing pages, text contamination, and aging, resulting in missing comic text content and seriously hindering human comprehension. In other words, the Multimodal Manga Complement (M2C) task has not been investigated, which aims to handle the aforementioned issues by providing a shared semantic space for vision and language understanding. To this end, we first propose the Multimodal Manga Complement task by establishing a new M2C benchmark dataset covering two languages. First, we design a manga argumentation method called MCoT to mine event knowledge in comics with large language models. Then, an effective baseline FVP-M$^{2}$ using fine-grained visual prompts is proposed to support manga complement. Extensive experimental results show the effectiveness of FVP-M$^{2}$ method for Multimodal Mange Complement.

Test-time Augmentation for Factual Probing

  • paper_url: http://arxiv.org/abs/2310.17121
  • repo_url: https://github.com/gokamoda/TTA4FactualProbing
  • paper_authors: Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui
  • for: Testing whether language models "know" certain world knowledge facts.
  • methods: Prompts probe the model's knowledge, and test-time augmentation (TTA) automatically augments and ensembles prompts at test time to reduce sensitivity to prompt variations.
  • results: TTA improves model calibration, so that model confidence better reflects prediction accuracy; prediction accuracy improves for some models but degrades for others, and error analysis identifies producing high-quality prompt variations as the main challenge.
    Abstract Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
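Test-time augmentation here means paraphrasing the probe and aggregating the answers. A sketch of the ensembling step, where the augmentation and query functions are toy placeholders for a real paraphraser and LM API:

```python
from collections import Counter

def tta_predict(prompt, augment, query, n_augments=4):
    """augment: prompt -> list of paraphrases; query: prompt -> answer string.
    Majority answer over all variants; the vote share doubles as confidence."""
    variants = [prompt] + augment(prompt)[:n_augments]
    answers = [query(v) for v in variants]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)

# toy stand-ins for a paraphraser and a language model
augment = lambda p: [p.replace("was born in", alt)
                     for alt in ("originates from", "comes from")]
query = lambda p: "Poland" if "from" not in p else "Warsaw"
answer, confidence = tta_predict("Marie Curie was born in [MASK].", augment, query)
print(answer, confidence)  # Warsaw 0.666...: the variants disagree
```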

FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge

  • paper_url: http://arxiv.org/abs/2310.17119
  • repo_url: None
  • paper_authors: Farima Fatahi Bayat, Kun Qian, Benjamin Han, Yisi Sang, Anton Belyi, Samira Khorshidi, Fei Wu, Ihab F. Ilyas, Yunyao Li
  • for: Detecting factual errors in textual information to support informed decision-making.
  • methods: The prototype tool FLEEK automatically extracts factual claims from text, gathers evidence from external knowledge sources, evaluates the factuality of each claim, and suggests revisions for identified errors based on the collected evidence.
  • results: Initial empirical evaluation shows that FLEEK detects factual errors with promising accuracy (77-85% F1).
    Abstract Detecting factual errors in textual information, whether generated by large language models (LLM) or curated by humans, is crucial for making informed decisions. LLMs' inability to attribute their claims to external knowledge and their tendency to hallucinate makes it difficult to rely on their responses. Humans, too, are prone to factual errors in their writing. Since manual detection and correction of factual errors is labor-intensive, developing an automatic approach can greatly reduce human effort. We present FLEEK, a prototype tool that automatically extracts factual claims from text, gathers evidence from external knowledge sources, evaluates the factuality of each claim, and suggests revisions for identified errors using the collected evidence. Initial empirical evaluation on fact error detection (77-85% F1) shows the potential of FLEEK. A video demo of FLEEK can be found at https://youtu.be/NapJFUlkPdQ.