cs.CL - 2023-07-20

What Twitter Data Tell Us about the Future?

  • paper_url: http://arxiv.org/abs/2308.02035
  • repo_url: None
  • paper_authors: Alina Landowska, Marek Robak, Maciej Skorski
  • for: This paper aims to investigate the futures projected by futurists on Twitter and explore the impact of language cues on anticipatory thinking among social media users.
  • methods: The study uses a compiled dataset of over 1 million publicly shared tweets by future influencers and develops a scalable NLP pipeline using state-of-the-art models. The research employs topic modeling techniques, such as LDA and BERT, to identify the topics and language cues used by futurists.
  • results: The study finds that the futurists’ language cues signal futures-in-the-making, which enhance social media users’ ability to anticipate and respond to their own scenarios in the present. The research identifies 15 topics from the LDA approach and 100 distinct topics from the BERTopic approach within the futurists’ tweets, providing insights into the futures anticipated by Twitter’s futurists.
    Abstract Anticipation is a fundamental human cognitive ability that involves thinking about and living towards the future. While language markers reflect anticipatory thinking, research on anticipation from the perspective of natural language processing is limited. This study aims to investigate the futures projected by futurists on Twitter and explore the impact of language cues on anticipatory thinking among social media users. We address the research questions of what futures Twitter's futurists anticipate and share, and how these anticipated futures can be modeled from social data. To investigate this, we review related works on anticipation, discuss the influence of language markers and prestigious individuals on anticipatory thinking, and present a taxonomy system categorizing futures into "present futures" and "future present". This research presents a compiled dataset of over 1 million publicly shared tweets by future influencers and develops a scalable NLP pipeline using SOTA models. The study identifies 15 topics from the LDA approach and 100 distinct topics from the BERTopic approach within the futurists' tweets. These findings contribute to the research on topic modelling and provide insights into the futures anticipated by Twitter's futurists. The research demonstrates that the futurists' language cues signal futures-in-the-making, enhancing social media users' ability to anticipate their own scenarios and respond to them in the present. The fully open-sourced dataset, interactive analysis, and reproducible source code are available for further exploration.
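As a rough illustration of the paper's two topic-modelling approaches, here is a minimal Python sketch; the tiny corpus, topic counts, and preprocessing are placeholders, not the authors' pipeline (BERTopic's dimensionality-reduction step needs a few hundred documents, so it is shown as commented usage only):

```python
# Minimal sketch of the two topic-modelling approaches compared in the paper.
from gensim import corpora
from gensim.models import LdaModel

tweets = ["ai will reshape knowledge work", "climate futures demand action now",
          "web3 and the metaverse converge", "ai tutors will change education"]

# LDA over a bag-of-words corpus (the paper reports 15 topics; 2 fits this toy data)
tokenized = [t.split() for t in tweets]
dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(doc) for doc in tokenized]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())

# BERTopic over raw text via transformer embeddings (the paper reports 100 topics);
# it needs a realistically sized corpus, hence shown as usage only:
# from bertopic import BERTopic
# topic_model = BERTopic(nr_topics=100)
# topics, probs = topic_model.fit_transform(large_tweet_corpus)
# print(topic_model.get_topic_info())
```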

FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

  • paper_url: http://arxiv.org/abs/2307.10867
  • repo_url: https://github.com/figcapshf/figcapshf
  • paper_authors: Ashish Singh, Prateek Agarwal, Zixuan Huang, Arpita Singh, Tong Yu, Sungchul Kim, Victor Bursztyn, Nikos Vlassis, Ryan A. Rossi
  • for: This paper aims to improve automatic caption generation for scientific figures so that captions better match reader needs.
  • methods: The paper introduces FigCaps-HF, a figure-to-caption framework that combines an automatic method for scoring the quality of figure-caption pairs with reinforcement learning with human feedback (RLHF) to optimize caption generation for reader preferences (a toy sketch of the RLHF-style update follows this entry).
  • results: FigCaps-HF improves caption generation over standard fine-tuning; with BLIP as the base model, the RLHF method yields mean gains of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and Meteor, respectively. The authors also release a large-scale benchmark dataset of figure-caption pairs with human feedback to support further RLHF research.
    Abstract Captions are crucial for understanding scientific visualizations and documents. Existing captioning methods for scientific figures rely on figure-caption pairs extracted from documents for training, many of which fall short with respect to metrics like helpfulness, explainability, and visual-descriptiveness [15], leading to generated captions being misaligned with reader preferences. To enable the generation of high-quality figure captions, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain expert feedback in generating captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating quality of figure-caption pairs, 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences. We demonstrate the effectiveness of our simple learning framework by improving performance over standard fine-tuning across different types of models. In particular, when using BLIP as the base model, our RLHF framework achieves a mean gain of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and Meteor, respectively. Finally, we release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.
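To make the RLHF idea concrete, here is a self-contained toy of the underlying REINFORCE-style update: a one-step categorical "captioner" learns to emit the token a stand-in reward model prefers. All names and rewards are illustrative; this is not the paper's implementation:

```python
import torch

vocab = ["chart", "figure", "plot", "noise"]
logits = torch.zeros(len(vocab), requires_grad=True)   # trivial one-step policy
opt = torch.optim.Adam([logits], lr=0.1)

def reward(token_id: int) -> float:
    # Stand-in for a reward model trained on human feedback about captions
    return 1.0 if vocab[token_id] == "figure" else 0.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                                   # sample a "caption"
    loss = -reward(action.item()) * dist.log_prob(action)    # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()

print(vocab[int(logits.argmax())])   # converges to the rewarded token, "figure"
```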

Adversarial Conversational Shaping for Intelligent Agents

  • paper_url: http://arxiv.org/abs/2307.11785
  • repo_url: None
  • paper_authors: Piotr Tarasiewicz, Sultan Kenjeyev, Ilana Sebag, Shehab Alshehabi
  • for: To improve the fluency and accuracy of intelligent conversational agents.
  • methods: The study uses a generative adversarial network with policy gradient (GANPG) and a generative adversarial network with reward for every generation step (REGS), trained in a reinforcement learning framework with different training details such as seq2seq and transformers (a toy REGS loss is sketched after this entry).
  • results: GANPG and REGS can improve a conversational agent's dialogue quality and accuracy, and the choice of training details affects model performance.
    Abstract The recent emergence of deep learning methods has enabled the research community to achieve state-of-the-art results in several domains including natural language processing. However, the current robocall system remains unstable and inaccurate: text generators and chat-bots can be tedious and misunderstand human-like dialogue. In this work, we study the performance of two models able to enhance an intelligent conversational agent through adversarial conversational shaping: a generative adversarial network with policy gradient (GANPG) and a generative adversarial network with reward for every generation step (REGS) based on the REGS model presented in Li et al. [18]. This model is able to assign rewards to both partially and fully generated text sequences. We discuss performance with different training details: seq2seq [36] and transformers [37] in a reinforcement learning framework.
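A hedged sketch of the REGS idea named above: the discriminator scores every prefix of a generated sequence, so the generator receives credit or blame at each step rather than only at the end. All tensors are toy stand-ins:

```python
import torch

def regs_generator_loss(step_log_probs: torch.Tensor,
                        prefix_scores: torch.Tensor) -> torch.Tensor:
    """step_log_probs: (T,) log-probs of the generated tokens;
    prefix_scores: (T,) discriminator scores D(prefix_1..t) in [0, 1]."""
    baseline = prefix_scores.mean()              # simple variance-reduction baseline
    return -((prefix_scores - baseline) * step_log_probs).sum()

log_probs = torch.log(torch.tensor([0.7, 0.5, 0.9]))   # toy generation trace
scores = torch.tensor([0.2, 0.6, 0.8])                 # toy per-prefix D scores
print(regs_generator_loss(log_probs, scores))
```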

Yelp Reviews and Food Types: A Comparative Analysis of Ratings, Sentiments, and Topics

  • paper_url: http://arxiv.org/abs/2307.10826
  • repo_url: None
  • paper_authors: Wenyu Liao, Yiqing Shi, Yujia Hu, Wei Quan
  • for: This study examines the relationship between Yelp reviews and food types, investigating how ratings, sentiments, and topics vary across different types of food.
  • methods: The study analyzes review ratings and sentiments, infers review topics with machine learning models, and clusters food types into four groups based on ratings and sentiments (a minimal clustering sketch follows this entry).
  • results: Some food types show similar ratings, sentiments, and topic distributions, while others have distinct patterns; reviewers tend to focus on different topics when reviewing certain food types.
    Abstract This study examines the relationship between Yelp reviews and food types, investigating how ratings, sentiments, and topics vary across different types of food. Specifically, we analyze how ratings and sentiments of reviews vary across food types, cluster food types based on ratings and sentiments, infer review topics using machine learning models, and compare topic distributions among different food types. Our analyses reveal that some food types have similar ratings, sentiments, and topic distributions, while others have distinct patterns. We identify four clusters of food types based on ratings and sentiments and find that reviewers tend to focus on different topics when reviewing certain food types. These findings have important implications for understanding user behavior and cultural influence on digital media platforms and promoting cross-cultural understanding and appreciation.
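A minimal sketch of the clustering step described above, grouping food types into four clusters by average rating and sentiment; the feature values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

food_types = ["sushi", "pizza", "bbq", "vegan", "tacos", "ramen", "burgers", "thai"]
# Columns: mean star rating, mean sentiment score (illustrative values only)
features = np.array([[4.3, 0.72], [3.9, 0.55], [4.1, 0.60], [4.4, 0.78],
                     [4.0, 0.58], [4.2, 0.70], [3.7, 0.45], [4.2, 0.68]])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
for food, label in zip(food_types, kmeans.labels_):
    print(f"{food}: cluster {label}")
```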

Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages

  • paper_url: http://arxiv.org/abs/2307.10814
  • repo_url: None
  • paper_authors: Ephrem Afele Retta, Richard Sutcliffe, Jabar Mahmood, Michael Abebe Berwo, Eiad Almekhlafi, Sajjad Ahmed Khan, Shehzad Ashraf Chaudhry, Mustafa Mhamed, Jun Feng
  • for: This study addresses speech emotion recognition (SER) for languages where training resources are scarce.
  • methods: The study experiments with cross-lingual and multilingual SER using Amharic, English, German, and Urdu data, mapping all labels to two classes so that languages can be compared and combined (see the sketch after this entry).
  • results: With Amharic as the target language, using English or German as the source gives the best results, and training on several non-Amharic languages yields higher accuracy than training on just one.
    Abstract In a conventional Speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language does not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German and Urdu. For Amharic, we use our own publicly-available Amharic Speech Emotion Dataset (ASED). For English, German and Urdu we use the existing RAVDESS, EMO-DB and URDU datasets. We followed previous research in mapping labels for all datasets to just two classes, positive and negative. Thus we can compare performance on different languages directly, and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers, AlexNet, VGGE (a proposed variant of VGG), and ResNet50. Results averaged for the three models were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. Similarly, German SER is more difficult, and Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each pair: Amharic<->German, Amharic<->English, and Amharic<->Urdu. Results with Amharic as target suggested that using English or German as source will give the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percent greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one non-Amharic language. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training a SER classifier when resources for a language are scarce.
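A hedged sketch of the protocol described above: all emotion labels collapse to two classes so corpora in different languages can be compared and combined. The positive/negative mapping and the commented loaders are assumptions, not the paper's exact code:

```python
POSITIVE = {"happy", "calm", "neutral", "surprise"}   # assumed mapping, for illustration

def to_binary(label: str) -> int:
    return 1 if label.lower() in POSITIVE else 0

# Cross-lingual trial (Experiment 2): train on one corpus, test on another.
# train_x, train_y = load_features("EMO-DB")   # source language: German
# test_x, test_y = load_features("ASED")       # target language: Amharic
# model.fit(train_x, [to_binary(y) for y in train_y])
# print(model.score(test_x, [to_binary(y) for y in test_y]))
print(to_binary("happy"), to_binary("angry"))   # -> 1 0
```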

Layer-wise Representation Fusion for Compositional Generalization

  • paper_url: http://arxiv.org/abs/2307.10799
  • repo_url: None
  • paper_authors: Yafang Zheng, Lei Lin, Zhaohong Lai, Binling Wang, Shan Liu, Biao Fu, Wenhao Rao, Peigen Ye, Yidong Chen, Xiaodong Shi
  • for: To improve the compositional generalization of sequence-to-sequence models, whose solutions, despite success across a broad range of applications, are argued to be less compositional than human-like generalization.
  • methods: The authors propose FuSion, an extension that introduces a fuse-attention module at each encoder and decoder layer to appropriately fuse information from previous layers back into the encoding and decoding process (a speculative sketch of the idea follows this entry).
  • results: FuSion achieves competitive and even state-of-the-art results on two realistic benchmarks, demonstrating the effectiveness of the proposal.
    Abstract Despite successes across a broad range of applications, sequence-to-sequence models' construct of solutions are argued to be less compositional than human-like generalization. There is mounting evidence that one of the reasons hindering compositional generalization is representations of the encoder and decoder uppermost layer are entangled. In other words, the syntactic and semantic representations of sequences are twisted inappropriately. However, most previous studies mainly concentrate on enhancing token-level semantic information to alleviate the representations entanglement problem, rather than composing and using the syntactic and semantic representations of sequences appropriately as humans do. In addition, we explain why the entanglement problem exists from the perspective of recent studies about training deeper Transformer, mainly owing to the ``shallow'' residual connections and its simple, one-step operations, which fails to fuse previous layers' information effectively. Starting from this finding and inspired by humans' strategies, we propose \textsc{FuSion} (\textbf{Fu}sing \textbf{S}yntactic and Semant\textbf{i}c Representati\textbf{on}s), an extension to sequence-to-sequence models to learn to fuse previous layers' information back into the encoding and decoding process appropriately through introducing a \emph{fuse-attention module} at each encoder and decoder layer. \textsc{FuSion} achieves competitive and even \textbf{state-of-the-art} results on two realistic benchmarks, which empirically demonstrates the effectiveness of our proposal.
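The following is a guess at what a fuse-attention layer could look like, inferred from the abstract rather than the authors' code: the current layer attends over the concatenated outputs of all previous layers and adds the result back residually:

```python
import torch
import torch.nn as nn

class FuseAttention(nn.Module):
    """Hypothetical fuse-attention module: fuses prior layers into the current one."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, prev_layers):
        # x: (B, T, D) current layer; prev_layers: list of (B, T, D) tensors
        memory = torch.cat(prev_layers, dim=1)   # stack prior layers as attention memory
        fused, _ = self.attn(query=x, key=memory, value=memory)
        return x + fused                          # residual fusion into the current state

x = torch.randn(2, 5, 32)
history = [torch.randn(2, 5, 32) for _ in range(3)]
print(FuseAttention(32)(x, history).shape)        # torch.Size([2, 5, 32])
```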

Extreme Multi-Label Skill Extraction Training using Large Language Models

  • paper_url: http://arxiv.org/abs/2307.10778
  • repo_url: None
  • paper_authors: Jens-Joris Decorte, Severine Verlinden, Jeroen Van Hautte, Johannes Deleu, Chris Develder, Thomas Demeester
  • for: To improve the automatic detection of skills in online job ads, a key input for labor market analysis and e-recruitment.
  • methods: The task is framed as extreme multi-label classification (XMLC): detecting skills (mentioned literally or described implicitly) and linking them to a large skill ontology. Lacking a sizable labeled dataset, the study uses general large language models (LLMs) to generate a fully synthetic labeled dataset and trains with a contrastive learning strategy (sketched after this entry).
  • results: The approach yields a consistent gain of 15 to 25 percentage points in R-Precision@5 across three skill extraction benchmarks, compared to previously published results that relied solely on distant supervision through literal matches.
    Abstract Online job ads serve as a valuable source of information for skill requirements, playing a crucial role in labor market analysis and e-recruitment processes. Since such ads are typically formatted in free text, natural language processing (NLP) technologies are required to automatically process them. We specifically focus on the task of detecting skills (mentioned literally, or implicitly described) and linking them to a large skill ontology, making it a challenging case of extreme multi-label classification (XMLC). Given that no sizable labeled (training) dataset is available for this specific XMLC task, we propose techniques to leverage general Large Language Models (LLMs). We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction, and present a contrastive learning strategy that proves effective in the task. Our results across three skill extraction benchmarks show a consistent increase of between 15 to 25 percentage points in \textit{R-Precision@5} compared to previously published results that relied solely on distant supervision through literal matches.
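A self-contained sketch of one common contrastive objective (InfoNCE-style) that could implement the strategy above, pulling job-ad spans toward their matching skill embeddings; the encoders here are random stand-ins, and the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def info_nce(span_emb: torch.Tensor, skill_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """span_emb, skill_emb: (B, D); row i of each forms a matching pair."""
    span_emb = F.normalize(span_emb, dim=-1)
    skill_emb = F.normalize(skill_emb, dim=-1)
    logits = span_emb @ skill_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(span_emb))           # diagonal entries are positives
    return F.cross_entropy(logits, targets)

spans = torch.randn(8, 128)    # stand-in for encoded job-ad spans
skills = torch.randn(8, 128)   # stand-in for encoded skill names
print(info_nce(spans, skills))
```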

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2307.10757
  • repo_url: https://github.com/happycolor/vesper
  • paper_authors: Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu
  • for: This paper presents a paradigm for adapting general large-scale pretrained models (PTMs) to the speech emotion recognition task, producing task-specific models that are both compact and effective.
  • methods: The authors propose Vesper, an improved emotion-specific pretrained encoder built on WavLM. Vesper uses an emotion-guided masking strategy to identify the regions to mask (a loose illustration follows this entry) and hierarchical, cross-layer self-supervision to capture acoustic and semantic representations.
  • results: On IEMOCAP, MELD, and CREMA-D, Vesper with 4 layers outperforms WavLM Base with 12 layers, and Vesper with 12 layers surpasses WavLM Large with 24 layers.
    Abstract This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. Above limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.
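As a loose illustration of an "emotion-guided" masking strategy, the toy below preferentially masks the highest-energy frames, using energy as a crude stand-in for emotional salience; the actual guidance signal in the paper is different and learned:

```python
import torch

def emotion_guided_mask(frames: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """frames: (T, D) acoustic features; returns a boolean mask of shape (T,)."""
    salience = frames.pow(2).mean(dim=-1)        # per-frame energy as a salience proxy
    k = max(1, int(mask_ratio * len(frames)))
    idx = salience.topk(k).indices               # mask the most salient frames
    mask = torch.zeros(len(frames), dtype=torch.bool)
    mask[idx] = True
    return mask

feats = torch.randn(100, 40)
m = emotion_guided_mask(feats)
print(m.sum().item(), "of", len(m), "frames masked")
```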

Large language models shape and are shaped by society: A survey of arXiv publication patterns

  • paper_url: http://arxiv.org/abs/2307.10700
  • repo_url: None
  • paper_authors: Rajiv Movva, Sidhika Balachandar, Kenny Peng, Gabriel Agostini, Nikhil Garg, Emma Pierson
  • for: This study analyzes publication patterns in large language model (LLM) papers, comparing 2023 against 2018-2022.
  • methods: Based on 388K papers posted on the CS and Stat arXivs, the study analyzes the growing proportion of LLM papers, the LLM-related topics receiving the most attention, the authors writing LLM papers and how their research topics correlate with their backgrounds, the factors distinguishing highly cited papers, and patterns of international collaboration (a tiny example of the basic ratio appears after this entry).
  • results: LLM research increasingly focuses on societal impacts: there has been an 18x increase in the proportion of LLM-related papers on the Computers and Society sub-arXiv, and authors newly publishing on LLMs are more likely to focus on applications and societal impacts than more experienced authors. The study also documents gender and academic/industry disparities in topic focus and a US/China schism in the collaboration network, showing that LLM research both shapes and is shaped by society.
    Abstract There has been a steep recent increase in the number of large language model (LLM) papers, producing a dramatic shift in the scientific landscape which remains largely undocumented through bibliometric analysis. Here, we analyze 388K papers posted on the CS and Stat arXivs, focusing on changes in publication patterns in 2023 vs. 2018-2022. We analyze how the proportion of LLM papers is increasing; the LLM-related topics receiving the most attention; the authors writing LLM papers; how authors' research topics correlate with their backgrounds; the factors distinguishing highly cited LLM papers; and the patterns of international collaboration. We show that LLM research increasingly focuses on societal impacts: there has been an 18x increase in the proportion of LLM-related papers on the Computers and Society sub-arXiv, and authors newly publishing on LLMs are more likely to focus on applications and societal impacts than more experienced authors. LLM research is also shaped by social dynamics: we document gender and academic/industry disparities in the topics LLM authors focus on, and a US/China schism in the collaboration network. Overall, our analysis documents the profound ways in which LLM research both shapes and is shaped by society, attesting to the necessity of sociotechnical lenses.
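A tiny illustration of the kind of bibliometric ratio the paper tracks, the share of LLM-related papers per year; the records below are made up:

```python
import pandas as pd

papers = pd.DataFrame({
    "year":   [2021, 2021, 2022, 2023, 2023, 2023],
    "is_llm": [False, False, True, True, True, False],
})
print(papers.groupby("year")["is_llm"].mean())   # fraction of LLM papers per year
```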

A Dataset and Strong Baselines for Classification of Czech News Texts

  • paper_url: http://arxiv.org/abs/2307.10666
  • repo_url: https://github.com/hynky1999/czech-news-classification-dataset
  • paper_authors: Hynek Kydlíček, Jindřich Libovický
  • for: To enable more rigorous evaluation of Czech NLP models, using news articles from multiple sources and four classification tasks: news source, news category, inferred author's gender, and day of the week.
  • methods: The study builds strong machine-learning baselines on pre-trained transformer models (one baseline style is sketched after this entry) and compares them against selected commercially available large-scale generative language models.
  • results: Human evaluation shows that human performance lags behind the machine-learning baselines built on pre-trained transformers, and a language-specific pre-trained encoder outperforms the selected commercial generative models.
    Abstract Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present CZEch~NEws~Classification~dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.
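One baseline style the paper describes, sketched with Hugging Face Transformers; the Czech encoder name and the label count are illustrative choices, not necessarily the authors' configuration:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ufal/robeczech-base"   # an example Czech encoder, assumed for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=10)

batch = tokenizer(["Vláda dnes schválila nový zákon."],
                  return_tensors="pt", truncation=True, padding=True)
logits = model(**batch).logits
print(logits.shape)                  # (1, num_labels); fine-tune on CZE-NEC labels
```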

Exploring the Landscape of Natural Language Processing Research

  • paper_url: http://arxiv.org/abs/2307.10652
  • repo_url: https://github.com/sebischair/exploring-nlp-research
  • paper_authors: Tim Schopf, Karim Arabi, Florian Matthes
  • for: This study aims to provide a structured overview of NLP research by systematically classifying and analyzing papers in the ACL Anthology, covering a taxonomy of fields of study, recent developments, and directions for future work.
  • methods: The study systematically classifies and analyzes research papers from the ACL Anthology to derive the taxonomy and trend analysis.
  • results: NLP research spans fields such as language modeling, semantic understanding, and speech recognition, and recent developments show broad application of techniques such as deep learning and word embeddings.
    Abstract As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics, identifies trends, and outlines areas for future research remains absent. Contributing to closing this gap, we have systematically classified and analyzed research papers in the ACL Anthology. As a result, we present a structured overview of the research landscape, provide a taxonomy of fields of study in NLP, analyze recent developments in NLP, summarize our findings, and highlight directions for future work.

Generative Language Models on Nucleotide Sequences of Human Genes

  • paper_url: http://arxiv.org/abs/2307.10634
  • repo_url: https://github.com/boun-tabi/generativelm-genes
  • paper_authors: Musa Nuri Ihtiyar, Arzucan Ozgur
  • for: This study develops an autoregressive, transformer-style generative language model for DNA, exploring how well nucleotide sequences can be generated.
  • methods: The study works on nucleotide sequences of human genes rather than whole DNA, compares techniques ranging from simple N-grams (a toy bigram version follows this entry) and RNNs to larger models, and evaluates with real-life tasks beyond classical metrics such as perplexity.
  • results: RNNs performed best while simple techniques like N-grams were also promising; however, the minimal four-letter vocabulary did not noticeably reduce the amount of data these models need, so data requirements remain an issue.
    Abstract Language models, primarily transformer-based ones, obtained colossal success in NLP. To be more precise, studies like BERT in NLU and works such as GPT-3 for NLG are very crucial. DNA sequences are very close to natural language in terms of structure, so if the DNA-related bioinformatics domain is concerned, discriminative models, like DNABert, exist. Yet, the generative side of the coin is mainly unexplored to the best of our knowledge. Consequently, we focused on developing an autoregressive generative language model like GPT-3 for DNA sequences. Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale, focusing on nucleotide sequences of human genes, unique parts in DNA with specific functionalities, instead of the whole DNA. This decision did not change the problem structure a lot due to the fact that both DNA and genes can be seen as 1D sequences consisting of four different nucleotides without losing much information and making too much simplification. First of all, we systematically examined an almost entirely unexplored problem and observed that RNNs performed the best while simple techniques like N-grams were also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural language. How essential using real-life tasks beyond the classical metrics such as perplexity is observed. Furthermore, checking whether the data-hungry nature of these models can be changed through selecting a language with minimal vocabulary size, four owing to four different types of nucleotides, is examined. The reason for reviewing this was that choosing such a language might make the problem easier. However, what we observed in this study was it did not provide that much of a change in the amount of data needed.
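A self-contained toy of the N-gram baseline the paper found promising: a bigram model over the four-nucleotide alphabet, used to sample a new sequence. The training string is a made-up stand-in for real gene sequences:

```python
import random
from collections import Counter, defaultdict

train = "ATGCGATACGATTACAGGCTTAACGGATGCA"   # placeholder nucleotide sequence
counts = defaultdict(Counter)
for a, b in zip(train, train[1:]):           # collect bigram transition counts
    counts[a][b] += 1

def sample_next(ch: str) -> str:
    options, weights = zip(*counts[ch].items())
    return random.choices(options, weights=weights)[0]

seq = ["A"]
for _ in range(20):
    seq.append(sample_next(seq[-1]))
print("".join(seq))                          # a generated 21-nucleotide sequence
```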

Multi-Method Self-Training: Improving Code Generation With Text, And Vice Versa

  • paper_url: http://arxiv.org/abs/2307.10633
  • repo_url: None
  • paper_authors: Shriyash K. Upadhyay, Etan J. Ginsberg
  • for: The paper aims to improve the usability and performance of language models by introducing a novel training method called Multi-Method Self-Training (MMST).
  • methods: MMST trains one of a model's solution methods on the filtered outputs of another, augmenting the strengths and ameliorating the weaknesses of each method (a toy loop is sketched after this entry).
  • results: MMST improves the performance of less performant methods (up to 30%), more performant methods (up to 32.2%), and related yet distinct tasks (up to 10.3%), partly by improving the model's ability to generate rationales. The gains are driven by the use of multiple methods; the paper also explores prompt engineering and anti-correlated performance between methods as ways to make MMST more effective.
    Abstract Large Language Models have many methods for solving the same problem. This introduces novel strengths (different methods may work well for different problems) and weaknesses (it may be difficult for users to know which method to use). In this paper, we introduce Multi-Method Self-Training (MMST), where one method is trained on the filtered outputs of another, allowing us to augment the strengths and ameliorate the weaknesses of each method. Using a 176B parameter model trained on both language and code, we show that MMST can 1) improve the less performant method (up to 30%) making the model easier to use, 2) improve the more performant method (up to 32.2%) making the model more performant, and 3) improve the performance of related but distinct tasks (up to 10.3%) by improving the ability of the model to generate rationales. We then conduct ablation analyses to explore why MMST works. We show that MMST generates more data than traditional self-training, but the improvement in performance is driven by the use of multiple methods. We also analyze prompt-engineering and anti-correlated performance between methods as means of making MMST more effective. We hope the evidence from our paper motivates machine learning researchers to explore ways in which advances in language models allow for new forms of training.
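A hedged sketch of the multi-method self-training loop: verified outputs produced by one method become training data for the other, and vice versa. Every component below is a placeholder stand-in, not the paper's code:

```python
def mmst_round(model, problems, method_a, method_b, is_correct, train):
    """One MMST round; method_a/method_b e.g. code generation vs. textual reasoning."""
    new_data = []
    for p in problems:
        answer_a = method_a(model, p)              # solve with method A
        if is_correct(p, answer_a):                # filter: keep only verified outputs
            new_data.append((p, answer_a, method_b))   # use it to train method B
        answer_b = method_b(model, p)              # and symmetrically for method B
        if is_correct(p, answer_b):
            new_data.append((p, answer_b, method_a))
    train(model, new_data)                         # fine-tune on the cross-method data
    return model
```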

A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos

  • paper_url: http://arxiv.org/abs/2307.10587
  • repo_url: None
  • paper_authors: Anand Kumar Rai, Siddharth D Jaiswal, Animesh Mukherjee
  • for: This paper evaluates how automatic speech recognition (ASR) systems perform across speakers of different regions and demographic groups, motivating more inclusive and robust ASR systems and datasets.
  • methods: The study curates 8740 hours of technical lectures in English (about 9.8K lectures with transcripts) from the popular NPTEL MOOC platform, delivered by instructors representing various parts of Indian demography, and measures the disparity in YouTube Automatic Captions and OpenAI Whisper performance across these demographic traits (the core metric is sketched after this entry).
  • results: Word error rates differ significantly by gender, native region, age, and speech rate, but no disparity is found based on caste; disparity across lecture disciplines is also statistically significant. These results indicate the need for more inclusive and robust ASR systems and more representative datasets for disparity evaluation.
    Abstract Automatic speech recognition (ASR) systems are designed to transcribe spoken language into written text and find utility in a variety of applications including voice assistants and transcription services. However, it has been observed that state-of-the-art ASR systems which deliver impressive benchmark results, struggle with speakers of certain regions or demographics due to variation in their speech properties. In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of $\sim9.8$K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. The dataset is sourced from the very popular NPTEL MOOC platform. We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India. While there exists disparity due to gender, native region, age and speech rate of speakers, disparity based on caste is non-existent. We also observe statistically significant disparity across the disciplines of the lectures. These results indicate the need of more inclusive and robust ASR systems and more representational datasets for disparity evaluation in them.
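The core disparity measurement is per-group word error rate (WER); a minimal sketch using the jiwer package, with made-up transcripts:

```python
from jiwer import wer

groups = {
    "group_a": [("the gradient descent step", "the gradient descent step")],
    "group_b": [("the gradient descent step", "the radiant descent stop")],
}
for name, pairs in groups.items():
    refs, hyps = zip(*pairs)
    print(name, "WER:", round(wer(list(refs), list(hyps)), 3))   # compare across groups
```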

Instruction-following Evaluation through Verbalizer Manipulation

  • paper_url: http://arxiv.org/abs/2307.10558
  • repo_url: None
  • paper_authors: Shiyang Li, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, Hongxia Jin
  • For: The paper aims to evaluate the ability of instruction-tuned language models to follow instructions accurately, particularly in responding to less natural or unexpected instructions.
  • Methods: The proposed evaluation protocol, verbalizer manipulation, instructs the model to verbalize the task label with words that align with the model's priors to different extents; it can be integrated with any classification benchmark to assess the model's reliance on priors and its ability to override them (an illustrative prompt builder follows this entry).
  • Results: The instruction-following abilities of different model families and scales are significantly distinguished by their performance on less natural verbalizers; even the strongest GPT-4 model struggles to perform better than random guessing on the most challenging verbalizer, highlighting the need for further advancements in instruction following.
    Abstract While instruction-tuned models have shown remarkable success in various natural language processing tasks, accurately evaluating their ability to follow instructions remains challenging. Existing benchmarks primarily focus on common instructions that align well with what the model learned during training. However, proficiency in responding to these instructions does not necessarily imply strong ability in instruction following. In this paper, we propose a novel instruction-following evaluation protocol called verbalizer manipulation. It instructs the model to verbalize the task label with words aligning with model priors to different extents, adopting verbalizers from highly aligned (e.g., outputting ``positive'' for positive sentiment), to minimally aligned (e.g., outputting ``negative'' for positive sentiment). Verbalizer manipulation can be seamlessly integrated with any classification benchmark to examine the model's reliance on priors and its ability to override them to accurately follow the instructions. We conduct a comprehensive evaluation of four major model families across nine datasets, employing twelve sets of verbalizers for each of them. We observe that the instruction-following abilities of models, across different families and scales, are significantly distinguished by their performance on less natural verbalizers. Even the strongest GPT-4 model struggles to perform better than random guessing on the most challenging verbalizer, emphasizing the need for continued advancements to improve their instruction-following abilities.
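A self-contained sketch of verbalizer manipulation: the same sentiment task is posed with label words that either agree with or invert a model's priors. The prompt template is illustrative, not the paper's exact wording:

```python
verbalizers = {
    "natural": {1: "positive", 0: "negative"},   # highly aligned with model priors
    "flipped": {1: "negative", 0: "positive"},   # minimally aligned
}

def build_prompt(review: str, verbalizer: dict) -> str:
    return (f'Review: "{review}"\n'
            f'If the sentiment is good, answer "{verbalizer[1]}"; '
            f'if bad, answer "{verbalizer[0]}".\nAnswer:')

for name, v in verbalizers.items():
    print(f"--- {name} ---")
    print(build_prompt("A delightful, moving film.", v))
```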

Gender-tuning: Empowering Fine-tuning for Debiasing Pre-trained Language Models

  • paper_url: http://arxiv.org/abs/2307.10522
  • repo_url: None
  • paper_authors: Somayeh Ghanbarzadeh, Yan Huang, Hamid Palangi, Radames Cruz Moreno, Hamed Khanpour
  • for: To reduce the societal biases that pre-trained language models (PLMs) absorb from unmoderated pre-training corpora, specifically gender bias, without hurting downstream performance.
  • methods: The paper proposes Gender-tuning, which integrates Masked Language Modeling (MLM) training objectives into fine-tuning on downstream tasks' datasets to debias PLMs (a sketch of such a joint objective follows this entry).
  • results: Gender-tuning reduces the average gender-bias scores of PLMs while improving their performance on downstream tasks, using only the downstream datasets and no additional training resources; it works as a deployable debiasing tool for any PLM that supports ordinary fine-tuning.
    Abstract Recent studies have revealed that the widely-used Pre-trained Language Models (PLMs) propagate societal biases from the large unmoderated pre-training corpora. Existing solutions require debiasing training processes and datasets for debiasing, which are resource-intensive and costly. Furthermore, these methods hurt the PLMs' performance on downstream tasks. In this study, we propose Gender-tuning, which debiases the PLMs through fine-tuning on downstream tasks' datasets. For this aim, Gender-tuning integrates Masked Language Modeling (MLM) training objectives into fine-tuning's training process. Comprehensive experiments show that Gender-tuning outperforms the state-of-the-art baselines in terms of average gender bias scores in PLMs while improving PLMs' performance on downstream tasks solely using the downstream tasks' dataset. Also, Gender-tuning is a deployable debiasing tool for any PLM that works with original fine-tuning.
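A hedged sketch of the joint objective suggested by the abstract: an MLM loss computed on the downstream batch is added to the classification loss during fine-tuning. The masking scheme and mixing weight are assumptions:

```python
import torch

def random_mask(input_ids: torch.Tensor, mask_token_id: int, p: float = 0.15):
    """Randomly replace a fraction of tokens with [MASK]; labels keep the originals."""
    mask = torch.rand(input_ids.shape) < p
    masked = input_ids.clone()
    masked[mask] = mask_token_id
    labels = input_ids.clone()
    labels[~mask] = -100                  # ignore unmasked positions in the MLM loss
    return masked, labels

def joint_loss(cls_loss: torch.Tensor, mlm_loss: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    return cls_loss + alpha * mlm_loss    # assumed mixing weight

ids = torch.randint(5, 100, (2, 8))
masked, labels = random_mask(ids, mask_token_id=103)   # 103 = BERT's [MASK] id
print(masked.shape, labels.shape)
```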

Transsion TSUP’s speech recognition system for ASRU 2023 MADASR Challenge

  • paper_url: http://arxiv.org/abs/2307.11778
  • repo_url: None
  • paper_authors: Xiaoxiao Li, Gaosheng Zhang, An Zhu, Weiyong Li, Shuming Fang, Xiaoyue Yang, Jianchao Zhu
  • For: This paper describes the speech recognition system developed by the Transsion Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge; the system focuses on adapting ASR models to low-resource Indian languages and covers all four tracks of the challenge.
  • Methods: For tracks 1 and 2, the acoustic model uses a squeezeformer encoder and a bidirectional transformer decoder trained with a joint CTC-Attention loss (sketched after this entry), with an external KenLM language model applied during TLG beam search decoding. For tracks 3 and 4, pretrained IndicWhisper models are fine-tuned on both the challenge dataset and publicly available datasets, and the whisper beam search decoding is modified to support an external KenLM language model, making better use of the additional text provided by the challenge.
  • Results: The proposed method achieves word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali across the four tracks, and 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri, demonstrating its effectiveness.
    Abstract This paper presents a speech recognition system developed by the Transsion Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge. The system focuses on adapting ASR models for low-resource Indian languages and covers all four tracks of the challenge. For tracks 1 and 2, the acoustic model utilized a squeezeformer encoder and bidirectional transformer decoder with joint CTC-Attention training loss. Additionally, an external KenLM language model was used during TLG beam search decoding. For tracks 3 and 4, pretrained IndicWhisper models were employed and finetuned on both the challenge dataset and publicly available datasets. The whisper beam search decoding was also modified to support an external KenLM language model, which enabled better utilization of the additional text provided by the challenge. The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali language in the four tracks, and WER of 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri language in the four tracks. These results demonstrate the effectiveness of the proposed method.
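A hedged sketch of the joint CTC-Attention objective used for tracks 1 and 2; the interpolation weight and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, targets, input_lens, target_lens,
                             att_logits, att_targets, ctc_weight=0.3):
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens)
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets)   # (N,C,U) vs (N,U)
    return ctc_weight * ctc + (1 - ctc_weight) * att

T, N, C, U = 50, 2, 30, 10                          # frames, batch, classes, label len
ctc_log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, U))               # 0 is reserved for the CTC blank
loss = joint_ctc_attention_loss(ctc_log_probs, targets,
                                torch.full((N,), T), torch.full((N,), U),
                                torch.randn(N, U, C), torch.randint(0, C, (N, U)))
print(loss)
```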

General Debiasing for Multimodal Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2307.10511
  • repo_url: https://github.com/Teng-Sun/GEAR
  • paper_authors: Teng Sun, Juntong Ni, Wenjie Wang, Liqiang Jing, Yinwei Wei, Liqiang Nie
  • for: This paper targets Multimodal Sentiment Analysis (MSA): reducing models' reliance on spurious correlations between multimodal features and sentiment labels to improve Out-Of-Distribution (OOD) generalization.
  • methods: The authors propose a general debiasing framework based on Inverse Probability Weighting (IPW) that adaptively assigns small weights to samples with larger bias (i.e., more severe spurious correlations). It proceeds in two steps: 1) disentangle the robust and biased features in each modality, and 2) use the biased features to estimate each sample's bias; IPW then reduces the effect of heavily biased samples so that robust features are learned for sentiment prediction (a minimal sketch follows this entry).
  • results: Evaluated on the original test sets of two benchmarks plus multiple newly constructed unimodal and multimodal OOD test sets, the framework shows superior generalization ability. Code and data are available at https://github.com/Teng-Sun/GEAR.
    Abstract Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal information for prediction yet unavoidably suffers from fitting the spurious correlations between multimodal features and sentiment labels. For example, if most videos with a blue background have positive labels in a dataset, the model will rely on such correlations for prediction, while "blue background" is not a sentiment-related feature. To address this problem, we define a general debiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD) generalization ability of MSA models by reducing their reliance on spurious correlations. To this end, we propose a general debiasing framework based on Inverse Probability Weighting (IPW), which adaptively assigns small weights to the samples with larger bias (i.e., the severer spurious correlations). The key to this debiasing framework is to estimate the bias of each sample, which is achieved by two steps: 1) disentangling the robust features and biased features in each modality, and 2) utilizing the biased features to estimate the bias. Finally, we employ IPW to reduce the effects of large-biased samples, facilitating robust feature learning for sentiment prediction. To examine the model's generalization ability, we keep the original testing sets on two benchmarks and additionally construct multiple unimodal and multimodal OOD testing sets. The empirical results demonstrate the superior generalization ability of our proposed framework. We have released the code and data to facilitate the reproduction https://github.com/Teng-Sun/GEAR.
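A minimal sketch of the inverse probability weighting (IPW) step: samples whose estimated bias is high receive small weights in the loss. The bias probabilities stand in for the paper's estimates from disentangled biased features:

```python
import torch

def ipw_loss(per_sample_loss: torch.Tensor, bias_prob: torch.Tensor,
             eps: float = 1e-6) -> torch.Tensor:
    """bias_prob: estimated probability a sample is explained by bias, in (0, 1)."""
    weights = 1.0 / (bias_prob + eps)     # larger bias -> smaller relative weight
    weights = weights / weights.sum()     # normalize over the batch
    return (weights * per_sample_loss).sum()

losses = torch.tensor([0.9, 0.4, 1.2, 0.7])
bias = torch.tensor([0.95, 0.20, 0.90, 0.30])   # two heavily biased samples
print(ipw_loss(losses, bias))
```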

SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval

  • paper_url: http://arxiv.org/abs/2307.10488
  • repo_url: https://github.com/thakur-nandan/sprint
  • paper_authors: Nandan Thakur, Kexin Wang, Iryna Gurevych, Jimmy Lin
  • for: This work provides SPRINT, a unified Python toolkit based on Pyserini and Lucene for evaluating zero-shot neural sparse retrieval models.
  • methods: SPRINT offers a common interface with five built-in models: uniCOIL, DeepImpact, SPARTA, TILDEv2, and SPLADEv2; users can add custom models by defining their term weighting method (a generic scoring sketch follows this entry).
  • results: Using SPRINT, the authors establish strong and reproducible zero-shot sparse retrieval baselines on the BEIR benchmark. SPLADEv2 achieves the best average score of 0.470 nDCG@10 among neural sparse retrievers; its gains largely come from sparse representations in which a majority of tokens fall outside the original query and document. The toolkit, models, and experimental data are released at https://github.com/thakur-nandan/sprint.
    Abstract Traditionally, sparse retrieval systems that relied on lexical representations to retrieve documents, such as BM25, dominated information retrieval tasks. With the onset of pre-trained transformer models such as BERT, neural sparse retrieval has led to a new paradigm within retrieval. Despite the success, there has been limited software supporting different sparse retrievers running in a unified, common environment. This hinders practitioners from fairly comparing different sparse models and obtaining realistic evaluation results. Another missing piece is, that a majority of prior work evaluates sparse retrieval models on in-domain retrieval, i.e. on a single dataset: MS MARCO. However, a key requirement in practical retrieval systems requires models that can generalize well to unseen out-of-domain, i.e. zero-shot retrieval tasks. In this work, we provide SPRINT, a unified Python toolkit based on Pyserini and Lucene, supporting a common interface for evaluating neural sparse retrieval. The toolkit currently includes five built-in models: uniCOIL, DeepImpact, SPARTA, TILDEv2 and SPLADEv2. Users can also easily add customized models by defining their term weighting method. Using our toolkit, we establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR. Our results demonstrate that SPLADEv2 achieves the best average score of 0.470 nDCG@10 on BEIR amongst all neural sparse retrievers. In this work, we further uncover the reasons behind its performance gain. We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document which is often crucial for its performance gains, i.e. a limitation among its other sparse counterparts. We provide our SPRINT toolkit, models, and data used in our experiments publicly here at https://github.com/thakur-nandan/sprint.
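To illustrate what "sparse" means here (this is not the SPRINT API): both queries and documents become sparse term-weight vectors, and relevance is their dot product over shared terms, as in uniCOIL/SPLADE-style models. The term-frequency weighting below is a stand-in for learned weights:

```python
from collections import Counter

def term_weights(text: str) -> dict:
    # Stand-in weighting: raw term frequency; learned models replace this step.
    return dict(Counter(text.lower().split()))

def score(query: str, doc: str) -> float:
    q, d = term_weights(query), term_weights(doc)
    return sum(w * d.get(t, 0.0) for t, w in q.items())   # dot product on shared terms

docs = ["sparse retrieval with learned term weights",
        "dense vectors for semantic search"]
print(sorted(docs, key=lambda d: -score("sparse term retrieval", d))[0])
```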

FinGPT: Democratizing Internet-scale Data for Financial Large Language Models

  • paper_url: http://arxiv.org/abs/2307.10485
  • repo_url: https://github.com/ai4finance-foundation/fingpt
  • paper_authors: Xiao-Yang Liu, Guoxuan Wang, Daochen Zha
  • For: FinGPT aims to democratize FinLLMs and stimulate innovation in open finance by providing researchers and practitioners with accessible and transparent resources for developing their FinLLMs.
  • Methods: FinGPT uses an open-sourced and data-centric framework to automate the collection and curation of real-time financial data from >34 diverse sources on the Internet. It also proposes a simple yet effective strategy for fine-tuning FinLLM using the inherent feedback from the market, dubbed Reinforcement Learning with Stock Prices (RLSP). Additionally, it adopts the Low-rank Adaptation (LoRA, QLoRA) method that enables users to customize their own FinLLMs from open-source general-purpose LLMs at a low cost (a sketch of this step follows this entry).
  • Results: FinGPT showcases several applications, including robo-advisor, sentiment analysis for algorithmic trading, and low-code development. These applications demonstrate the potential of FinGPT in unlocking new opportunities in open finance.
    Abstract Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like texts, which may potentially revolutionize the finance industry. However, existing LLMs often fall short in the financial field, which is mainly attributed to the disparities between general text data and financial text data. Unfortunately, there is only a limited number of financial text datasets available (quite small size), and BloombergGPT, the first financial LLM (FinLLM), is close-sourced (only the training logs were released). In light of this, we aim to democratize Internet-scale financial data for LLMs, which is an open challenge due to diverse data sources, low signal-to-noise ratio, and high time-validity. To address the challenges, we introduce an open-sourced and data-centric framework, \textit{Financial Generative Pre-trained Transformer (FinGPT)}, that automates the collection and curation of real-time financial data from >34 diverse sources on the Internet, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. Additionally, we propose a simple yet effective strategy for fine-tuning FinLLM using the inherent feedback from the market, dubbed Reinforcement Learning with Stock Prices (RLSP). We also adopt the Low-rank Adaptation (LoRA, QLoRA) method that enables users to customize their own FinLLMs from open-source general-purpose LLMs at a low cost. Finally, we showcase several FinGPT applications, including robo-advisor, sentiment analysis for algorithmic trading, and low-code development. FinGPT aims to democratize FinLLMs, stimulate innovation, and unlock new opportunities in open finance. The codes are available at https://github.com/AI4Finance-Foundation/FinGPT and https://github.com/AI4Finance-Foundation/FinNLP
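A hedged sketch of the LoRA customization step using the peft library; the base model and hyperparameters are illustrative, not FinGPT's actual configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in open-source base LLM
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"])        # GPT-2's attention projection
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the low-rank adapters are trainable
```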

What can we learn from Data Leakage and Unlearning for Law?

  • paper_url: http://arxiv.org/abs/2307.10476
  • repo_url: None
  • paper_authors: Jaydeep Borkar
  • for: This paper focuses on the privacy risks of large language models (LLMs): they memorize training data, including personally identifiable information (PII) such as emails and phone numbers, and can leak it during inference.
  • methods: Motivated by privacy laws such as the "right to be forgotten", the author studies deleting the users' data points that are most vulnerable to extraction, and examines memorization in fine-tuned models.
  • results: Once the most vulnerable points are deleted, a new set of points becomes vulnerable to extraction; moreover, fine-tuned models leak not only their training data but also data (and PII) memorized during pre-training. Both findings pose significant privacy and legal concerns for companies that use LLMs to offer services, and the paper calls for an interdisciplinary discussion between the AI and law communities on policies to address these issues.
    Abstract Large Language Models (LLMs) have a privacy concern because they memorize training data (including personally identifiable information (PII) like emails and phone numbers) and leak it during inference. A company can train an LLM on its domain-customized data which can potentially also include their users' PII. In order to comply with privacy laws such as the "right to be forgotten", the data points of users that are most vulnerable to extraction could be deleted. We find that once the most vulnerable points are deleted, a new set of points become vulnerable to extraction. So far, little attention has been given to understanding memorization for fine-tuned models. In this work, we also show that not only do fine-tuned models leak their training data but they also leak the pre-training data (and PII) memorized during the pre-training phase. The property of new data points becoming vulnerable to extraction after unlearning and leakage of pre-training data through fine-tuned models can pose significant privacy and legal concerns for companies that use LLMs to offer services. We hope this work will start an interdisciplinary discussion within AI and law communities regarding the need for policies to tackle these issues.
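
As a hedged illustration of how "most vulnerable to extraction" might be operationalized, the sketch below ranks texts by their per-token loss under the model, a common memorization proxy; the GPT-2 stand-in, the toy texts, and the loss-ranking heuristic are assumptions for illustration, not the paper's exact attack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative stand-in
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def sequence_loss(text):
    """Average per-token negative log-likelihood of `text`; an unusually
    low loss is a common proxy for memorization."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids=ids, labels=ids)
    return out.loss.item()

# Rank candidate (hypothetical) training texts by extraction vulnerability.
# The paper observes that after the top-k points are deleted (unlearned) and
# the model updated, re-ranking surfaces a *new* set of vulnerable points.
train_texts = ["alice@example.com called 555-0100", "the cat sat on the mat"]
scores = sorted((sequence_loss(t), t) for t in train_texts)
most_vulnerable = [t for _, t in scores[:1]]
```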

Findings of Factify 2: Multimodal Fake News Detection

  • paper_url: http://arxiv.org/abs/2307.10475
  • repo_url: None
  • paper_authors: S Suryavardan, Shreyash Mishra, Megha Chakraborty, Parth Patwa, Anku Rani, Aman Chadha, Aishwarya Reganti, Amitava Das, Amit Sheth, Manoj Chinnakotla, Asif Ekbal, Srijan Kumar
  • for: Addressing the rapid growth of fake news on social media, this paper presents research on automatically detecting false information and verifying its accuracy.
  • methods: The paper presents a multimodal fact verification and satire news dataset and adopts a comparison-based approach that pairs social media claims with supporting documents, both text and image, divided into 5 classes based on multimodal relations.
  • results: The second iteration of the task had over 60 participants and 9 final test-set submissions; the best performances used DeBERTa for text and Swinv2 and CLIP for images, with the highest F1 score averaged over all five classes reaching 81.82%.
    Abstract With social media usage growing exponentially in the past few years, fake news has also become extremely prevalent. The detrimental impact of fake news emphasizes the need for research focused on automating the detection of false information and verifying its accuracy. In this work, we present the outcome of the Factify 2 shared task, which provides a multi-modal fact verification and satire news dataset, as part of the DeFactify 2 workshop at AAAI'23. The data calls for a comparison-based approach to the task by pairing social media claims with supporting documents, with both text and image, divided into 5 classes based on multi-modal relations. In the second iteration of this task we had over 60 participants and 9 final test-set submissions. The best performances came from the use of DeBERTa for text and Swinv2 and CLIP for image. The highest F1 score averaged for all five classes was 81.82%.
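
One way the comparison-based, five-class setup could be wired up is sketched below, assuming Hugging Face checkpoints for DeBERTa and CLIP (the best-performing model families per the abstract); the fusion-by-concatenation head and all names are illustrative assumptions, not the winning system.

```python
import torch
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

text_enc = AutoModel.from_pretrained("microsoft/deberta-v3-base")
text_tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 768-d [CLS] of the claim/document pair + two 512-d CLIP image embeddings
head = torch.nn.Linear(768 + 2 * 512, 5)   # 5 multimodal-relation classes

def classify(claim, document, claim_image, doc_image):
    """Encode the text pair jointly, encode both images, fuse, classify."""
    t = text_tok(claim, document, return_tensors="pt", truncation=True)
    text_feat = text_enc(**t).last_hidden_state[:, 0]          # (1, 768)
    imgs = clip_proc(images=[claim_image, doc_image], return_tensors="pt")
    img_feat = clip.get_image_features(**imgs)                 # (2, 512)
    fused = torch.cat([text_feat, img_feat.reshape(1, -1)], dim=-1)
    return head(fused)                                         # (1, 5) logits
```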

Improving the Reusability of Pre-trained Language Models in Real-world Applications

  • paper_url: http://arxiv.org/abs/2307.10457
  • repo_url: None
  • paper_authors: Somayeh Ghanbarzadeh, Hamid Palangi, Yan Huang, Radames Cruz Moreno, Hamed Khanpour
  • for: Improving the reusability of pre-trained language models (PLMs) and their effectiveness in real-world applications.
  • methods: Integrating the masked language modeling (MLM) training objective into the fine-tuning process to enhance PLMs' generalization.
  • results: Compared with existing state-of-the-art techniques, Mask-tuning improves PLMs' generalization and real-world effectiveness, with performance gains across different datasets.
    Abstract The reusability of state-of-the-art Pre-trained Language Models (PLMs) is often limited by their generalization problem, where their performance drastically decreases when evaluated on examples that differ from the training dataset, known as Out-of-Distribution (OOD)/unseen examples. This limitation arises from PLMs' reliance on spurious correlations, which work well for frequent example types but not for general examples. To address this issue, we propose a training approach called Mask-tuning, which integrates Masked Language Modeling (MLM) training objectives into the fine-tuning process to enhance PLMs' generalization. Comprehensive experiments demonstrate that Mask-tuning surpasses current state-of-the-art techniques and enhances PLMs' generalization on OOD datasets while improving their performance on in-distribution datasets. The findings suggest that Mask-tuning improves the reusability of PLMs on unseen data, making them more practical and effective for real-world applications.
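
A minimal sketch of the core idea, interleaving an MLM term with the task loss during fine-tuning; the masking rate, loss weight `lam`, and head names are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id, p=0.15):
    """BERT-style random masking: corrupt a fraction p of tokens and return
    MLM labels that score only the masked positions."""
    mask = torch.rand(input_ids.shape) < p
    labels = input_ids.masked_fill(~mask, -100)    # ignore unmasked positions
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    return corrupted, labels

def mask_tuning_loss(backbone, cls_head, mlm_head, batch, mask_token_id, lam=0.5):
    """Joint objective: downstream classification loss plus an auxiliary
    MLM loss computed on the same (masked) inputs."""
    ids, mlm_labels = mask_tokens(batch["input_ids"], mask_token_id)
    hidden = backbone(input_ids=ids).last_hidden_state        # (B, T, H)
    cls_loss = F.cross_entropy(cls_head(hidden[:, 0]), batch["labels"])
    mlm_loss = F.cross_entropy(mlm_head(hidden).transpose(1, 2),
                               mlm_labels, ignore_index=-100)
    return cls_loss + lam * mlm_loss
```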

Integrating a Heterogeneous Graph with Entity-aware Self-attention using Relative Position Labels for Reading Comprehension Model

  • paper_url: http://arxiv.org/abs/2307.10443
  • repo_url: None
  • paper_authors: Shima Foolad, Kourosh Kiani
  • for: This paper aims to improve transformer performance on complex reasoning tasks by addressing the absence of explicit knowledge in the input sequence.
  • methods: The paper proposes a novel attention pattern that integrates reasoning knowledge derived from a heterogeneous graph into the transformer architecture without relying on external knowledge. The pattern comprises three key elements: global-local attention for word tokens, graph attention for entity tokens, and consideration of the relation type between each entity token and word token, which optimizes attention between the two when a relationship exists. It is coupled with LUKE's entity-aware self-attention mechanism via special relative position labels.
  • results: Experimental results show that the model outperforms both the state-of-the-art LUKE-Graph and the baseline LUKE model on the ReCoRD dataset.
    Abstract Despite the significant progress made by transformer models in machine reading comprehension tasks, they still fall short in handling complex reasoning tasks due to the absence of explicit knowledge in the input sequence. To address this limitation, many recent works have proposed injecting external knowledge into the model. However, selecting relevant external knowledge, ensuring its availability, and requiring additional processing steps remain challenging. In this paper, we introduce a novel attention pattern that integrates reasoning knowledge derived from a heterogeneous graph into the transformer architecture without relying on external knowledge. The proposed attention pattern comprises three key elements: global-local attention for word tokens, graph attention for entity tokens that exhibit strong attention towards tokens connected in the graph as opposed to those unconnected, and the consideration of the type of relationship between each entity token and word token. This results in optimized attention between the two if a relationship exists. The pattern is coupled with special relative position labels, allowing it to integrate with LUKE's entity-aware self-attention mechanism. The experimental findings corroborate that our model outperforms both the cutting-edge LUKE-Graph and the baseline LUKE model on the ReCoRD dataset that focuses on commonsense reasoning.
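
One way to read the described pattern as a boolean attention mask is sketched below; the window size, the choice to let entities attend to all words, and the global first-token shortcut are assumptions for illustration, not the authors' implementation (which additionally conditions attention on relation types).

```python
import torch

def build_attention_mask(n_words, n_ents, adj, window=64):
    """Boolean allow-mask over n_words word tokens followed by n_ents entity
    tokens. Words use global-local (sliding-window) attention; entities
    attend to all words and to graph-connected entities. `adj` is an
    (n_ents, n_ents) 0/1 adjacency from the heterogeneous graph."""
    n = n_words + n_ents
    allow = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n_words):                    # global-local word pattern
        lo, hi = max(0, i - window), min(n_words, i + window + 1)
        allow[i, lo:hi] = True
        allow[i, 0] = True                      # a global token, e.g. [CLS]
    allow[n_words:, :n_words] = True            # entities see every word
    allow[n_words:, n_words:] = adj.bool()      # ...and connected entities
    return allow  # convert to an additive -inf mask on the attention logits

adj = torch.tensor([[0, 1], [1, 0]])            # two connected entities (toy)
mask = build_attention_mask(n_words=8, n_ents=2, adj=adj, window=2)
```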

Thrust: Adaptively Propels Large Language Models with External Knowledge

  • paper_url: http://arxiv.org/abs/2307.10442
  • repo_url: None
  • paper_authors: Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Jianshu Chen
  • for: This paper studies how to use the knowledge encoded in large-scale pre-trained language models (PTLMs) efficiently and when to retrieve external knowledge.
  • methods: The paper proposes Instance-level Adaptive Propulsion of External Knowledge (IAPEK), which retrieves external knowledge only when the PTLM lacks sufficient knowledge for an instance. Knowledgeability is measured with a novel metric, Thrust, which leverages the representation distribution of a small number of seen instances.
  • results: Experiments show that Thrust is a good measurement of a PTLM's instance-level knowledgeability. Using the Thrust score as a retrieval indicator achieves significantly higher cost-efficiency on 88% of the evaluated tasks, with a 26% average performance improvement. These findings inform the real-world practice of knowledge-enhanced LMs under limited knowledge-seeking budgets.
    Abstract Although large-scale pre-trained language models (PTLMs) are shown to encode rich knowledge in their model parameters, the inherent knowledge in PTLMs can be opaque or static, making external knowledge necessary. However, the existing information retrieval techniques could be costly and may even introduce noisy and sometimes misleading knowledge. To address these challenges, we propose the instance-level adaptive propulsion of external knowledge (IAPEK), where we only conduct the retrieval when necessary. To achieve this goal, we propose measuring whether a PTLM contains enough knowledge to solve an instance with a novel metric, Thrust, which leverages the representation distribution of a small number of seen instances. Extensive experiments demonstrate that thrust is a good measurement of PTLM models' instance-level knowledgeability. Moreover, we can achieve significantly higher cost-efficiency with the Thrust score as the retrieval indicator than the naive usage of external knowledge on 88% of the evaluated tasks with 26% average performance improvement. Such findings shed light on the real-world practice of knowledge-enhanced LMs with a limited knowledge-seeking budget due to computation latency or costs.
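
A rough, hedged approximation of the Thrust intuition: score how strongly a query representation is pulled toward size-weighted clusters of seen instances, with inverse-square distance weighting. The exact formula in the paper differs in its details, and all names here are illustrative.

```python
import torch

def thrust_like_score(query_vec, centroids, sizes):
    """Net 'pull' on `query_vec` (d,) from per-class cluster centroids (K, d)
    of seen instances, each weighted by cluster size and inverse squared
    distance. A small score suggests the model lacks knowledge for this
    instance, so external retrieval should be triggered."""
    diffs = centroids - query_vec                       # (K, d) pull directions
    dists = diffs.norm(dim=-1).clamp_min(1e-6)          # (K,)
    pulls = (sizes / dists**2).unsqueeze(-1) * diffs / dists.unsqueeze(-1)
    return pulls.sum(dim=0).norm().item()               # magnitude of net pull

centroids = torch.randn(4, 16)      # toy per-class centroids of seen instances
sizes = torch.tensor([10., 7., 12., 5.])
score = thrust_like_score(torch.randn(16), centroids, sizes)
```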

PharmacyGPT: The AI Pharmacist

  • paper_url: http://arxiv.org/abs/2307.10432
  • repo_url: None
  • paper_authors: Zhengliang Liu, Zihao Wu, Mengxuan Hu, Bokai Zhao, Lin Zhao, Tianyi Zhang, Haixing Dai, Xianyan Chen, Ye Shen, Sheng Li, Brian Murray, Tianming Liu, Andrea Sikora
  • for: The paper aims to assess the capabilities of large language models (LLMs) in emulating the role of clinical pharmacists, with potential applications in patient care and the development of future AI-driven healthcare solutions.
  • methods: The paper uses real data from the intensive care unit (ICU) at the University of North Carolina Chapel Hill (UNC) Hospital to evaluate PharmacyGPT, a novel framework that leverages LLMs to generate comprehensible patient clusters, formulate medication plans, and forecast patient outcomes.
  • results: The analysis offers a comprehensive evaluation of PharmacyGPT's performance and valuable insights into the potential applications and limitations of LLMs in clinical pharmacy, with implications for both patient care and future AI-driven healthcare solutions, contributing to the ongoing discourse on integrating artificial intelligence in healthcare settings.
    Abstract In this study, we introduce PharmacyGPT, a novel framework to assess the capabilities of large language models (LLMs) such as ChatGPT and GPT-4 in emulating the role of clinical pharmacists. Our methodology encompasses the utilization of LLMs to generate comprehensible patient clusters, formulate medication plans, and forecast patient outcomes. We conduct our investigation using real data acquired from the intensive care unit (ICU) at the University of North Carolina Chapel Hill (UNC) Hospital. Our analysis offers valuable insights into the potential applications and limitations of LLMs in the field of clinical pharmacy, with implications for both patient care and the development of future AI-driven healthcare solutions. By evaluating the performance of PharmacyGPT, we aim to contribute to the ongoing discourse surrounding the integration of artificial intelligence in healthcare settings, ultimately promoting the responsible and efficacious use of such technologies.

LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs

  • paper_url: http://arxiv.org/abs/2307.10168
  • repo_url: None
  • paper_authors: Tongshuang Wu, Haiyi Zhu, Maya Albayrak, Alexis Axon, Amanda Bertsch, Wenxing Deng, Ziqi Ding, Bill Guo, Sireesh Gururaja, Tzu-Sheng Kuo, Jenny T. Liang, Ryan Liu, Ihita Mandal, Jeremiah Milbauer, Xiaolin Ni, Namrata Padmanabhan, Subhashini Ramkumar, Alexis Sudjianto, Jordan Taylor, Ying-Jui Tseng, Patricia Vaidos, Zhijin Wu, Wei Wu, Chenyang Yang
  • for: To study whether LLMs can replicate human behavior in crowdsourcing tasks.
  • methods: Using modern LLMs to simulate crowdworkers' abilities within "human computation algorithms" and evaluating how well they succeed.
  • results: Modern LLMs can simulate some crowdworker abilities in complex crowdsourcing pipelines, but the level of success varies with requesters' understanding of LLM capabilities, the specific skills required by sub-tasks, and the optimal interaction modality for performing them. The study also notes that humans and LLMs differ in their sensitivity to instructions, stresses the importance of human-facing safeguards for LLMs, and discusses training humans and LLMs with complementary skill sets.
    Abstract LLMs have shown promise in replicating human-like behavior in crowdsourcing tasks that were previously thought to be exclusive to human abilities. However, current efforts focus mainly on simple atomic tasks. We explore whether LLMs can replicate more complex crowdsourcing pipelines. We find that modern LLMs can simulate some of crowdworkers' abilities in these "human computation algorithms," but the level of success is variable and influenced by requesters' understanding of LLM capabilities, the specific skills required for sub-tasks, and the optimal interaction modality for performing these sub-tasks. We reflect on human and LLMs' different sensitivities to instructions, stress the importance of enabling human-facing safeguards for LLMs, and discuss the potential of training humans and LLMs with complementary skill sets. Crucially, we show that replicating crowdsourcing pipelines offers a valuable platform to investigate (1) the relative strengths of LLMs on different tasks (by cross-comparing their performances on sub-tasks) and (2) LLMs' potential in complex tasks, where they can complete part of the tasks while leaving others to humans.

Exploring Transformer Extrapolation

  • paper_url: http://arxiv.org/abs/2307.10156
  • repo_url: None
  • paper_authors: Zhen Qin, Yiran Zhong, Hui Deng
  • for: This study investigates the conditions for length extrapolation in transformers using Relative Positional Encodings (RPEs) and derives a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps.
  • methods: Thorough mathematical and empirical analysis is used to determine the conditions for length extrapolation, from which two practices are derived and examined in language modeling tasks on a variety of corpora.
  • results: The study finds that a transformer is certain to possess length extrapolation as long as the series corresponding to the RPE's exponential converges. Extensive experiments on the Wikitext-103, Books, Github, and WikiBook datasets demonstrate the viability of the discovered conditions, and TRF shows consistently matched trends with the Empirical Receptive Field (ERF) across different models.
    Abstract Length extrapolation has attracted considerable attention recently since it allows transformers to be tested on longer sequences than those used in training. Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs). While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated. This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis. We discover that a transformer is certain to possess this property as long as the series that corresponds to the RPE's exponential converges. Two practices are derived from the conditions and examined in language modeling tasks on a variety of corpora. As a bonus from the conditions, we derive a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps. Extensive experiments are conducted on the Wikitext-103, Books, Github, and WikiBook datasets to demonstrate the viability of our discovered conditions. We also compare TRF to Empirical Receptive Field (ERF) across different models, showing consistently matched trends on the aforementioned datasets. The code is available at https://github.com/OpenNLPLab/Rpe.
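
In assumed notation (not copied from the paper), the stated condition can be read as follows: with a relative positional bias added to the attention logits, length extrapolation is guaranteed whenever the exponential series of the RPE converges.

```latex
% Hedged formalization of the stated condition; r_{i-j} denotes the
% relative positional bias added to the attention logits:
\[
  a_{ij} \;\propto\; \exp\!\left(\frac{q_i^{\top} k_j}{\sqrt{d}} + r_{i-j}\right),
\]
% and the condition for guaranteed length extrapolation is that the
% exponential series of the RPE converges:
\[
  \sum_{n=0}^{\infty} e^{\,r_n} \;<\; \infty .
\]
```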

Gradient Sparsification For Masked Fine-Tuning of Transformers

  • paper_url: http://arxiv.org/abs/2307.10098
  • repo_url: None
  • paper_authors: James O’ Neill, Sourav Dutta
  • for: Improving training speed and performance when fine-tuning pretrained language models for downstream tasks.
  • methods: GradDrop and its variants, which stochastically mask gradients during the backward pass to regularize pretrained language models.
  • results: On the multilingual XGLUE benchmark with XLMR-Large, GradDrop is competitive with methods that use additional translated data for intermediate pretraining and outperforms standard fine-tuning and gradual unfreezing. A post-analysis shows that GradDrop also improves performance on languages it was not trained on, such as under-resourced languages.
    Abstract Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks. Fine-tuning can be achieved by freezing gradients of the pretrained network and only updating gradients of a newly added classification layer, or by performing gradient updates on all parameters. Gradual unfreezing makes a trade-off between the two by gradually unfreezing gradients of whole layers during training. This has been an effective strategy to trade off storage and training speed against generalization performance. However, it is not clear whether gradually unfreezing layers throughout training is optimal, compared to sparse variants of gradual unfreezing which may improve fine-tuning performance. In this paper, we propose to stochastically mask gradients to regularize pretrained language models for improving overall fine-tuned performance. We introduce GradDrop and variants thereof, a class of gradient sparsification methods that mask gradients during the backward pass, acting as gradient noise. GradDrop is sparse and stochastic unlike gradual freezing. Extensive experiments on the multilingual XGLUE benchmark with XLMR-Large show that GradDrop is competitive against methods that use additional translated data for intermediate pretraining and outperforms standard fine-tuning and gradual unfreezing. A post-analysis shows how GradDrop improves performance with languages it was not trained on, such as under-resourced languages.
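
A minimal PyTorch sketch of the GradDrop idea, registering hooks that stochastically zero gradient entries during the backward pass; the Bernoulli mask and the 1/(1-p) rescaling (keeping the gradient's expectation unchanged) are illustrative choices, as the paper's variants differ in how the mask is drawn.

```python
import torch

def apply_graddrop(model, p=0.1):
    """Register hooks that zero a random fraction p of each parameter's
    gradient entries on every backward pass, acting as gradient noise."""
    def make_hook(p):
        def hook(grad):
            keep = torch.bernoulli(torch.full_like(grad, 1.0 - p))
            return grad * keep / (1.0 - p)   # rescale to preserve expectation
        return hook
    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(make_hook(p))

model = torch.nn.Linear(10, 2)               # stand-in for a pretrained LM
apply_graddrop(model, p=0.1)                 # call once before fine-tuning
```

Unlike gradual unfreezing, which switches whole layers on or off over time, this masking is sparse and stochastic at the level of individual gradient entries, which is the distinction the abstract draws.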