cs.CL - 2023-08-08

Unmasking Nationality Bias: A Study of Human Perception of Nationalities in AI-Generated Articles

  • paper_url: http://arxiv.org/abs/2308.04346
  • repo_url: None
  • paper_authors: Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, Ting-Hao ‘Kenneth’ Huang, Shomir Wilson
  • for: To detect nationality bias in natural language processing (NLP) models using human evaluation methods, with a view to the fairness and justice of AI systems.
  • methods: A two-step mixed-methods approach combining quantitative and qualitative analysis to identify and understand the impact of nationality bias in a text generation model.
  • results: Biased NLP models tend to replicate and amplify existing societal biases, which can translate to discrimination in sociotechnical settings. Open-ended interviews and thematic analysis further show that reading such articles can shift readers' perceptions of a country, emphasizing AI's impact on society and the need to correct biases in AI systems.
    Abstract We investigate the potential for nationality biases in natural language processing (NLP) models using human evaluation methods. Biased NLP models can perpetuate stereotypes and lead to algorithmic discrimination, posing a significant challenge to the fairness and justice of AI systems. Our study employs a two-step mixed-methods approach that includes both quantitative and qualitative analysis to identify and understand the impact of nationality bias in a text generation model. Through our human-centered quantitative analysis, we measure the extent of nationality bias in articles generated by AI sources. We then conduct open-ended interviews with participants, performing qualitative coding and thematic analysis to understand the implications of these biases on human readers. Our findings reveal that biased NLP models tend to replicate and amplify existing societal biases, which can translate to harm if used in a sociotechnical setting. The qualitative analysis from our interviews offers insights into the experience readers have when encountering such articles, highlighting the potential to shift a reader's perception of a country. These findings emphasize the critical role of public perception in shaping AI's impact on society and the need to correct biases in AI systems.

Towards an AI to Win Ghana’s National Science and Maths Quiz

  • paper_url: http://arxiv.org/abs/2308.04333
  • repo_url: https://github.com/nsmq-ai/nsmqai
  • paper_authors: George Boateng, Jonathan Abrefah Mensah, Kevin Takyi Yeboah, William Edor, Andrew Kojo Mensah-Onumah, Naafi Dasana Ibrahim, Nana Sam Yeboah
  • for: The paper is written to explore the possibility of building an AI system that can compete in Ghana’s National Science and Maths Quiz (NSMQ) and potentially win.
  • methods: The paper describes an open-source project that is building AI to compete in the NSMQ, with a focus on speech-to-text, text-to-speech, question-answering, and human-computer interaction.
  • results: The paper provides an overview of the progress made thus far in the project, including the development of the AI system and the next steps toward its planned launch and debut in October for NSMQ 2023.
    Abstract Can an AI win Ghana's National Science and Maths Quiz (NSMQ)? That is the question we seek to answer in the NSMQ AI project, an open-source project that is building AI to compete live in the NSMQ and win. The NSMQ is an annual live science and mathematics competition for senior secondary school students in Ghana in which 3 teams of 2 students compete by answering questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages until a winning team is crowned for that year. The NSMQ is an exciting live quiz competition with interesting technical challenges across speech-to-text, text-to-speech, question-answering, and human-computer interaction. In this ongoing work that began in January 2023, we give an overview of the project, describe each of the teams, progress made thus far, and the next steps toward our planned launch and debut of the AI in October for NSMQ 2023. An AI that conquers this grand challenge can have real-world impact on education such as enabling millions of students across Africa to have one-on-one learning support from this AI.

Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2308.04306
  • repo_url: None
  • paper_authors: Cheng Yang, Wenye Zhao, Zhiyue Liu, Qingbao Huang
  • for: To provide a comprehensive review of research advances in the application of deep learning for knowledge injection in metaphor recognition tasks.
  • methods: The paper systematically summarizes and generalizes the mainstream knowledge and knowledge injection principles, and reviews the datasets, evaluation metrics, and benchmark models used in metaphor recognition tasks.
  • results: Existing knowledge injection methods achieve relatively high recognition accuracy on metaphor recognition tasks, but open issues remain, such as the quality and reliability of the injected knowledge; the paper concludes with an outlook on future research directions.
    Abstract The history of metaphor research also marks the evolution of knowledge infusion research. With the continued advancement of deep learning techniques in recent years, the natural language processing community has shown great interest in applying knowledge to successful results in metaphor recognition tasks. Although there has been a gradual increase in the number of approaches involving knowledge injection in the field of metaphor recognition, there is a lack of a complete review article on knowledge injection based approaches. Therefore, the goal of this paper is to provide a comprehensive review of research advances in the application of deep learning for knowledge injection in metaphor recognition tasks. In this paper, we systematically summarize and generalize the mainstream knowledge and knowledge injection principles, as well as review the datasets, evaluation metrics, and benchmark models used in metaphor recognition tasks. Finally, we explore the current issues facing knowledge injection methods and provide an outlook on future research directions.

Comparative Analysis of the wav2vec 2.0 Feature Extractor

  • paper_url: http://arxiv.org/abs/2308.04286
  • repo_url: None
  • paper_authors: Peter Vieting, Ralf Schlüter, Hermann Ney
  • for: To study neural raw-waveform feature extractors (FEs) as a replacement for traditional handcrafted feature extraction pipelines, enabling more consistent modeling from speech to transcribed text.
  • methods: Examines the convolutional FE of the wav2vec 2.0 model, which operates directly on the speech waveform but has not yet been studied extensively, within a connectionist temporal classification (CTC) ASR model, comparing it to an alternative neural FE.
  • results: Both neural FEs are competitive with traditional FEs on the LibriSpeech benchmark. An analysis of the individual components and the learned filters shows that a set of bandpass filters provides the most important information for the ASR system.
    Abstract Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neural raw waveform feature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model, which has recently gained large popularity, uses a convolutional FE which operates directly on the speech waveform. However, it is not yet studied extensively in the literature. In this work, we study its capability to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it to an alternative neural FE. We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components. Furthermore, we analyze the learned filters and show that the most important information for the ASR system is obtained by a set of bandpass filters.

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

  • paper_url: http://arxiv.org/abs/2308.04255
  • repo_url: None
  • paper_authors: Luka Terčon, Nikola Ljubešić
  • for: To present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, based on the Stanza natural language processing pipeline.
  • methods: Describes the main improvements in CLASSLA-Stanza with respect to Stanza, and gives a detailed description of the model training process for the 2.1 release of the pipeline.
  • results: The pipeline exhibits consistently high performance across all supported languages and varieties, and outperforms or expands its parent pipeline Stanza at all supported tasks. New functionality enabling efficient processing of web data, and the reasons for its implementation, are also presented.
    Abstract We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by the pipeline for different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or expands its parent pipeline Stanza at all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data and the reasons that led to its implementation.

OpinionConv: Conversational Product Search with Grounded Opinions

  • paper_url: http://arxiv.org/abs/2308.04226
  • repo_url: None
  • paper_authors: Vahid Sadiri Javadi, Martin Potthast, Lucie Flek
  • for: This paper aims to address the problem of training conversational AI to simulate sales conversations, by leveraging product reviews as a rich source of product opinions that ground the AI in true subjective narratives.
  • methods: The paper uses product reviews as a source of product opinions to train a conversational AI model called OpinionConv, which can simulate sales conversations.
  • results: The paper conducts several user studies to validate the generated conversations, showing that the generated opinions are perceived as realistic; the assessors also confirm the importance of opinions as an informative basis for decision-making.
    Abstract When searching for products, the opinions of others play an important role in making informed decisions. Subjective experiences about a product can be a valuable source of information. This is also true in sales conversations, where a customer and a sales assistant exchange facts and opinions about products. However, training an AI for such conversations is complicated by the fact that language models do not possess authentic opinions for their lack of real-world experience. We address this problem by leveraging product reviews as a rich source of product opinions to ground conversational AI in true subjective narratives. With OpinionConv, we develop the first conversational AI for simulating sales conversations. To validate the generated conversations, we conduct several user studies showing that the generated opinions are perceived as realistic. Our assessors also confirm the importance of opinions as an informative basis for decision-making.

Studying Socially Unacceptable Discourse Classification (SUD) through different eyes: “Are we on the same page?”

  • paper_url: http://arxiv.org/abs/2308.04180
  • repo_url: https://github.com/mlinardicyu/sud_study_different_eyes
  • paper_authors: Bruno Machado Carneiro, Michele Linardi, Julien Longhi
  • for: To study the characterization and detection of Socially Unacceptable Discourse (SUD) in online text.
  • methods: The authors first build a novel corpus containing a large variety of manually annotated texts from the different online sources used so far in state-of-the-art machine learning (ML) SUD detection solutions, allowing them to test the generalization ability of SUD classifiers trained on the same SUD categories but in different contexts.
  • results: The paper analyzes how different annotation modalities influence SUD learning, discusses open challenges and research directions, and provides data insights that can support domain experts in the annotation task.
    Abstract We study Socially Unacceptable Discourse (SUD) characterization and detection in online text. We first build and present a novel corpus that contains a large variety of manually annotated texts from different online sources used so far in state-of-the-art Machine learning (ML) SUD detection solutions. This global context allows us to test the generalization ability of SUD classifiers that acquire knowledge around the same SUD categories, but from different contexts. From this perspective, we can analyze how (possibly) different annotation modalities influence SUD learning by discussing open challenges and open research directions. We also provide several data insights which can support domain experts in the annotation task.

On Monotonic Aggregation for Open-domain QA

  • paper_url: http://arxiv.org/abs/2308.04176
  • repo_url: https://github.com/yeonseokjeong/judge-specialist
  • paper_authors: Sang-eun Han, Yeonseok Jeong, Seung-won Hwang, Kyungjae Lee
  • for: answering user questions on unrestricted knowledge sources
  • methods: Judge-Specialist framework with specialist retrievers/readers and a dedicated language model to select the final answer
  • results: outperforms state-of-the-art multi-source QA methods on Natural Questions, and robustly preserves monotonicity against noise from speech recognition
    Abstract Question answering (QA) is a critical task for speech-based retrieval from knowledge sources, by sifting only the answers without requiring to read supporting documents. Specifically, open-domain QA aims to answer user questions on unrestricted knowledge sources. Ideally, adding a source should not decrease the accuracy, but we find this property (denoted as "monotonicity") does not hold for current state-of-the-art methods. We identify the cause, and based on that we propose Judge-Specialist framework. Our framework consists of (1) specialist retrievers/readers to cover individual sources, and (2) judge, a dedicated language model to select the final answer. Our experiments show that our framework not only ensures monotonicity, but also outperforms state-of-the-art multi-source QA methods on Natural Questions. Additionally, we show that our models robustly preserve the monotonicity against noise from speech recognition. We publicly release our code and setting.
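A toy way to see why per-source specialists plus a selector can preserve monotonicity (an illustrative argmax sketch, not the paper's judge, which is a dedicated language model):

```python
def select_answer(candidates):
    """Select the final answer from per-source specialist candidates.

    Each specialist covers one knowledge source and emits an
    (answer, score) pair. Taking the highest-scoring candidate means
    that adding a source can only add candidates, so the selected
    score never decreases -- the monotonicity property the paper asks
    of multi-source QA.
    """
    return max(candidates, key=lambda c: c["score"])
```

In the paper itself the judge is a language model rather than an argmax over scores, but the structural argument is the same: selection over per-source candidates cannot get worse when a source is added.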

  • paper_url: http://arxiv.org/abs/2308.04138
  • repo_url: None
  • paper_authors: Dietrich Trautmann
  • for: To tackle extensive legal document classification tasks, which are difficult due to their intricate domain-specific language and considerable length.
  • methods: Prompt chaining, a strategy that decomposes a complex task into smaller, manageable components: first a concise summary of the original document is created, then a semantic search retrieves related exemplar texts and their annotations from a training corpus, and finally a few-shot prompt assigns a label based on the task, leveraging in-context learning.
  • results: Prompt chaining not only improves performance over zero-shot prompting, but also surpasses the micro-F1 score achieved by larger models, such as zero-shot ChatGPT, while using smaller models.
    Abstract Prompting is used to guide or steer a language model in generating an appropriate response that is consistent with the desired outcome. Chaining is a strategy used to decompose complex tasks into smaller, manageable components. In this study, we utilize prompt chaining for extensive legal document classification tasks, which present difficulties due to their intricate domain-specific language and considerable length. Our approach begins with the creation of a concise summary of the original document, followed by a semantic search for related exemplar texts and their corresponding annotations from a training corpus. Finally, we prompt for a label - based on the task - to assign, by leveraging the in-context learning from the few-shot prompt. We demonstrate that through prompt chaining, we can not only enhance the performance over zero-shot, but also surpass the micro-F1 score achieved by larger models, such as ChatGPT zero-shot, using smaller models.
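The three-step chain described above can be sketched as follows; `llm` and `embed` are caller-supplied stand-ins for a language model and a sentence encoder (hypothetical names, not the paper's components):

```python
def dot(u, v):
    """Inner product used as a stand-in semantic-similarity score."""
    return sum(a * b for a, b in zip(u, v))

def classify_by_chain(document, corpus, llm, embed, k=3):
    """Prompt chaining for long legal documents:
    (1) summarize the document,
    (2) retrieve the k nearest annotated exemplars by embedding similarity,
    (3) prompt for a label with the exemplars as few-shot context.
    """
    summary = llm(f"Summarize this legal document:\n{document}")
    q = embed(summary)
    exemplars = sorted(corpus, key=lambda ex: -dot(q, embed(ex["summary"])))[:k]
    shots = "\n".join(f"Text: {ex['summary']}\nLabel: {ex['label']}"
                      for ex in exemplars)
    return llm(f"{shots}\nText: {summary}\nLabel:")
```

Decomposing the task this way keeps each prompt short enough for a small model, which is what lets the chained small model compete with a larger zero-shot one.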

Social Media, Topic Modeling and Sentiment Analysis in Municipal Decision Support

  • paper_url: http://arxiv.org/abs/2308.04124
  • repo_url: None
  • paper_authors: Miloš Švaňa
  • for: This paper is written for municipal decision-makers who want to incorporate social media sentiment into their decision-making processes.
  • methods: The paper proposes a three-step framework for processing social media posts: determining the sentiment polarity of each post, identifying prevalent topics, and aggregating the sentiment information. Fuzzy numbers are used to represent the sentiment in a richer way and capture the diversity of opinions expressed on social media.
  • results: The paper demonstrates the framework on tweets published from Ostrava, Czechia over a period of about two months, showing how fuzzy numbers represent sentiment in a more nuanced way and capture the diversity of opinions expressed on social media.
    Abstract Many cities around the world are aspiring to become smart cities. However, smart initiatives often give little weight to the opinions of average citizens. Social media are one of the most important sources of citizen opinions. This paper presents a prototype of a framework for processing social media posts with municipal decision-making in mind. The framework consists of a sequence of three steps: (1) determining the sentiment polarity of each social media post (2) identifying prevalent topics and mapping these topics to individual posts, and (3) aggregating these two pieces of information into a fuzzy number representing the overall sentiment expressed towards each topic. Optionally, the fuzzy number can be reduced into a tuple of two real numbers indicating the "amount" of positive and negative opinion expressed towards each topic. The framework is demonstrated on tweets published from Ostrava, Czechia over a period of about two months. This application illustrates how fuzzy numbers represent sentiment in a richer way and capture the diversity of opinions expressed on social media.
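Step (3) of the framework can be sketched as follows; the paper does not fix the exact fuzzy-number construction, so a simple triangular fuzzy number over the observed polarities is assumed here:

```python
def aggregate_sentiment(polarities):
    """Aggregate per-post sentiment polarities (each in [-1, 1]) for one
    topic into a triangular fuzzy number (low, peak, high).

    The triangular shape is an assumption for illustration: the support
    spans the observed opinions and the peak sits at their mean, so the
    spread of the fuzzy number reflects the diversity of opinions.
    """
    low, high = min(polarities), max(polarities)
    peak = sum(polarities) / len(polarities)
    return (low, peak, high)

def to_pos_neg(polarities):
    """Optional reduction to a (positive, negative) pair of opinion
    'amounts' expressed towards the topic."""
    pos = sum(p for p in polarities if p > 0)
    neg = -sum(p for p in polarities if p < 0)
    return (pos, neg)
```

A scalar average would collapse a polarized topic and a lukewarm one to the same number; keeping the full (low, peak, high) triple is what lets the fuzzy representation distinguish them.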

Collective Human Opinions in Semantic Textual Similarity

  • paper_url: http://arxiv.org/abs/2308.04114
  • repo_url: https://github.com/yuxiaw/usts
  • paper_authors: Yuxia Wang, Shimin Tao, Ning Xie, Hao Yang, Timothy Baldwin, Karin Verspoor
  • for: To study the uncertainty inherent in semantic textual similarity (STS) and the distribution of collective human judgements.
  • methods: Introduces USTS, the first uncertainty-aware STS dataset, with ~15,000 Chinese sentence pairs and 150,000 labels, to study variation in collective human judgements.
  • results: Analysis reveals that neither a scalar nor a single Gaussian adequately fits a set of observed judgements. Current STS models cannot capture the variance caused by human disagreement on individual instances; they instead reflect predictive confidence over the aggregate dataset.
    Abstract Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as the gold standard. Averaging masks the true distribution of human opinions on examples of low agreement, and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset with ~15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgements adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.

I-WAS: a Data Augmentation Method with GPT-2 for Simile Detection

  • paper_url: http://arxiv.org/abs/2308.04109
  • repo_url: https://github.com/cyzLoveDream/I-was
  • paper_authors: Yongzhu Chang, Rongsheng Zhang, Jiashu Pu
  • for: To improve simile detection for natural language processing (NLP) applications, particularly in literature.
  • methods: A data augmentation method based on word replacement and sentence completion using the GPT-2 language model, with an iterative process (I-WAS) designed to improve the quality of the augmented sentences.
  • results: Experiments on a newly compiled corpus containing a more diverse set of simile forms demonstrate the effectiveness of the proposed data augmentation method for simile detection.
    Abstract Simile detection is a valuable task for many natural language processing (NLP)-based applications, particularly in the field of literature. However, existing research on simile detection often relies on corpora that are limited in size and do not adequately represent the full range of simile forms. To address this issue, we propose a simile data augmentation method based on \textbf{W}ord replacement And Sentence completion using the GPT-2 language model. Our iterative process called I-WAS, is designed to improve the quality of the augmented sentences. To better evaluate the performance of our method in real-world applications, we have compiled a corpus containing a more diverse set of simile forms for experimentation. Our experimental results demonstrate the effectiveness of our proposed data augmentation method for simile detection.
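The word-replacement half of the augmentation can be sketched as below; the GPT-2 sentence-completion and iterative filtering (I-WAS) parts of the pipeline are omitted:

```python
def word_replace(sentence_tokens, target_index, candidates):
    """Generate augmented simile sentences by swapping the token at
    target_index (e.g. the simile's vehicle) for each candidate word.

    A sketch of the word-replacement step only; in the paper the
    candidates come from a language model and the augmented sentences
    are further completed and filtered iteratively.
    """
    augmented = []
    for candidate in candidates:
        tokens = list(sentence_tokens)       # copy, leave input intact
        tokens[target_index] = candidate
        augmented.append(" ".join(tokens))
    return augmented
```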

DataTales: Investigating the use of Large Language Models for Authoring Data-Driven Articles

  • paper_url: http://arxiv.org/abs/2308.04076
  • repo_url: None
  • paper_authors: Nicole Sultanum, Arjun Srinivasan
  • for: To investigate the feasibility and perceived value of leveraging contemporary large language models (LLMs) to support authors of data-driven articles and expedite the writing process.
  • methods: A prototype system, DataTales, that leverages an LLM to generate textual narratives accompanying a given chart.
  • results: Using DataTales as a design probe, a qualitative study with 11 professionals distills affordances and opportunities for further integrating LLMs as valuable data-driven article authoring assistants.
    Abstract Authoring data-driven articles is a complex process requiring authors to not only analyze data for insights but also craft a cohesive narrative that effectively communicates the insights. Text generation capabilities of contemporary large language models (LLMs) present an opportunity to assist the authoring of data-driven articles and expedite the writing process. In this work, we investigate the feasibility and perceived value of leveraging LLMs to support authors of data-driven articles. We designed a prototype system, DataTales, that leverages a LLM to generate textual narratives accompanying a given chart. Using DataTales as a design probe, we conducted a qualitative study with 11 professionals to evaluate the concept, from which we distilled affordances and opportunities to further integrate LLMs as valuable data-driven article authoring assistants.

The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings

  • paper_url: http://arxiv.org/abs/2308.04052
  • repo_url: https://github.com/TimMerino1710/five-dollar-model
  • paper_authors: Timothy Merino, Roman Negri, Dipika Rajesh, M Charity, Julian Togelius
  • for: To present the five-dollar model, a lightweight text-to-image generative architecture that generates low-dimensional images from an encoded text prompt.
  • methods: The model is trained on limited data, with novel augmentation strategies applied to improve performance on three small datasets: pixel art video game maps, video game sprite images, and down-scaled emoji images.
  • results: Evaluated by the cosine similarity between text-image pairs under the CLIP ViT-B/32 model, the generated images remain accurate and aesthetically pleasing while preserving the encoded semantic meaning of the textual prompt, despite the small size of both the model and the datasets.
    Abstract The five-dollar model is a lightweight text-to-image generative architecture that generates low dimensional images from an encoded text prompt. This model can successfully generate accurate and aesthetically pleasing content in low dimensional domains, with limited amounts of training data. Despite the small size of both the model and datasets, the generated images are still able to maintain the encoded semantic meaning of the textual prompt. We apply this model to three small datasets: pixel art video game maps, video game sprite images, and down-scaled emoji images and apply novel augmentation strategies to improve the performance of our model on these limited datasets. We evaluate our models performance using cosine similarity score between text-image pairs generated by the CLIP VIT-B/32 model.
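The evaluation metric, cosine similarity between CLIP ViT-B/32 text and image embeddings, reduces to the standard formula; the vectors below are placeholders, not real CLIP embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: the dot product
    divided by the product of the norms, so only the angle between the
    embeddings matters, not their magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Scoring a generated image against its text prompt then amounts to embedding both with CLIP and computing this value, with higher scores indicating that the image preserved the prompt's semantics.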

A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset

  • paper_url: http://arxiv.org/abs/2308.04037
  • repo_url: None
  • paper_authors: Mamata Das, Selvakumar K., P. J. A. Alphonse
  • for: Text classification on unstructured data, with a focus on feature weighting methods.
  • methods: Two feature representations, N-grams and TF-IDF, are evaluated on the IMDB movie reviews and Amazon Alexa reviews datasets for sentiment analysis, using state-of-the-art classifiers: SVM, Logistic Regression, Multinomial Naive Bayes, Random Forest, Decision Tree, and k-nearest neighbors (KNN).
  • results: TF-IDF features yield a significant improvement over N-gram features, achieving the best accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.
    Abstract Text Classification is the process of categorizing text into the relevant categories and its algorithms are at the core of many Natural Language Processing (NLP). Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are the most highly used information retrieval methods in text classification. We have investigated and analyzed the feature weighting method for text classification on unstructured data. The proposed model considered two features N-Grams and TF-IDF on the IMDB movie reviews and Amazon Alexa reviews dataset for sentiment analysis. Then we have used the state-of-the-art classifier to validate the method i.e., Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). From those two feature extractions, a significant increase in feature extraction with TF-IDF features rather than based on N-Gram. TF-IDF got the maximum accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) value in Random Forest classifier.
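For reference, the TF-IDF weighting compared here can be sketched as follows (a hand-rolled version mirroring scikit-learn's smoothed IDF; the paper's exact preprocessing is not specified):

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF weights for a list of tokenized documents.

    tf is the within-document frequency; idf = log((1+N)/(1+df)) + 1,
    where N is the number of documents and df the number of documents
    containing the term. Terms appearing in every document get the
    minimal idf of 1, rarer terms get more weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights
```

This down-weighting of ubiquitous terms is what gives TF-IDF its edge over raw N-gram counts for sentiment classification on review text.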

Continual Pre-Training of Large Language Models: How to (re)warm your model?

  • paper_url: http://arxiv.org/abs/2308.04014
  • repo_url: None
  • paper_authors: Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort
  • for: To explore efficient continual pre-training of large language models, i.e., updating pre-trained models with new data instead of re-training them from scratch.
  • methods: Studies different warm-up strategies, continuing pre-training of the Pythia 410M architecture from the Pile (upstream data, 300B tokens) on SlimPajama (downstream data, 297B tokens) with a linear warmup and cosine decay schedule, varying pre-training checkpoints, maximum learning rates, and warmup lengths; performance is evaluated via validation perplexity.
  • results: Re-warming models first increases the loss on upstream and downstream data, but in the longer run improves downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
    Abstract Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch$\unicode{x2013}$even for a large downstream dataset.
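The linear warmup and cosine decay schedule the authors follow can be sketched generically (the maximum learning rate and warmup length are exactly the quantities varied in the paper, so they are free parameters here, and the decay floor is assumed to be zero):

```python
import math

def lr_schedule(step, total_steps, max_lr, warmup_steps):
    """Linear warmup to max_lr over warmup_steps, then cosine decay
    towards zero over the remaining steps."""
    if step < warmup_steps:
        # linear ramp: re-increasing the learning rate when training
        # on the new dataset is the "re-warming" under study
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

Plugging different `max_lr` and `warmup_steps` values into a schedule of this shape is how the warm-up strategies compared in the paper differ.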

Simple synthetic data reduces sycophancy in large language models

  • paper_url: http://arxiv.org/abs/2308.03958
  • repo_url: https://github.com/google/sycophancy-intervention
  • paper_authors: Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le
  • for: To study sycophancy in language models, the undesirable tendency to tailor responses to a user's view even when that view is not objectively correct, and to propose a simple synthetic-data intervention to reduce it.
  • methods: Models are evaluated on three sycophancy tasks (Perez et al., 2022) across model scales and instruction tuning settings, as well as on simple addition statements that are objectively incorrect.
  • results: Model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters, and models agree with objectively wrong statements when the user does. Adding synthetic data built from public NLP tasks in a lightweight finetuning step significantly reduces sycophantic behavior on held-out prompts.
    Abstract Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
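The intervention described above builds training examples from public NLP tasks where the correct answer does not depend on the user's opinion. A toy sketch of that idea, with hypothetical prompt wording (not the released generation code), might look like:

```python
import random

def make_example(claim, truth, rng):
    """Pair an objective claim with a random (possibly wrong) user opinion;
    the training target stays the ground truth, so the model learns to
    ignore the stated opinion."""
    opinion = rng.choice(["agree", "disagree"])
    prompt = (f"Human: I {opinion} with the claim that {claim}. "
              "Is the claim true or false?\nAssistant:")
    return {"prompt": prompt, "target": "true" if truth else "false"}

rng = random.Random(0)
data = [make_example("1 + 1 = 2", True, rng) for _ in range(4)]
```

Finetuning on such pairs penalizes the model for flipping its answer to match the user, which is the sycophantic failure mode the paper measures.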

Universal Automatic Phonetic Transcription into the International Phonetic Alphabet

  • paper_url: http://arxiv.org/abs/2308.03917
  • repo_url: https://github.com/ctaguchi/multipa
  • paper_authors: Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, David Chiang
  • for: Developing a model that can transcribe speech in any language into the International Phonetic Alphabet (IPA).
  • methods: The model is based on wav2vec 2.0 and fine-tuned to predict IPA from audio input.
  • results: Trained on CommonVoice 11.0 data from seven languages, the model achieves quality close to that of human annotators, despite a much smaller training set than the previous best speech-to-IPA model (Wav2Vec2Phoneme).
    Abstract This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.
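A wav2vec 2.0 model fine-tuned for IPA prediction typically uses a character-level CTC head over IPA symbols. A minimal sketch of building such a label vocabulary (not the authors' code; the blank and word-boundary conventions follow common CTC practice):

```python
def build_ctc_vocab(transcriptions):
    """Character-level vocabulary over IPA transcriptions, reserving
    index 0 for the CTC blank and 1 for the word boundary."""
    symbols = sorted({ch for t in transcriptions for ch in t if ch != " "})
    vocab = {"<blank>": 0, "|": 1}
    vocab.update({s: i + 2 for i, s in enumerate(symbols)})
    return vocab

def encode(text, vocab):
    """Map an IPA string to CTC label ids (spaces become word boundaries)."""
    return [vocab["|"] if ch == " " else vocab[ch] for ch in text]

ipa = ["həloʊ wɝld", "kæt"]
vocab = build_ctc_vocab(ipa)
ids = encode(ipa[0], vocab)
```

Because IPA is a shared symbol inventory across languages, a single vocabulary built this way lets one model transcribe any language, which is the premise of the universal model above.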

A Cross-Domain Evaluation of Approaches for Causal Knowledge Extraction

  • paper_url: http://arxiv.org/abs/2308.03891
  • repo_url: https://github.com/aniksh/causal-spert
  • paper_authors: Anik Saha, Oktie Hassanzadeh, Alex Gittens, Jian Ni, Kavitha Srinivas, Bulent Yener
  • for: Extracting cause-effect relations from text.
  • methods: Sequence tagging models built on pre-trained language models (e.g., BERT) and a span-based approach to causal knowledge extraction.
  • results: Embeddings from pre-trained language models such as BERT provide a significant performance boost over previous state-of-the-art models, and the span-based approach outperforms simple sequence tagging models across all four datasets.
    Abstract Causal knowledge extraction is the task of extracting relevant causes and effects from text by detecting the causal relation. Although this task is important for language understanding and knowledge discovery, recent works in this domain have largely focused on binary classification of a text segment as causal or non-causal. In this regard, we perform a thorough analysis of three sequence tagging models for causal knowledge extraction and compare it with a span based approach to causality extraction. Our experiments show that embeddings from pre-trained language models (e.g. BERT) provide a significant performance boost on this task compared to previous state-of-the-art models with complex architectures. We observe that span based models perform better than simple sequence tagging models based on BERT across all 4 data sets from diverse domains with different types of cause-effect phrases.
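Sequence tagging models for this task emit per-token BIO labels, which must be decoded into cause and effect spans before they can be compared with span-based outputs. A minimal decoder, with an illustrative tag set:

```python
def bio_to_spans(tokens, tags):
    """Decode per-token BIO tags (B-CAUSE, I-CAUSE, B-EFFECT, O) into
    labeled text spans, making tagging output comparable to span models."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if (tag == "O" or tag.startswith("B-")) and start is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

tokens = "heavy rain caused severe flooding".split()
tags = ["B-CAUSE", "I-CAUSE", "O", "B-EFFECT", "I-EFFECT"]
spans = bio_to_spans(tokens, tags)
```

Span-based models skip this decoding step by scoring candidate spans directly, which is one reason they can handle nested or overlapping cause-effect phrases more gracefully.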

Generative Benchmark Creation for Table Union Search

  • paper_url: http://arxiv.org/abs/2308.03883
  • repo_url: https://github.com/northeastern-datalab/alt-gen
  • paper_authors: Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller
  • for: Creating structured-data benchmarks with generative AI models for semantic data management problems, specifically table union search.
  • methods: A novel method uses generative models to create tables with specified properties, producing a benchmark of table pairs that are unionable or non-unionable but related; existing table union search methods and a new LLM-based search method are evaluated on both existing, manually curated benchmarks and the new one.
  • results: The generated benchmark is more challenging than manually curated ones and permits more detailed analysis of method performance. Specifically, the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks.
    Abstract Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable but related. We thoroughly evaluate recent existing table union search methods over existing benchmarks and our new benchmark. We also present and evaluate a new table search methods based on recent large language models over all benchmarks. We show that the new benchmark is more challenging for all methods than hand-curated benchmarks, specifically, the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks. We examine why this is the case and show that the new benchmark permits more detailed analysis of methods, including a study of both false positives and false negatives that were not possible with existing benchmarks.
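Table union search scores how well two tables' columns align. As a deliberately simple stand-in for the learned methods evaluated above, a value-overlap-based unionability score could look like this (real methods use semantic embeddings rather than raw set overlap, which is exactly why semantically coherent but value-disjoint tables make the benchmark hard):

```python
def column_jaccard(col_a, col_b):
    """Jaccard overlap between two columns' value sets."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def unionability(table_a, table_b):
    """Average best-match Jaccard over table_a's columns (greedy alignment)."""
    if not table_a or not table_b:
        return 0.0
    return sum(
        max(column_jaccard(col, other) for other in table_b.values())
        for col in table_a.values()
    ) / len(table_a)

t1 = {"city": ["Paris", "Rome"], "country": ["France", "Italy"]}
t2 = {"city": ["Rome", "Berlin"], "country": ["Italy", "Germany"]}
score = unionability(t1, t2)
```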

Semantic Equivalence of e-Commerce Queries

  • paper_url: http://arxiv.org/abs/2308.03869
  • repo_url: None
  • paper_authors: Aritra Mandal, Daniel Tunkelang, Zhe Wu
  • for: Improving searcher experience and business outcomes in e-commerce search.
  • methods: A framework that recognizes and leverages query equivalence by combining surface similarity (query canonicalization) with behavioral similarity (vector representations of query intent learned from historical search behavior).
  • results: Experiments show the framework effectively identifies equivalent queries, outperforming popular sentence transformer models and achieving a Pearson correlation of 0.85 for query similarity, indicating its potential to improve user experience and business outcomes in e-commerce search.
    Abstract Search query variation poses a challenge in e-commerce search, as equivalent search intents can be expressed through different queries with surface-level differences. This paper introduces a framework to recognize and leverage query equivalence to enhance searcher and business outcomes. The proposed approach addresses three key problems: mapping queries to vector representations of search intent, identifying nearest neighbor queries expressing equivalent or similar intent, and optimizing for user or business objectives. The framework utilizes both surface similarity and behavioral similarity to determine query equivalence. Surface similarity involves canonicalizing queries based on word inflection, word order, compounding, and noise words. Behavioral similarity leverages historical search behavior to generate vector representations of query intent. An offline process is used to train a sentence similarity model, while an online nearest neighbor approach supports processing of unseen queries. Experimental evaluations demonstrate the effectiveness of the proposed approach, outperforming popular sentence transformer models and achieving a Pearson correlation of 0.85 for query similarity. The results highlight the potential of leveraging historical behavior data and training models to recognize and utilize query equivalence in e-commerce search, leading to improved user experiences and business outcomes. Further advancements and benchmark datasets are encouraged to facilitate the development of solutions for this critical problem in the e-commerce domain.
    Further advancements and benchmark datasets are encouraged to facilitate the development of solutions for this critical problem in the e-commerce domain.
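The surface-similarity step above canonicalizes queries over word inflection, word order, compounding, and noise words. A deliberately crude sketch of that idea (the trailing-'s' strip stands in for real inflection handling such as lemmatization, and the noise-word list is invented):

```python
NOISE = {"the", "a", "an", "for", "of", "with"}

def canonicalize(query):
    """Crude surface canonicalization: lowercase, drop noise words,
    strip a trailing plural 's', and sort tokens to ignore word order."""
    tokens = []
    for tok in query.lower().split():
        if tok in NOISE:
            continue
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]  # naive stand-in for lemmatization
        tokens.append(tok)
    return " ".join(sorted(tokens))
```

Queries with the same canonical form are treated as surface-equivalent; behavioral similarity then catches equivalences (e.g., synonyms) that no surface rule can.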

Storyfier: Exploring Vocabulary Learning Support with Text Generation Models

  • paper_url: http://arxiv.org/abs/2308.03864
  • repo_url: None
  • paper_authors: Zhenhui Peng, Xingbo Wang, Qiushi Han, Junkai Zhu, Xiaojuan Ma, Huamin Qu
  • for: Supporting vocabulary learning with text generation models.
  • methods: Storyfier uses text generation models to produce stories covering any target words, and lets learners take a story cloze test and write new stories with adaptive AI assistance.
  • results: Learners generally favored the generated stories and the writing assistance, but in read-cloze-write sessions, participants using Storyfier performed worse at recalling and using target words than those using a baseline tool without the AI features.
    Abstract Vocabulary learning support tools have widely exploited existing materials, e.g., stories or video clips, as contexts to help users memorize each target word. However, these tools could not provide a coherent context for any target words of learners' interests, and they seldom help practice word usage. In this paper, we work with teachers and students to iteratively develop Storyfier, which leverages text generation models to enable learners to read a generated story that covers any target words, conduct a story cloze test, and use these words to write a new story with adaptive AI assistance. Our within-subjects study (N=28) shows that learners generally favor the generated stories for connecting target words and writing assistance for easing their learning workload. However, in the read-cloze-write learning sessions, participants using Storyfier perform worse in recalling and using target words than learning with a baseline tool without our AI features. We discuss insights into supporting learning tasks with generative models.
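The story cloze test described above can be generated mechanically by blanking the target words out of the generated story. A small regex-based sketch (not the Storyfier implementation):

```python
import re

def make_cloze(story, target_words):
    """Blank each target word in a generated story to form a cloze test;
    returns the cloze text and the answers in order of appearance."""
    answers = []
    def blank(match):
        answers.append(match.group(0))
        return "____"
    pattern = r"\b(" + "|".join(re.escape(w) for w in target_words) + r")\b"
    cloze = re.sub(pattern, blank, story, flags=re.IGNORECASE)
    return cloze, answers

story = "The resilient hiker showed great perseverance on the steep trail."
cloze, answers = make_cloze(story, ["resilient", "perseverance"])
```

Because the story is generated to contain every target word, the cloze step is guaranteed to produce one blank per word, keeping the read-cloze-write loop coherent.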

Extracting detailed oncologic history and treatment plan from medical oncology notes with large language models

  • paper_url: http://arxiv.org/abs/2308.03853
  • repo_url: https://github.com/madhumitasushil/oncllmextraction
  • paper_authors: Madhumita Sushil, Vanessa E. Kennedy, Brenda Y. Miao, Divneet Mandair, Travis Zack, Atul J. Butte
  • for: Evaluating three recently released large language models (GPT-4, GPT-3.5-turbo, and FLAN-UL2) on extracting detailed oncologic history and treatment plans from medical oncology notes.
  • methods: A detailed schema for annotating textual oncology information, covering patient characteristics, tumor characteristics, tests, treatments, and temporality, was applied to 10 de-identified breast cancer progress notes and used to assess zero-shot extraction by the three models.
  • results: GPT-4 performed best overall, with an average BLEU score of 0.69, an average ROUGE score of 0.72, and 67% accuracy on complex tasks. It was especially proficient at extracting tumor characteristics and medications, and at inferring cancer-related symptoms and considerations of future medications, suggesting it may already be usable for extracting key information from cancer progress notes for clinical research, complex population management, and assessing quality patient care.
    Abstract Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, often elaborately documented in clinical notes. Despite their vital role, no current oncology information representation and annotation schema fully encapsulates the diversity of information recorded within these notes. Although large language models (LLMs) have recently exhibited impressive performance on various medical natural language processing tasks, due to the current lack of comprehensively annotated oncology datasets, an extensive evaluation of LLMs in extracting and reasoning with the complex rhetoric in oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, encompassing patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 10 de-identified breast cancer progress notes at University of California, San Francisco, we applied this schema to assess the abilities of three recently-released LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to perform zero-shot extraction of detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 2750 entities, 2874 modifiers, and 1623 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.69, an average ROUGE score of 0.72, and an average accuracy of 67% on complex tasks (expert manual evaluation). Notably, it was proficient in tumor characteristic and medication extraction, and demonstrated superior performance in inferring symptoms due to cancer and considerations of future medications. The analysis demonstrates that GPT-4 is potentially already usable to extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.
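Zero-shot extraction with an LLM amounts to prompting it to fill the annotation schema as structured output. A sketch of such a prompt builder, with an invented miniature schema and wording (the study's actual schema and prompts are far more detailed):

```python
import json

# Hypothetical two-field slice of an oncology annotation schema.
SCHEMA = {
    "tumor_characteristics": ["site", "histology", "stage"],
    "treatments": ["name", "date", "type"],
}

def build_extraction_prompt(note, schema):
    """Zero-shot prompt asking an LLM to fill an oncology schema as JSON."""
    return (
        "Extract the following fields from the oncology note as JSON, "
        "using null when a field is absent.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Note: {note}\n"
        "JSON:"
    )

prompt = build_extraction_prompt("Left breast IDC, stage IIA; started AC-T.", SCHEMA)
```

The model's JSON response would then be parsed and scored against the manual annotations (entities, modifiers, relationships) with BLEU, ROUGE, and expert review, as in the evaluation above.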

What about translation? New coding system for content analysis on the perception of literary translation around the political transformation in 1989 in Hungary as a classification problem on an unbalanced dataset

  • paper_url: http://arxiv.org/abs/2308.03742
  • repo_url: None
  • paper_authors: Dalma Galambos, Pál Zsámboki
  • for: Tracking trends in the perception of literary translation around the political transformation of 1989 in Hungary.
  • methods: BERT models trained to carry the coding system over to the 1980-1999 issues of the literary journal Nagyvilág, with extensive hyperparameter tuning, loss functions robust to label imbalance, 10-fold cross-validation for precise evaluation, a model ensemble for prediction, manual validation on the prediction set, and a calibration method to better predict label counts.
  • results: A study of the relations between labels using label relation networks.
    Abstract To track trends in the perception of literary translation around the political transformation in 1989 in Hungary, a coding system was developed on the paragraphs of the 1980-1999 issues of the literary journal Alföld. This paper describes how we trained BERT models to carry over the coding system to the 1980-1999 issues of the literary journal Nagyvilág. We use extensive hyperparameter tuning, loss functions robust to label imbalance, 10-fold cross-validation for precise evaluations, a model ensemble for prediction, manual validation on the prediction set, and a new calibration method to better predict label counts for sections of the Nagyvilág corpus; to study the relations between labels, we construct label relation networks.
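One simple way to calibrate predicted label counts, in the spirit of the calibration step mentioned above (the paper's exact method may differ), is to set each label's decision threshold so that the predicted-positive rate on a reference set matches that label's known prevalence:

```python
def calibrate_threshold(scores, target_rate):
    """Choose a decision threshold so the predicted-positive rate on a
    reference score set matches the target label prevalence."""
    ranked = sorted(scores, reverse=True)
    k = round(target_rate * len(ranked))
    if k <= 0:
        return float("inf")   # predict nothing positive
    if k >= len(ranked):
        return float("-inf")  # predict everything positive
    return (ranked[k - 1] + ranked[k]) / 2  # midpoint between kth and (k+1)th

scores = [0.9, 0.7, 0.4, 0.2, 0.1]
threshold = calibrate_threshold(scores, target_rate=0.4)
preds = [s >= threshold for s in scores]
```

On an unbalanced dataset, a fixed 0.5 threshold systematically under- or over-counts rare labels; per-label thresholds of this kind keep section-level label counts closer to the validated distribution.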