cs.CL - 2023-08-10

AST-MHSA : Code Summarization using Multi-Head Self-Attention

  • paper_url: http://arxiv.org/abs/2308.05646
  • repo_url: None
  • paper_authors: Yeshwanth Nagaraj, Ujjwal Gupta
  • for: To improve the accuracy and efficiency of code summarization so that the important information in source code is better captured.
  • methods: A transformer-based encoder-decoder model whose encoder uses multi-head attention to extract the important semantic information from the code's AST.
  • results: Training and optimizing the model yields more accurate summaries with greater efficiency.
    Abstract Code summarization aims to generate concise natural language descriptions for source code. The prevailing approaches adopt transformer-based encoder-decoder architectures, where the Abstract Syntax Tree (AST) of the source code is utilized for encoding structural information. However, ASTs are much longer than the corresponding source code, and existing methods ignore this size constraint by directly feeding the entire linearized AST into the encoders. This simplistic approach makes it challenging to extract truly valuable dependency relations from the overlong input sequence and leads to significant computational overhead due to self-attention applied to all nodes in the AST. To address this issue effectively and efficiently, we present a model, AST-MHSA that uses multi-head attention to extract the important semantic information from the AST. The model consists of two main components: an encoder and a decoder. The encoder takes as input the abstract syntax tree (AST) of the code and generates a sequence of hidden states. The decoder then takes these hidden states as input and generates a natural language summary of the code. The multi-head attention mechanism allows the model to learn different representations of the input code, which can be combined to generate a more comprehensive summary. The model is trained on a dataset of code and summaries, and the parameters of the model are optimized to minimize the loss between the generated summaries and the ground-truth summaries.
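
A minimal PyTorch sketch of the encoder idea described above (multi-head self-attention over embeddings of a linearized AST); the class name, dimensions, and vocabulary size are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class ASTEncoder(nn.Module):
    """Encodes a linearized AST with multi-head self-attention (illustrative only)."""
    def __init__(self, vocab_size=5000, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, ast_node_ids, pad_mask=None):
        # ast_node_ids: (batch, num_ast_nodes) integer ids of linearized AST nodes
        x = self.embed(ast_node_ids)
        return self.encoder(x, src_key_padding_mask=pad_mask)  # (batch, nodes, d_model)

# A standard transformer decoder would then attend over these hidden states
# to generate the natural-language summary token by token.
enc = ASTEncoder()
dummy_ast = torch.randint(0, 5000, (2, 120))  # 2 examples, 120 AST nodes each
print(enc(dummy_ast).shape)                   # torch.Size([2, 120, 256])
```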

IIHT: Medical Report Generation with Image-to-Indicator Hierarchical Transformer

  • paper_url: http://arxiv.org/abs/2308.05633
  • repo_url: None
  • paper_authors: Keqiang Fan, Xiaohao Cai, Mahesan Niranjan
  • for: To propose an image-to-indicator hierarchical transformer method for medical report generation that helps clinicians write detailed and accurate reports.
  • methods: The framework consists of three modules: a classifier module, an indicator expansion module, and a generator module. The classifier module first extracts disease-related features from the medical images and produces disease-related indicators with their states; the indicator expansion module expands these indicators using a "data-text-data" strategy; the generator module then uses the expanded indicators together with the image features as auxiliary information to produce the final report.
  • results: Experiments show that the proposed IIHT method generates high-quality medical reports and performs strongly under various evaluation metrics. The method also allows radiologists to modify disease indicators in real-world use, keeping the reports accurate and fluent.
    Abstract Automated medical report generation has become increasingly important in medical analysis. It can produce computer-aided diagnosis descriptions and thus significantly alleviate the doctors' work. Inspired by the huge success of neural machine translation and image captioning, various deep learning methods have been proposed for medical report generation. However, due to the inherent properties of medical data, including data imbalance and the length and correlation between report sequences, the generated reports by existing methods may exhibit linguistic fluency but lack adequate clinical accuracy. In this work, we propose an image-to-indicator hierarchical transformer (IIHT) framework for medical report generation. It consists of three modules, i.e., a classifier module, an indicator expansion module and a generator module. The classifier module first extracts image features from the input medical images and produces disease-related indicators with their corresponding states. The disease-related indicators are subsequently utilised as input for the indicator expansion module, incorporating the "data-text-data" strategy. The transformer-based generator then leverages these extracted features along with image features as auxiliary information to generate final reports. Furthermore, the proposed IIHT method is feasible for radiologists to modify disease indicators in real-world scenarios and integrate the operations into the indicator expansion module for fluent and accurate medical report generation. Extensive experiments and comparisons with state-of-the-art methods under various evaluation metrics demonstrate the great performance of the proposed method.
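
A schematic sketch of the three-stage flow described above (classifier → indicator expansion → generator); the indicator names, function signatures, and stub outputs are assumptions for illustration, not the paper's implementation:

```python
# Illustrative IIHT-style pipeline with the three modules stubbed out.
def classifier_module(image_features):
    """Stage 1: predict a state for each disease-related indicator from image features."""
    # A real implementation would run a vision backbone; here we return fixed states.
    return {"cardiomegaly": "absent", "pleural_effusion": "present", "pneumothorax": "absent"}

def indicator_expansion_module(indicator_states):
    """Stage 2: the 'data-text-data' strategy verbalizes the indicators into text tokens."""
    text = "; ".join(f"{name} {state}" for name, state in indicator_states.items())
    return text.split()

def generator_module(indicator_tokens, image_features):
    """Stage 3: a transformer generator conditions on indicator tokens plus image features."""
    # Stand-in for autoregressive decoding of the final report.
    return "Report drafted from indicators: " + " ".join(indicator_tokens)

report = generator_module(indicator_expansion_module(classifier_module(None)), None)
print(report)
```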

LASIGE and UNICAGE solution to the NASA LitCoin NLP Competition

  • paper_url: http://arxiv.org/abs/2308.05609
  • repo_url: None
  • paper_authors: Pedro Ruas, Diana F. Sousa, André Neves, Carlos Cruz, Francisco M. Couto
  • for: To make biomedical natural language processing (NLP) more efficient and better suited to the data-processing needs of the biomedical domain.
  • methods: Industry data engineering solutions are integrated with academic systems (LasigeUnicage_NER for named entity recognition and BiOnt for relation extraction), and external knowledge (additional training data from other datasets and biomedical ontologies) is incorporated into the pipeline.
  • results: The pipeline earned the team 7th place in the 2022 LitCoin NLP Challenge, reflecting a successful collaboration between academia (LASIGE) and industry (Unicage). The software supporting this work is available on GitHub: https://github.com/lasigeBioTM/Litcoin-Lasige_Unicage.
    Abstract Biomedical Natural Language Processing (NLP) tends to become cumbersome for most researchers, frequently due to the amount and heterogeneity of text to be processed. To address this challenge, the industry is continuously developing highly efficient tools and creating more flexible engineering solutions. This work presents the integration between industry data engineering solutions for efficient data processing and academic systems developed for Named Entity Recognition (LasigeUnicage_NER) and Relation Extraction (BiOnt). Our design reflects an integration of those components with external knowledge in the form of additional training data from other datasets and biomedical ontologies. We used this pipeline in the 2022 LitCoin NLP Challenge, where our team LasigeUnicage was awarded the 7th Prize out of approximately 200 participating teams, reflecting a successful collaboration between the academia (LASIGE) and the industry (Unicage). The software supporting this work is available at https://github.com/lasigeBioTM/Litcoin-Lasige_Unicage.

You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

  • paper_url: http://arxiv.org/abs/2308.05596
  • repo_url: None
  • paper_authors: Xinlei He, Savvas Zannettou, Yun Shen, Yang Zhang
  • for: How large language models (LLMs) and prompt learning can be used to tackle toxic content online.
  • methods: LLMs with prompt learning are evaluated across five model architectures and eight datasets on three tasks: toxicity classification, toxic span detection, and detoxification.
  • results: LLMs with prompt learning achieve similar or better performance than task-specific models on toxicity classification and toxic span detection, and on detoxification they successfully reduce the average toxicity score (from 0.775 to 0.213) while preserving semantic meaning.
    Abstract The spread of toxic content online is an important problem that has adverse effects on user experience online and in our society at large. Motivated by the importance and impact of the problem, research focuses on developing solutions to detect toxic content, usually leveraging machine learning (ML) models trained on human-annotated datasets. While these efforts are important, these models usually do not generalize well and they can not cope with new trends (e.g., the emergence of new toxic terms). Currently, we are witnessing a shift in the approach to tackling societal issues online, particularly leveraging large language models (LLMs) like GPT-3 or T5 that are trained on vast corpora and have strong generalizability. In this work, we investigate how we can use LLMs and prompt learning to tackle the problem of toxic content, particularly focusing on three tasks; 1) Toxicity Classification, 2) Toxic Span Detection, and 3) Detoxification. We perform an extensive evaluation over five model architectures and eight datasets demonstrating that LLMs with prompt learning can achieve similar or even better performance compared to models trained on these specific tasks. We find that prompt learning achieves around 10\% improvement in the toxicity classification task compared to the baselines, while for the toxic span detection task we find better performance to the best baseline (0.643 vs. 0.640 in terms of $F_1$-score). Finally, for the detoxification task, we find that prompt learning can successfully reduce the average toxicity score (from 0.775 to 0.213) while preserving semantic meaning.
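
A minimal sketch of framing toxicity classification as prompting an off-the-shelf text-to-text model (here T5 via the `transformers` library); the prompt wording, checkpoint, and label mapping are assumptions, not the paper's exact setup:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def classify_toxicity(text: str) -> str:
    # Frame the task as text-to-text: the model is asked to answer "toxic" or "not toxic".
    prompt = f'Is the following comment toxic? Answer "toxic" or "not toxic". Comment: {text}'
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(classify_toxicity("You are a wonderful person."))
```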

Do Language Models Refer?

  • paper_url: http://arxiv.org/abs/2308.05576
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Matthew Mandelkern, Tal Linzen
  • for: To examine whether language models (LMs) say anything with the strings they produce, i.e., whether their words refer.
  • methods: Drawing on the externalist tradition in philosophy of language, the paper analyzes whether LM outputs can achieve "word-to-world" connections.
  • results: The paper argues that appearances are misleading and that there is good reason to think LM outputs do achieve word-to-world connections, i.e., that LMs can refer.
    Abstract What do language models (LMs) do with language? Everyone agrees that they produce sequences of (mostly) coherent sentences. But are they saying anything with those strings or simply babbling in a convincing simulacrum of language use? This is a vague question, and there are many ways of making it precise. Here we will address one aspect of the question, namely, whether LMs' words refer: that is, whether the outputs of LMs achieve "word-to-world" connections. There is prima facie reason to think they do not since LMs do not interact with the world in the way that ordinary language users do. Drawing on insights from the externalist tradition in philosophy of language, we argue that appearances are misleading and that there is good reason to think that LMs can refer.

Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages

  • paper_url: http://arxiv.org/abs/2308.05574
  • repo_url: None
  • paper_authors: Danish Ebadulla, Rahul Raman, S. Natarajan, Hridhay Kiran Shetty, Ashish Harish Shenoy
  • for: zero-shot translation between Dravidian languages
  • methods: exploiting transliteration and linguistic similarity in a single encoder-decoder multilingual NMT model
  • results: achieves scores within 3 BLEU of large-scale pivot-based models when trained on 50% of the language directions
    Abstract Current research in zero-shot translation is plagued by several issues such as high compute requirements, increased training time and off target translations. Proposed remedies often come at the cost of additional data or compute requirements. Pivot based neural machine translation is preferred over a single-encoder model for most settings despite the increased training and evaluation time. In this work, we overcome the shortcomings of zero-shot translation by taking advantage of transliteration and linguistic similarity. We build a single encoder-decoder neural machine translation system for Dravidian-Dravidian multilingual translation and perform zero-shot translation. We compare the data vs zero-shot accuracy tradeoff and evaluate the performance of our vanilla method against the current state of the art pivot based method. We also test the theory that morphologically rich languages require large vocabularies by restricting the vocabulary using an optimal transport based technique. Our model manages to achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50\% of the language directions.
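
A small data-preparation sketch of the single-model multilingual setup described above: each source sentence is transliterated to a common script and prefixed with a target-language tag so one encoder-decoder covers all Dravidian-Dravidian directions. The tag format and the `transliterate_to_common_script` stub are assumptions, not the paper's exact pipeline:

```python
def transliterate_to_common_script(sentence: str, src_lang: str) -> str:
    # Placeholder: a real system would map e.g. Kannada/Telugu/Malayalam script
    # into one shared script before tokenization.
    return sentence

def make_training_pair(src_sentence, tgt_sentence, src_lang, tgt_lang):
    # Target-language tag tells the shared decoder which language to produce.
    src = f"<2{tgt_lang}> " + transliterate_to_common_script(src_sentence, src_lang)
    return src, tgt_sentence

pair = make_training_pair("<kannada sentence>", "<malayalam sentence>", "kn", "ml")
print(pair)  # ('<2ml> <kannada sentence>', '<malayalam sentence>')
```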

Exploring Machine Learning and Transformer-based Approaches for Deceptive Text Classification: A Comparative Analysis

  • paper_url: http://arxiv.org/abs/2308.05476
  • repo_url: None
  • paper_authors: Anusuya Krishnan
  • for: To compare machine learning and transformer-based approaches for deceptive text classification.
  • methods: Traditional machine learning algorithms and state-of-the-art transformer models such as BERT, XLNET, DistilBERT, and RoBERTa are trained and evaluated on a labeled dataset of deceptive and non-deceptive texts.
  • results: Extensive experiments compare accuracy, precision, recall, and F1 score across the approaches, highlighting the strengths and limitations of each and helping researchers and practitioners make informed decisions when dealing with deceptive content.
    Abstract Deceptive text classification is a critical task in natural language processing that aims to identify deceptive or fraudulent content. This study presents a comparative analysis of machine learning and transformer-based approaches for deceptive text classification. We investigate the effectiveness of traditional machine learning algorithms and state-of-the-art transformer models, such as BERT, XLNET, DistilBERT, and RoBERTa, in detecting deceptive text. A labeled dataset consisting of deceptive and non-deceptive texts is used for training and evaluation purposes. Through extensive experimentation, we compare the performance metrics, including accuracy, precision, recall, and F1 score, of the different approaches. The results of this study shed light on the strengths and limitations of machine learning and transformer-based methods for deceptive text classification, enabling researchers and practitioners to make informed decisions when dealing with deceptive content.
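
A toy sketch of the "traditional ML" side of such a comparison (TF-IDF features plus logistic regression, reporting precision/recall/F1); the example texts and labels are invented for illustration, not the study's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

texts = ["great product, totally honest review", "buy now!!! best deal ever, trust me",
         "the hotel was clean and quiet", "this miracle cure fixed everything overnight"]
labels = [0, 1, 0, 1]  # 0 = truthful, 1 = deceptive

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)
pred = clf.predict(vec.transform(X_test))
print(precision_recall_fscore_support(y_test, pred, average="binary", zero_division=0))
```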

WeaverBird: Empowering Financial Decision-Making with Large Language Model, Knowledge Base, and Search Engine

  • paper_url: http://arxiv.org/abs/2308.05361
  • repo_url: https://github.com/ant-research/fin_domain_llm
  • paper_authors: Siqiao Xue, Fan Zhou, Yi Xu, Hongyu Zhao, Shuo Xie, Caigao Jiang, James Zhang, Jun Zhou, Peng Xu, Dacheng Xiu, Hongyuan Mei
  • for: To develop an intelligent dialogue system tailored to the finance domain, called WeaverBird.
  • methods: A large language model of GPT architecture is tuned on extensive corpora of finance-related text, enabling the system to understand complex financial queries such as "How should I manage my investments during inflation?" and provide informed answers. The system also incorporates a local knowledge base and a search engine to retrieve relevant information; final responses are conditioned on the search results and include proper citations to the sources, giving them enhanced credibility.
  • results: Across a range of finance-related questions, WeaverBird outperforms other models. A live demo is available at https://weaverbird.ttic.edu, and a 2-minute video illustration at https://www.youtube.com/watch?v=yofgeqnlrMc.
    Abstract We present WeaverBird, an intelligent dialogue system designed specifically for the finance domain. Our system harnesses a large language model of GPT architecture that has been tuned using extensive corpora of finance-related text. As a result, our system possesses the capability to understand complex financial queries, such as "How should I manage my investments during inflation?", and provide informed responses. Furthermore, our system incorporates a local knowledge base and a search engine to retrieve relevant information. The final responses are conditioned on the search results and include proper citations to the sources, thus enjoying an enhanced credibility. Through a range of finance-related questions, we have demonstrated the superior performance of our system compared to other models. To experience our system firsthand, users can interact with our live demo at https://weaverbird.ttic.edu, as well as watch our 2-min video illustration at https://www.youtube.com/watch?v=yofgeqnlrMc.
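
A generic retrieval-augmented answering sketch in the spirit described above (retrieve from a knowledge base and a search engine, then condition the LLM on the retrieved passages and require citations); the retrieval functions, prompt format, and stub data are assumptions, not WeaverBird's actual code:

```python
def search_local_knowledge_base(query):
    return [{"id": "kb-1", "text": "During inflation, diversifying across asset classes can reduce risk."}]

def search_web(query):
    return [{"id": "web-1", "text": "Treasury inflation-protected securities adjust principal with CPI."}]

def build_prompt(query, passages):
    cited = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (f"Answer the question using only the sources below and cite them as [id].\n"
            f"Sources:\n{cited}\n\nQuestion: {query}\nAnswer:")

def answer(query, llm=lambda prompt: "(LLM response conditioned on the cited sources)"):
    # Final response is conditioned on the retrieved evidence rather than the LLM alone.
    passages = search_local_knowledge_base(query) + search_web(query)
    return llm(build_prompt(query, passages))

print(answer("How should I manage my investments during inflation?"))
```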

Developing an Informal-Formal Persian Corpus

  • paper_url: http://arxiv.org/abs/2308.05336
  • repo_url: None
  • paper_authors: Vahide Tajalli, Fateme Kalantari, Mehrnoush Shamsfard
  • for: To present a methodology for building a large aligned parallel corpus of informal-formal Persian sentences, as a basis for developing processing tools for colloquial Persian.
  • methods: Sentence pairs were gathered in two ways: collecting sentences from different resources of informal scripts, and following the phonological and morphological patterns of change to find as many instances as possible.
  • results: The resulting corpus contains 50,000 sentence pairs aligned at the word/phrase level, with about 530,000 alignments and a dictionary of 49,397 word and phrase pairs.
    Abstract Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails and text messages. In informal writing, the language faces some lexical and/or syntactic changes varying among different languages. Persian is one of the languages with many differences between its formal and informal styles of writing, thus developing informal language processing tools for this language seems necessary. Such a converter needs a large aligned parallel corpus of colloquial-formal sentences which can be useful for linguists to extract a regulated grammar and orthography for colloquial Persian as is done for the formal language. In this paper we explain our methodology in building a parallel corpus of 50,000 sentence pairs with alignments in the word/phrase level. The sentences were attempted to cover almost all kinds of lexical and syntactic changes between informal and formal Persian, therefore both methods of exploring and collecting from the different resources of informal scripts and following the phonological and morphological patterns of changes were applied to find as much instances as possible. The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs.

Few-Shot Data-to-Text Generation via Unified Representation and Multi-Source Learning

  • paper_url: http://arxiv.org/abs/2308.05317
  • repo_url: None
  • paper_authors: Alexander Hanbo Li, Mingyue Shang, Evangelia Spiliopoulou, Jie Ma, Patrick Ng, Zhiguo Wang, Bonan Min, William Wang, Kathleen McKeown, Vittorio Castelli, Dan Roth, Bing Xiang
  • for: To propose a new structured data-to-text generation method that overcomes the limitations of existing approaches, which are mostly tailored to specific types of structured data.
  • methods: A unified representation that can handle various forms of structured data, such as tables, knowledge graph triples, and meaning representations, combined with multi-task training and evaluation in zero-shot and few-shot settings.
  • results: The approach adapts effectively to new structured forms and clearly outperforms current methods, e.g., a 66% improvement in zero-shot BLEU when transferring models trained on table inputs to a knowledge graph dataset.
    Abstract We present a novel approach for structured data-to-text generation that addresses the limitations of existing methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph triples, and meaning representations. We demonstrate that our proposed approach can effectively adapt to new structured forms, and can improve performance in comparison to current methods. For example, our method resulted in a 66% improvement in zero-shot BLEU scores when transferring models trained on table inputs to a knowledge graph dataset. Our proposed method is an important step towards a more general data-to-text generation framework.
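
A small sketch of one possible "unified representation": tables and knowledge-graph triples are both flattened into the same (subject, property, value) text format before being fed to a generator, so one model can consume either source. The exact verbalization scheme is an assumption, not the paper's:

```python
def linearize_triples(triples):
    return " | ".join(f"{s} : {p} : {o}" for s, p, o in triples)

def linearize_table(header, rows):
    # Treat the first column as the subject and every other cell as (subject, column, value).
    triples = []
    for row in rows:
        subject = row[0]
        for col, value in zip(header[1:], row[1:]):
            triples.append((subject, col, value))
    return linearize_triples(triples)

kg = [("Alan Turing", "field", "computer science"), ("Alan Turing", "born", "1912")]
table = linearize_table(["name", "field", "born"], [["Alan Turing", "computer science", "1912"]])
print(linearize_triples(kg))
print(table)  # identical surface form, so the same generator handles both inputs
```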

Investigating disaster response through social media data and the Susceptible-Infected-Recovered (SIR) model: A case study of 2020 Western U.S. wildfire season

  • paper_url: http://arxiv.org/abs/2308.05281
  • repo_url: None
  • paper_authors: Zihui Ma, Lingyao Li, Libby Hemphill, Gregory B. Baecher
  • for: To give decision-makers reliable and timely measures of disaster impacts so they can better support affected communities.
  • methods: BERT-based topic modeling clusters topics from Twitter data, followed by a temporal-spatial analysis of how these topics were distributed across regions during the 2020 western U.S. wildfire season.
  • results: Twitter users mainly focused on three topics: "health impact," "damage," and "evacuation." Applying the Susceptible-Infected-Recovered (SIR) model to the magnitude and velocity of topic diffusion shows a clear relationship between topic trends and wildfire propagation patterns, and the parameters estimated for selected cities indicate that residents expressed a high level of concern during the wildfires.
    Abstract Effective disaster response is critical for affected communities. Responders and decision-makers would benefit from reliable, timely measures of the issues impacting their communities during a disaster, and social media offers a potentially rich data source. Social media can reflect public concerns and demands during a disaster, offering valuable insights for decision-makers to understand evolving situations and optimize resource allocation. We used Bidirectional Encoder Representations from Transformers (BERT) topic modeling to cluster topics from Twitter data. Then, we conducted a temporal-spatial analysis to examine the distribution of these topics across different regions during the 2020 western U.S. wildfire season. Our results show that Twitter users mainly focused on three topics:"health impact," "damage," and "evacuation." We used the Susceptible-Infected-Recovered (SIR) theory to explore the magnitude and velocity of topic diffusion on Twitter. The results displayed a clear relationship between topic trends and wildfire propagation patterns. The estimated parameters obtained from the SIR model in selected cities revealed that residents exhibited a high level of several concerns during the wildfire. Our study details how the SIR model and topic modeling using social media data can provide decision-makers with a quantitative approach to measure disaster response and support their decision-making processes.
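
A minimal SIR simulation of topic diffusion on social media, illustrating the epidemic-style dynamics the paper fits to Twitter topic trends; the parameter values and population size below are illustrative assumptions, not the estimates reported in the paper:

```python
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    S, I, R = y
    N = S + I + R
    dS = -beta * S * I / N             # susceptible users not yet engaged with the topic
    dI = beta * S * I / N - gamma * I  # users actively tweeting about the topic
    dR = gamma * I                     # users who stopped engaging
    return [dS, dI, dR]

t = np.linspace(0, 30, 300)                                 # 30 days
trajectory = odeint(sir, y0=[9990, 10, 0], t=t, args=(0.6, 0.2))
peak_day = t[np.argmax(trajectory[:, 1])]
print(f"Topic engagement peaks around day {peak_day:.1f}")
```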

A Novel Self-training Approach for Low-resource Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.05269
  • repo_url: None
  • paper_authors: Satwinder Singh, Feng Hou, Ruili Wang
  • for: To improve the accuracy of automatic speech recognition (ASR) for low-resource languages.
  • methods: A self-training approach generates highly accurate pseudo-labels for unlabeled low-resource speech, addressing the scarcity of annotated data that hinders ASR development for languages such as Punjabi and Māori.
  • results: Experiments on four real speech datasets show a relative word error rate improvement of 14.94% over the baseline model, with the best reported results on the Common Voice Punjabi dataset.
    Abstract In this paper, we propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. While self-training approaches have been extensively developed and evaluated for high-resource languages such as English, their applications to low-resource languages like Punjabi have been limited, despite the language being spoken by millions globally. The scarcity of annotated data has hindered the development of accurate ASR systems, especially for low-resource languages (e.g., Punjabi and M\=aori languages). To address this issue, we propose an effective self-training approach that generates highly accurate pseudo-labels for unlabeled low-resource speech. Our experimental analysis demonstrates that our approach significantly improves word error rate, achieving a relative improvement of 14.94% compared to a baseline model across four real speech datasets. Further, our proposed approach reports the best results on the Common Voice Punjabi dataset.
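
A schematic self-training loop for low-resource ASR: pseudo-label the unlabeled audio with the current model, keep confident utterances, and retrain. The method names (`transcribe_with_confidence`, `fit`), the confidence threshold, and the number of rounds are assumptions, not the paper's exact recipe:

```python
def self_train(model, labeled_data, unlabeled_audio, rounds=3, threshold=0.9):
    for _ in range(rounds):
        pseudo_labeled = []
        for audio in unlabeled_audio:
            transcript, confidence = model.transcribe_with_confidence(audio)
            if confidence >= threshold:              # keep only high-confidence pseudo-labels
                pseudo_labeled.append((audio, transcript))
        model = model.fit(labeled_data + pseudo_labeled)  # retrain on labeled + pseudo-labeled data
    return model
```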

Conceptualizing Machine Learning for Dynamic Information Retrieval of Electronic Health Record Notes

  • paper_url: http://arxiv.org/abs/2308.08494
  • repo_url: None
  • paper_authors: Sharon Jiang, Shannon Shen, Monica Agrawal, Barbara Lam, Nicholas Kurtzman, Steven Horng, David Karger, David Sontag
  • for: To reduce clinician burnout and make clinical documentation more efficient.
  • methods: Electronic health record (EHR) audit-log data is used as machine-learning supervision for dynamically retrieving relevant patient notes during the documentation process.
  • results: In the emergency department, the method achieves an AUC of 0.963 for predicting which notes will be read in an individual note-writing session, and a user study with clinicians shows it helps them find relevant patient history more efficiently.
    Abstract The large amount of time clinicians spend sifting through patient notes and documenting in electronic health records (EHRs) is a leading cause of clinician burnout. By proactively and dynamically retrieving relevant notes during the documentation process, we can reduce the effort required to find relevant patient history. In this work, we conceptualize the use of EHR audit logs for machine learning as a source of supervision of note relevance in a specific clinical context, at a particular point in time. Our evaluation focuses on the dynamic retrieval in the emergency department, a high acuity setting with unique patterns of information retrieval and note writing. We show that our methods can achieve an AUC of 0.963 for predicting which notes will be read in an individual note writing session. We additionally conduct a user study with several clinicians and find that our framework can help clinicians retrieve relevant information more efficiently. Demonstrating that our framework and methods can perform well in this demanding setting is a promising proof of concept that they will translate to other clinical settings and data modalities (e.g., labs, medications, imaging).
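
An illustrative sketch of using audit logs as supervision: notes opened during a writing session are positives, other candidate notes are negatives, and a classifier scores note relevance. The feature choice, toy data, and model are assumptions, not the paper's actual pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Each row: [note_age_days, same_department, author_is_current_clinician]; label: 1 if the note was read.
X = [[1, 1, 1], [3, 1, 0], [30, 0, 0], [2, 1, 1], [90, 0, 0], [5, 0, 1]]
y = [1, 1, 0, 1, 0, 0]

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]          # relevance score used to rank candidate notes
print("training AUC:", roc_auc_score(y, scores))
```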

Decoding Layer Saliency in Language Transformers

  • paper_url: http://arxiv.org/abs/2308.05219
  • repo_url: None
  • paper_authors: Elizabeth M. Hou, Gregory Castanon
  • for: To propose a strategy for identifying textual saliency in large-scale language models applied to classification tasks.
  • methods: Gradient-based saliency methods are adapted to transformer-stack networks, together with a method for evaluating the degree of semantic coherence of each layer.
  • results: The approach shows consistent improvement over numerous other textual saliency methods on multiple benchmark classification datasets, without additional training or access to labelled data.
    Abstract In this paper, we introduce a strategy for identifying textual saliency in large-scale language models applied to classification tasks. In visual networks where saliency is more well-studied, saliency is naturally localized through the convolutional layers of the network; however, the same is not true in modern transformer-stack networks used to process natural language. We adapt gradient-based saliency methods for these networks, propose a method for evaluating the degree of semantic coherence of each layer, and demonstrate consistent improvement over numerous other methods for textual saliency on multiple benchmark classification datasets. Our approach requires no additional training or access to labelled data, and is comparatively very computationally efficient.
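
A minimal gradient-based token saliency sketch for a transformer classifier (gradient × input on the embedding layer). This is a generic recipe, not the paper's layer-wise method, and the model checkpoint is an arbitrary choice:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("the movie was surprisingly good", return_tensors="pt")
emb = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=emb, attention_mask=enc["attention_mask"]).logits
logits[0, logits.argmax()].backward()          # gradient of the predicted-class logit

saliency = (emb.grad * emb).sum(-1).abs()[0]   # gradient x input, one score per token
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), saliency.tolist()):
    print(f"{token:>12s}  {score:.4f}")
```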

Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

  • paper_url: http://arxiv.org/abs/2308.05081
  • repo_url: None
  • paper_authors: Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min Zhang, Tat-Seng Chua
  • for: Video Semantic Role Labeling (VidSRL), which aims to detect salient events in videos by recognizing predicate-argument event structures and the interrelationships between events.
  • methods: A holistic spatio-temporal scene graph representation (HostSG), built on existing dynamic scene graph structures, models both the fine-grained spatial semantics and the temporal dynamics of videos. On top of HostSG, a niche-targeting VidSRL framework applies a scene-event mapping mechanism and iterative structure refinement.
  • results: On the benchmark dataset, the framework significantly outperforms the current best-performing model; further analyses help explain where the improvements come from.
    Abstract Video Semantic Role Labeling (VidSRL) aims to detect the salient events from given videos, by recognizing the predict-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, they can be mostly subject to two key drawbacks, including the lack of fine-grained spatial scene perception and the insufficiently modeling of video temporality. Towards this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which well model both the fine-grained spatial semantics and temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a nichetargeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation can best coincide with end task demand. Finally, three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework boosts significantly over the current best-performing model. Further analyses are shown for a better understanding of the advances of our methods.

RadGraph2: Modeling Disease Progression in Radiology Reports via Hierarchical Information Extraction

  • paper_url: http://arxiv.org/abs/2308.05046
  • repo_url: None
  • paper_authors: Sameer Khanna, Adam Dejl, Kibo Yoon, Quoc Hung Truong, Hanh Duong, Agustina Saenz, Pranav Rajpurkar
  • for: To introduce RadGraph2, a new dataset for information extraction from radiology reports that captures how disease state and device placement change over time.
  • methods: A hierarchical schema organizes entities according to their relationships; the DyGIE++ framework is modified accordingly, yielding the HGIE model, which is trained with this hierarchy for entity and relation extraction.
  • results: Models trained on RadGraph2 capture a wider variety of findings and perform better at relation extraction than models trained on the original RadGraph dataset, providing a foundation for systems that automatically track disease progression and for information extraction models that exploit the natural label hierarchy in the medical domain.
    Abstract We present RadGraph2, a novel dataset for extracting information from radiology reports that focuses on capturing changes in disease state and device placement over time. We introduce a hierarchical schema that organizes entities based on their relationships and show that using this hierarchy during training improves the performance of an information extraction model. Specifically, we propose a modification to the DyGIE++ framework, resulting in our model HGIE, which outperforms previous models in entity and relation extraction tasks. We demonstrate that RadGraph2 enables models to capture a wider variety of findings and perform better at relation extraction compared to those trained on the original RadGraph dataset. Our work provides the foundation for developing automated systems that can track disease progression over time and develop information extraction models that leverage the natural hierarchy of labels in the medical domain.
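
A small sketch of training with a label hierarchy: each fine-grained entity label is mapped to its ancestors so a model can be supervised or evaluated at multiple levels. The label names below are illustrative placeholders, not the exact RadGraph2 schema:

```python
# Hypothetical hierarchy: fine-grained label -> list of ancestor labels.
HIERARCHY = {
    "observation::present::new": ["observation", "observation::present"],
    "observation::present::worsening": ["observation", "observation::present"],
    "observation::absent": ["observation"],
    "anatomy::definitely_present": ["anatomy"],
}

def label_with_ancestors(fine_label):
    """Return the full path of labels, coarse to fine, for hierarchical supervision."""
    return HIERARCHY.get(fine_label, []) + [fine_label]

print(label_with_ancestors("observation::present::worsening"))
# ['observation', 'observation::present', 'observation::present::worsening']
```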