cs.CL - 2023-09-22

Document Understanding for Healthcare Referrals

  • paper_url: http://arxiv.org/abs/2309.13184
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Jimit Mistry, Natalia M. Arzeno
  • for: Improving the efficiency of healthcare referral management while reducing administrative costs and errors.
  • methods: Proposes a hybrid model combining LayoutLMv3 with domain-specific rules to automatically identify key patient-, physician-, and exam-related information in faxed referral documents.
  • results: Adding domain-specific rules to the transformer model greatly increases precision and F1 scores, suggesting that the hybrid model can improve referral-management efficiency in practice.
    Abstract Reliance on scanned documents and fax communication for healthcare referrals leads to high administrative costs and errors that may affect patient care. In this work we propose a hybrid model leveraging LayoutLMv3 along with domain-specific rules to identify key patient, physician, and exam-related entities in faxed referral documents. We explore some of the challenges in applying a document understanding model to referrals, which have formats varying by medical practice, and evaluate model performance using MUC-5 metrics to obtain appropriate metrics for the practical use case. Our analysis shows the addition of domain-specific rules to the transformer model yields greatly increased precision and F1 scores, suggesting a hybrid model trained on a curated dataset can increase efficiency in referral management.
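A minimal sketch of how domain-specific rules might be layered on top of transformer entity predictions, in the spirit of the hybrid model above. The rule patterns, entity labels, and candidate data are illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical output of a LayoutLMv3-style entity extractor: (label, text) pairs.
candidates = [
    ("PATIENT_DOB", "13/45/1980"),   # implausible date the model extracted
    ("PATIENT_DOB", "03/14/1980"),
    ("PHYSICIAN_NPI", "1234567893"),
    ("EXAM_CODE", "MRI-LUMBAR"),
]

# Domain-specific rules: simple validators per entity type (illustrative only).
RULES = {
    "PATIENT_DOB": re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(19|20)\d{2}$"),
    "PHYSICIAN_NPI": re.compile(r"^\d{10}$"),          # NPI numbers are 10 digits
    "EXAM_CODE": re.compile(r"^[A-Z]{2,5}(-[A-Z]+)?$"),
}

def apply_rules(entities):
    """Keep only entity candidates that satisfy the domain rule for their label."""
    kept = []
    for label, text in entities:
        rule = RULES.get(label)
        if rule is None or rule.match(text.strip()):
            kept.append((label, text.strip()))
    return kept

print(apply_rules(candidates))
# [('PATIENT_DOB', '03/14/1980'), ('PHYSICIAN_NPI', '1234567893'), ('EXAM_CODE', 'MRI-LUMBAR')]
```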

Effective Distillation of Table-based Reasoning Ability from LLMs

  • paper_url: http://arxiv.org/abs/2309.13182
  • repo_url: None
  • paper_authors: Bohao Yang, Chen Tang, Kun Zhao, Chenghao Xiao, Chenghua Lin
  • for: This paper aims to specialize table reasoning skills in smaller models for table-to-text generation tasks.
  • methods: The proposed method uses distillation to transfer specific capabilities of large language models (LLMs) to smaller models, specifically tailored for table-based reasoning.
  • results: The fine-tuned model (Flan-T5-base) achieved significant improvement compared to traditional baselines and outperformed specific LLMs like gpt-3.5-turbo on the scientific table-to-text generation dataset (SciGen).
    Abstract Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their substantial parameter size and high demand for computing resources pose challenges for their practical deployment. Recent research has revealed that specific capabilities of LLMs, such as numerical reasoning, can be transferred to smaller models through distillation. Some studies explore the potential of leveraging LLMs to perform table-based reasoning. Nevertheless, prior to our work, there has been no investigation into the prospect of specialising table reasoning skills in smaller models specifically tailored for table-to-text generation tasks. In this paper, we propose a novel table-based reasoning distillation approach, with the aim of distilling LLMs into tailored, smaller models specifically designed for table-based reasoning tasks. Experimental results have shown that a 0.22 billion parameter model (Flan-T5-base) fine-tuned using distilled data not only achieves a significant improvement compared to traditionally fine-tuned baselines but also surpasses specific LLMs like gpt-3.5-turbo on the scientific table-to-text generation dataset (SciGen). The code and data are released at https://github.com/Bernard-Yang/TableDistill.
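A minimal sketch of the student-side fine-tuning described above: Flan-T5-base trained on teacher-distilled table-to-text data with Hugging Face Transformers. The example pair and hyperparameters are illustrative assumptions; see the released repository for the authors' setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Hypothetical distilled example: linearized table paired with the teacher LLM's
# reasoning-grounded description.
pairs = [
    ("Describe the table. Table: Model | BLEU || Ours | 41.2 || Base | 38.9",
     "Ours improves over Base by 2.3 BLEU (41.2 vs. 38.9)."),
]

optim = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):
    for src, tgt in pairs:
        inputs = tok(src, return_tensors="pt", truncation=True)
        labels = tok(tgt, return_tensors="pt", truncation=True).input_ids
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```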

BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

  • paper_url: http://arxiv.org/abs/2309.13173
  • repo_url: None
  • paper_authors: Mohsinul Kabir, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, M Saiful Bari, Enamul Hoque
  • for: Evaluating the performance of large language models (LLMs) on a low-resource language, Bengali (Bangla).
  • methods: Zero-shot evaluation of ChatGPT, LLaMA-2, and Claude-2 on a set of important and diverse Bangla NLP tasks, including abstractive summarization, question answering, paraphrasing, natural language inference, text classification, and sentiment analysis, with comparison against state-of-the-art fine-tuned models.
  • results: Experiments show that the LLMs perform worse than the fine-tuned models across the Bangla NLP tasks, calling for further work to develop a better understanding of LLMs in low-resource languages such as Bangla.
    Abstract Large Language Models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP) for their impressive skills in language generation and other language-specific tasks. Though LLMs have been evaluated in various tasks, mostly in English, they have not yet undergone thorough evaluation in under-resourced languages such as Bengali (Bangla). In this paper, we evaluate the performance of LLMs for the low-resourced Bangla language. We select various important and diverse Bangla NLP tasks, such as abstractive summarization, question answering, paraphrasing, natural language inference, text classification, and sentiment analysis for zero-shot evaluation with ChatGPT, LLaMA-2, and Claude-2 and compare the performance with state-of-the-art fine-tuned models. Our experimental results demonstrate an inferior performance of LLMs for different Bangla NLP tasks, calling for further effort to develop better understanding of LLMs in low-resource languages like Bangla.

Cardiovascular Disease Risk Prediction via Social Media

  • paper_url: http://arxiv.org/abs/2309.13147
  • repo_url: None
  • paper_authors: Al Zadid Sultan Bin Habib, Md Asif Bin Syed, Md Tanvirul Islam, Donald A. Adjeroh
  • for: Predicting cardiovascular disease (CVD) risk.
  • methods: Uses Twitter data and sentiment analysis to predict CVD risk: a new dictionary of CVD-related keywords is developed, the VADER model performs sentiment analysis, and machine learning models classify users as potentially at risk of CVD.
  • results: Analyzing the emotions expressed in tweets surpasses the predictive power of demographic data alone and enables the identification of individuals at potential risk of developing CVD, highlighting the potential of NLP and machine learning techniques for public health monitoring.
    Abstract Researchers use Twitter and sentiment analysis to predict Cardiovascular Disease (CVD) risk. We developed a new dictionary of CVD-related keywords by analyzing emotions expressed in tweets. Tweets from eighteen US states, including the Appalachian region, were collected. Using the VADER model for sentiment analysis, users were classified as potentially at CVD risk. Machine Learning (ML) models were employed to classify individuals' CVD risk and applied to a CDC dataset with demographic information to make the comparison. Performance evaluation metrics such as Test Accuracy, Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), and Cohen's Kappa (CK) score were considered. Results demonstrated that analyzing tweets' emotions surpassed the predictive power of demographic data alone, enabling the identification of individuals at potential risk of developing CVD. This research highlights the potential of Natural Language Processing (NLP) and ML techniques in using tweets to identify individuals with CVD risks, providing an alternative approach to traditional demographic information for public health monitoring.
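A minimal sketch of the sentiment step described above, combining a CVD keyword lookup with VADER's compound score; the keyword list and threshold are illustrative assumptions, and the full pipeline would feed such signals into the ML classifiers.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

CVD_KEYWORDS = {"chest pain", "hypertension", "heart", "cholesterol"}  # illustrative subset
analyzer = SentimentIntensityAnalyzer()

def flag_potential_risk(tweet, threshold=-0.3):
    """Flag a tweet that mentions a CVD-related keyword and carries negative sentiment."""
    text = tweet.lower()
    mentions_cvd = any(kw in text for kw in CVD_KEYWORDS)
    compound = analyzer.polarity_scores(tweet)["compound"]  # ranges over [-1, 1]
    return mentions_cvd and compound <= threshold

print(flag_potential_risk("Another night of chest pain, I'm exhausted and scared."))
```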

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

  • paper_url: http://arxiv.org/abs/2309.13018
  • repo_url: None
  • paper_authors: Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli
  • for: Compressing a multilingual automatic speech recognition (ASR) model efficiently, yielding either sparse monolingual models or a sparse multilingual model.
  • methods: An adaptive masking approach applied in two scenarios: pruning toward sparse monolingual models, and pruning toward a sparse multilingual model (Dynamic ASR Pathways). The sub-network is adapted dynamically, avoiding premature decisions about a fixed sub-network structure, and adapts from different sub-network initializations.
  • results: The adaptive masking approach outperforms existing pruning methods when targeting sparse monolingual models, and Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model, reducing the need for language-specific pruning.
    Abstract Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
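For context, a minimal sketch of the masking machinery behind pruning: a binary magnitude mask re-estimated each round and applied to the weights. This is generic iterative magnitude pruning, not the paper's specific adaptation strategy; the layer size and sparsity schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

def magnitude_mask(weight, sparsity):
    """Binary mask that zeroes the smallest-magnitude fraction of weights."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

layer = nn.Linear(256, 256)
for round_idx, sparsity in enumerate([0.3, 0.5, 0.7]):
    mask = magnitude_mask(layer.weight.data, sparsity)  # re-estimated each round
    layer.weight.data.mul_(mask)                        # apply the mask
    # ... fine-tune on the target language(s) here, keeping masked weights at zero ...
    print(f"round {round_idx}: density = {mask.mean().item():.2f}")
```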

Nested Event Extraction upon Pivot Element Recognition

  • paper_url: http://arxiv.org/abs/2309.12960
  • repo_url: None
  • paper_authors: Weicheng Ren, Zixuan Li, Xiaolong Jin, Long Bai, Miao Su, Yantao Liu, Saiping Guan, Jiafeng Guo, Xueqi Cheng
  • for: Improving the extraction of complex nested event structures by handling Pivot Elements (PEs), whose dual identities existing methods cannot cope with well.
  • methods: PerNee first recognizes the triggers of both inner and outer events, then identifies PEs by classifying the relation type between trigger pairs; prompt learning incorporates event-type and argument-role information to obtain better trigger and argument representations and further improve NEE performance.
  • results: PerNee achieves state-of-the-art performance on ACE2005-Nest, Genia11, and Genia13.
    Abstract Nested Event Extraction (NEE) aims to extract complex event structures where an event contains other events as its arguments recursively. Nested events involve a kind of Pivot Elements (PEs) that simultaneously act as arguments of outer events and as triggers of inner events, and thus connect them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot well cope with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner and outer events and further recognizes the PEs via classifying the relation type between trigger pairs. In order to obtain better representations of triggers and arguments to further improve NEE performance, it incorporates the information of both event types and argument roles into PerNee through prompt learning. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in generic domain and construct a new NEE dataset, namely ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11 and Genia13.

TopRoBERTa: Topology-Aware Authorship Attribution of Deepfake Texts

  • paper_url: http://arxiv.org/abs/2309.12934
  • repo_url: None
  • paper_authors: Adaku Uchendu, Thai Le, Dongwon Lee
  • for: Developing a computational method for authorship attribution of deepfake texts (LLM-generated texts) to help mitigate their spread at scale.
  • methods: Adds a Topological Data Analysis (TDA) layer to RoBERTa to capture the shape and structure of the data alongside the contextual (semantic and syntactic) features, improving authorship attribution.
  • results: TopRoBERTa outperforms vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7% increase in Macro F1 score.
    Abstract Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended high-quality texts, that are non-trivial to distinguish from human-written texts. We refer to such LLM-generated texts as \emph{deepfake texts}. There are currently over 11K text generation models in the huggingface model repo. As such, users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired--i.e., Turing Test (TT). In particular, in this work, we investigate the more general version of the problem, known as \emph{Authorship Attribution (AA)}, in a multi-class setting--i.e., not only determining if a given text is a deepfake text or not but also being able to pinpoint which LLM is the author. We propose \textbf{TopRoBERTa} to improve existing AA solutions by capturing more linguistic patterns in deepfake texts by including a Topological Data Analysis (TDA) layer in the RoBERTa model. We show the benefits of having a TDA layer when dealing with noisy, imbalanced, and heterogeneous datasets, by extracting TDA features from the reshaped $pooled\_output$ of RoBERTa as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features), while using TDA to capture the shape and structure of data (i.e., linguistic structures). Finally, \textbf{TopRoBERTa}, outperforms the vanilla RoBERTa in 2/3 datasets, achieving up to 7\% increase in Macro F1 score.
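A rough sketch of how topological features could be extracted from a reshaped pooled representation, assuming the ripser library for persistent homology; the reshape dimensions and summary statistics are illustrative assumptions rather than the authors' exact TDA layer.

```python
import numpy as np
from ripser import ripser

def tda_features(pooled_output, points=64):
    """Reshape a pooled vector into a point cloud and summarize its persistence diagrams."""
    vec = np.asarray(pooled_output, dtype=float)
    cloud = vec.reshape(points, -1)              # e.g., 768 dims -> (64, 12) point cloud
    diagrams = ripser(cloud, maxdim=1)["dgms"]   # H0 and H1 persistence diagrams
    feats = []
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm).all(axis=1)]
        lifetimes = finite[:, 1] - finite[:, 0]
        longest = float(lifetimes.max()) if len(lifetimes) else 0.0
        feats += [float(lifetimes.sum()), longest, float(len(finite))]
    return np.array(feats)  # would be concatenated with RoBERTa features before the classifier

print(tda_features(np.random.randn(768)).shape)  # (6,)
```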

PopBERT. Detecting populism and its host ideologies in the German Bundestag

  • paper_url: http://arxiv.org/abs/2309.14355
  • repo_url: None
  • paper_authors: L. Erhard, S. Hanke, U. Remer, A. Falenska, R. Heiberger
  • for: Providing a reliable, valid, and scalable approach to measuring populist language.
  • methods: An annotated dataset is built from parliamentary speeches of the German Bundestag (2013 to 2021), and a transformer-based model (PopBERT) is trained as a multilabel classifier to detect and quantify each dimension of populist language.
  • results: Validation checks show that PopBERT has strong predictive accuracy, high qualitative face validity, matches party rankings from expert surveys, and correctly detects out-of-sample text snippets. PopBERT enables dynamic analyses of how German-speaking politicians and parties use populist language, and the annotator-level data can support cross-domain applications or related classifiers.
    Abstract The rise of populism concerns many political scientists and practitioners, yet the detection of its underlying language remains fragmentary. This paper aims to provide a reliable, valid, and scalable approach to measure populist stances. For that purpose, we created an annotated dataset based on parliamentary speeches of the German Bundestag (2013 to 2021). Following the ideational definition of populism, we label moralizing references to the virtuous people or the corrupt elite as core dimensions of populist language. To identify, in addition, how the thin ideology of populism is thickened, we annotate how populist statements are attached to left-wing or right-wing host ideologies. We then train a transformer-based model (PopBERT) as a multilabel classifier to detect and quantify each dimension. A battery of validation checks reveals that the model has a strong predictive accuracy, provides high qualitative face validity, matches party rankings of expert surveys, and detects out-of-sample text snippets correctly. PopBERT enables dynamic analyses of how German-speaking politicians and parties use populist language as a strategic device. Furthermore, the annotator-level data may also be applied in cross-domain applications or to develop related classifiers.
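A minimal sketch of a multilabel setup of the kind described above, using the Transformers problem_type="multi_label_classification" head; the dimension names and base checkpoint are assumptions for illustration, and the head is untrained until fine-tuned on the annotated speeches.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["people_centrism", "anti_elitism", "left_host_ideology", "right_host_ideology"]  # assumed names

tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # independent sigmoid per label
)

def score(sentence, threshold=0.5):
    inputs = tok(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return {label: bool(p >= threshold) for label, p in zip(LABELS, probs)}

print(score("Die Eliten ignorieren den Willen des Volkes."))  # meaningless until fine-tuned
```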

Affect Recognition in Conversations Using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12881
  • repo_url: None
  • paper_authors: Shutong Feng, Guangzhi Sun, Nurul Lubis, Chao Zhang, Milica Gašić
  • for: Investigating the capacity of large language models (LLMs) to recognise human affect in conversations, covering both open-domain chit-chat and task-oriented dialogues.
  • methods: Three diverse datasets are used (IEMOCAP, EmoWOZ, and DAIC-WOZ), spanning casual conversations to clinical interviews. LLMs are evaluated and compared through zero-shot and few-shot in-context learning as well as task-specific fine-tuning.
  • results: LLMs show some capability for affect recognition, though their performance is affected by automatic speech recognition (ASR) errors; the study sheds light on the extent to which LLMs can replicate human-like affect recognition in conversations.
    Abstract Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.

StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors

  • paper_url: http://arxiv.org/abs/2309.12810
  • repo_url: None
  • paper_authors: Inez Okulska, Daria Stetsenko, Anna Kołos, Agnieszka Karlińska, Kinga Głąbińska, Adam Nowakowski
  • for: Providing an overview of the open-source multilingual tool StyloMetrix, which offers stylometric text representations covering grammar, syntax, and lexicon for four languages: Polish (the primary language), English, Ukrainian, and Russian.
  • methods: StyloMetrix produces normalized vectors of linguistic features for each text; these vectors are evaluated as input to supervised machine learning classifiers and as an addition to the embeddings layer of deep learning models.
  • results: Experiments show promising results in supervised content classification with simple algorithms such as Random Forest Classifier, Voting Classifier, and Logistic Regression, and the StyloMetrix vectors also enhance an embedding layer extracted from Transformer architectures.
    Abstract This work aims to provide an overview on the open-source multilanguage tool called StyloMetrix. It offers stylometric text representations that cover various aspects of grammar, syntax and lexicon. StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian and Russian. The normalized output of each feature can become a fruitful course for machine learning models and a valuable addition to the embeddings layer for any deep learning algorithm. We strive to provide a concise, but exhaustive overview on the application of the StyloMetrix vectors as well as explain the sets of the developed linguistic features. The experiments have shown promising results in supervised content classification with simple algorithms as Random Forest Classifier, Voting Classifier, Logistic Regression and others. The deep learning assessments have unveiled the usefulness of the StyloMetrix vectors at enhancing an embedding layer extracted from Transformer architectures. The StyloMetrix has proven itself to be a formidable source for the machine learning and deep learning algorithms to execute different classification tasks.

ChatPRCS: A Personalized Support System for English Reading Comprehension based on ChatGPT

  • paper_url: http://arxiv.org/abs/2309.12808
  • repo_url: None
  • paper_authors: Xizhe Wang, Yihua Zhong, Changqin Huang, Xiaodi Huang
  • for: Improving students' English reading comprehension.
  • methods: Builds on large language model techniques (ChatGPT), combining reading comprehension proficiency prediction, question generation, and automatic evaluation, among others.
  • results: Experiments show that ChatPRCS provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.
    Abstract As a common approach to learning English, reading comprehension primarily entails reading articles and answering related questions. However, the complexity of designing effective exercises results in students encountering standardized questions, making it challenging to align with individualized learners' reading comprehension ability. By leveraging the advanced capabilities offered by large language models, exemplified by ChatGPT, this paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS, based on the Zone of Proximal Development theory. ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation, among others, to enhance reading comprehension instruction. First, we develop a new algorithm that can predict learners' reading comprehension abilities using their historical data as the foundation for generating questions at an appropriate level of difficulty. Second, a series of new ChatGPT prompt patterns is proposed to address two key aspects of reading comprehension objectives: question generation, and automated evaluation. These patterns further improve the quality of generated questions. Finally, by integrating personalized ability and reading comprehension prompt patterns, ChatPRCS is systematically validated through experiments. Empirical results demonstrate that it provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.

Furthest Reasoning with Plan Assessment: Stable Reasoning Path with Retrieval-Augmented Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12767
  • repo_url: None
  • paper_authors: Yin Zhu, Zhiling Luo, Gong Cheng
  • for: Addressing two main flaws of existing multi-hop question answering (MHQA) pipelines: the information retriever (IR) is hindered by low-quality queries generated by the LLM, and the LLM is misguided by irrelevant knowledge returned by the IR.
  • methods: Proposes a new pipeline, Furthest-Reasoning-with-Plan-Assessment (FuRePA), consisting of an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). Furthest Reasoning masks the previous reasoning path and generated queries, encouraging the LLM to generate its chain of thought from scratch in each iteration; the Plan Assessor is a trained evaluator that selects an appropriate plan from the candidate plans proposed by the LLM.
  • results: Evaluated on three highly recognized public multi-hop question answering datasets, FuRePA outperforms the state of the art on most metrics, achieving a 10%-12% improvement in answer accuracy.
    Abstract Large Language Models (LLMs), acting as a powerful reasoner and generator, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ LLM to generate reasoning paths and plans, and utilize IR to iteratively retrieve related knowledge, but these approaches have inherent flaws. On one hand, Information Retriever (IR) is hindered by the low quality of generated queries by LLM. On the other hand, LLM is easily misguided by the irrelevant knowledge by IR. These inaccuracies, accumulated by the iterative interaction between IR and LLM, lead to a disaster in effectiveness at the end. To overcome above barriers, in this paper, we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), including an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest reasoning operates by masking previous reasoning path and generated queries for LLM, encouraging LLM generating chain of thought from scratch in each iteration. This approach enables LLM to break the shackle built by previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform state-of-the-art on most metrics (achieving a 10%-12% in answer accuracy).

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

  • paper_url: http://arxiv.org/abs/2309.12763
  • repo_url: None
  • paper_authors: Asad Ullah, Alessandro Ragano, Andrew Hines
  • for: Improving self-supervised representation learning (SSRL) models for low-resource languages, evaluated on downstream phoneme recognition.
  • methods: SSRL models are pre-trained with audio augmentation and evaluated on phoneme recognition. The augmentation techniques compared systematically are pitch variation, noise addition, accented target-language speech, and other-language speech; combined augmentation (noise/pitch) is the best strategy, outperforming accent and language knowledge transfer.
  • results: Performance is compared across various quantities and types of pre-training data, including the scaling factor of augmented data needed to match models pre-trained on target-domain speech. The findings suggest that, for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
    Abstract Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme recognition versus supervised models. Training SSRL models requires a large amount of pre-training data and this poses a challenge for low resource languages. A common approach is transferring knowledge from other languages. Instead, we propose to use audio augmentation to pre-train SSRL models in a low resource condition and evaluate phoneme recognition as downstream task. We performed a systematic comparison of augmentation techniques, namely: pitch variation, noise addition, accented target-language speech and other language speech. We found combined augmentations (noise/pitch) was the best augmentation strategy outperforming accent and language knowledge transfer. We compared the performance with various quantities and types of pre-training data. We examined the scaling factor of augmented data to achieve equivalent performance to models pre-trained with target domain speech. Our findings suggest that for resource constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other language speech.
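A minimal sketch of the two augmentations that worked best in combination (noise and pitch variation), using librosa; the pitch-shift amount and noise level are illustrative assumptions.

```python
import numpy as np
import librosa

def augment(path, n_steps=2, noise_level=0.005):
    """Apply pitch variation and additive Gaussian noise to a waveform."""
    y, sr = librosa.load(path, sr=16000)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # pitch variation
    y = y + noise_level * np.random.randn(len(y))               # noise addition
    return np.clip(y, -1.0, 1.0), sr

# augmented, sr = augment("utterance.wav")  # would be fed into SSRL pre-training
```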

Semantic similarity prediction is better than other semantic similarity measures

  • paper_url: http://arxiv.org/abs/2309.12697
  • repo_url: https://github.com/aieng-lab/stsscore
  • paper_authors: Steffen Herbold
  • for: Measuring the semantic similarity between natural language texts.
  • methods: Uses a model fine-tuned on STS-B (from the GLUE benchmark) to directly predict the similarity.
  • results: The resulting measure (STSScore) is better aligned with expectations of a robust semantic similarity measure than other approaches.
    Abstract Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the STS-B from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
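A minimal sketch of predicting similarity directly with a model fine-tuned on STS-B, in the spirit of STSScore; the off-the-shelf cross-encoder checkpoint is a stand-in for the authors' model (see the linked repository for their exact setup).

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")  # regression head fine-tuned on STS-B

pairs = [
    ("A man is playing a guitar.", "Someone is strumming a guitar."),
    ("A man is playing a guitar.", "The stock market fell sharply today."),
]
scores = model.predict(pairs)  # higher means more semantically similar
for (a, b), s in zip(pairs, scores):
    print(f"{s:.2f}  {a} <-> {b}")
```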

AMPLIFY: Attention-based Mixup for Performance Improvement and Label Smoothing in Transformer

  • paper_url: http://arxiv.org/abs/2309.12689
  • repo_url: https://github.com/kiwi-lilo/amplify
  • paper_authors: Leixin Yang, Yaping Zhang, Haoyu Xiong, Yu Xiang
  • for: Improving text classification performance while reducing model sensitivity to noise and outliers.
  • methods: Proposes a new Mixup method, AMPLIFY, that uses the Transformer's own attention mechanism to reduce the influence of noise and aberrant values in the original samples, without adding trainable parameters and at very low computational cost.
  • results: On 7 benchmark datasets, AMPLIFY outperforms other Mixup methods on text classification tasks at a smaller computational cost.
    Abstract Mixup is an effective data augmentation method that generates new augmented samples by aggregating linear combinations of different original samples. However, if there are noises or aberrant features in the original samples, Mixup may propagate them to the augmented samples, leading to over-sensitivity of the model to these outliers . To solve this problem, this paper proposes a new Mixup method called AMPLIFY. This method uses the Attention mechanism of Transformer itself to reduce the influence of noises and aberrant values in the original samples on the prediction results, without increasing additional trainable parameters, and the computational cost is very low, thereby avoiding the problem of high resource consumption in common Mixup methods such as Sentence Mixup . The experimental results show that, under a smaller computational resource cost, AMPLIFY outperforms other Mixup methods in text classification tasks on 7 benchmark datasets, providing new ideas and new ways to further improve the performance of pre-trained models based on the Attention mechanism, such as BERT, ALBERT, RoBERTa, and GPT. Our code can be obtained at https://github.com/kiwi-lilo/AMPLIFY.
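For context, a minimal sketch of vanilla Mixup on sentence representations, the baseline idea AMPLIFY builds on; AMPLIFY instead reweights the combination using the Transformer's own attention, which this sketch does not implement (see the linked repository for the authors' code).

```python
import torch

def mixup(h_a, h_b, y_a, y_b, alpha=0.2):
    """Vanilla Mixup: convex combination of two hidden states and their label vectors."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    h_mix = lam * h_a + (1 - lam) * h_b
    y_mix = lam * y_a + (1 - lam) * y_b
    return h_mix, y_mix

h_a, h_b = torch.randn(768), torch.randn(768)                  # e.g., [CLS] representations
y_a, y_b = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])  # one-hot labels
h_mix, y_mix = mixup(h_a, h_b, y_a, y_b)
```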

JCoLA: Japanese Corpus of Linguistic Acceptability

  • paper_url: http://arxiv.org/abs/2309.12676
  • repo_url: https://github.com/osekilab/jcola
  • paper_authors: Taiga Someya, Yushi Sugimoto, Yohei Oseki
  • for: Evaluating the syntactic knowledge of different types of Japanese language models on linguistic acceptability.
  • methods: JCoLA consists of 10,020 sentences manually annotated with binary acceptability judgments: 86% are in-domain data (relatively simple judgments drawn from linguistics textbooks and handbooks) and 14% are out-of-domain data (theoretically significant judgments drawn from journal articles, categorized by 12 linguistic phenomena). The dataset is used to evaluate the syntactic knowledge of 9 different types of Japanese language models.
  • results: Several models surpass human performance on the in-domain data, while no model exceeds human performance on the out-of-domain data. Error analyses by linguistic phenomena show that neural language models handle local syntactic dependencies such as argument structure well, but their performance wanes on long-distance dependencies such as verbal agreement and NPI licensing.
    Abstract Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, so that various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86 %; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14 %; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
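A minimal sketch of the kind of acceptability scoring typically used in such evaluations: a causal language model's mean per-token log-likelihood serves as the acceptability score, and the sentence of a minimal pair with the higher score is taken as the model's judgment. The checkpoint here is a stand-in (JCoLA evaluates Japanese language models).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; a Japanese causal LM would be used for JCoLA
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def acceptability_score(sentence):
    """Mean per-token log-likelihood; higher means the model finds the sentence more acceptable."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

print(acceptability_score("The cat sat on the mat."))
print(acceptability_score("Cat the mat the on sat."))  # should score lower
```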

HRoT: Hybrid prompt strategy and Retrieval of Thought for Table-Text Hybrid Question Answering

  • paper_url: http://arxiv.org/abs/2309.12669
  • repo_url: None
  • paper_authors: Tongxu Luo, Fangyu Lei, Jiahe Lei, Weihao Liu, Shihu He, Jun Zhao, Kang Liu
  • for: Answering numerical questions over hybrid content from tables and text (TextTableQA).
  • methods: Uses Large Language Models (LLMs) with In-Context Learning and Chain-of-Thought prompting, introducing a hybrid prompt strategy and Retrieval of Thought.
  • results: The method outperforms the fully-supervised state of the art on the MultiHiertt dataset in the few-shot setting.
    Abstract Answering numerical questions over hybrid contents from the given tables and text(TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have gained significant attention in the NLP community. With the emergence of large language models, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy called Hybrid prompt strategy and Retrieval of Thought for TextTableQA. Through In-Context Learning, we prompt the model to develop the ability of retrieval thinking when dealing with hybrid data. Our method achieves superior performance compared to the fully-supervised SOTA on the MultiHiertt dataset in the few-shot setting.

Decoding Affect in Dyadic Conversations: Leveraging Semantic Similarity through Sentence Embedding

  • paper_url: http://arxiv.org/abs/2309.12646
  • repo_url: None
  • paper_authors: Chen-Wei Yu, Yun-Shiuan Chuang, Alexandros N. Lotsos, Claudia M. Haase
  • for: Exploring the use of sentence embeddings in analyzing real-world dyadic interactions and predicting the affect of conversational participants.
  • methods: A Transformer-based model (all-MiniLM-L6-v2) is used to obtain the embeddings of utterances from each speaker in 50 married couples' conversations about conflicts and pleasant activities; overall conversation similarity is quantified as the average cosine similarity between embeddings of adjacent utterances.
  • results: Semantic similarity has a positive association with wives' affect during conflict conversations, but not with husbands' affect or during pleasant conversations.
    Abstract Recent advancements in Natural Language Processing (NLP) have highlighted the potential of sentence embeddings in measuring semantic similarity. Yet, its application in analyzing real-world dyadic interactions and predicting the affect of conversational participants remains largely uncharted. To bridge this gap, the present study utilizes verbal conversations within 50 married couples talking about conflicts and pleasant activities. Transformer-based model all-MiniLM-L6-v2 was employed to obtain the embeddings of the utterances from each speaker. The overall similarity of the conversation was then quantified by the average cosine similarity between the embeddings of adjacent utterances. Results showed that semantic similarity had a positive association with wives' affect during conflict (but not pleasant) conversations. Moreover, this association was not observed with husbands' affect regardless of conversation types. Two validation checks further provided support for the validity of the similarity measure and showed that the observed patterns were not mere artifacts of data. The present study underscores the potency of sentence embeddings in understanding the association between interpersonal dynamics and individual affect, paving the way for innovative applications in affective and relationship sciences.
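A minimal sketch of the similarity computation described above: embed each utterance with all-MiniLM-L6-v2 and average the cosine similarity of adjacent utterances; the example dialogue is invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def conversation_similarity(utterances):
    """Average cosine similarity between embeddings of adjacent utterances."""
    emb = model.encode(utterances, convert_to_tensor=True)
    sims = [util.cos_sim(emb[i], emb[i + 1]).item() for i in range(len(emb) - 1)]
    return sum(sims) / len(sims)

dialogue = [
    "I felt like you weren't listening to me yesterday.",
    "I was listening, I just didn't know what to say.",
    "It would help if you at least acknowledged how I felt.",
]
print(round(conversation_similarity(dialogue), 3))
```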

Learning to Diversify Neural Text Generation via Degenerative Model

  • paper_url: http://arxiv.org/abs/2309.12619
  • repo_url: None
  • paper_authors: Jimin Hong, ChaeHun Park, Jaegul Choo
  • for: Improving the diversity and informativeness of neural text generation to broaden its applicability.
  • methods: A new approach based on the observation that models primarily learn attributes within examples that are likely to cause degeneration. Two models are trained: first a model designed to amplify undesirable patterns, then a second model whose diversity is enhanced by focusing on the patterns the first model fails to learn.
  • results: Extensive experiments on two tasks, language modeling and dialogue generation, demonstrate the effectiveness of the approach.
    Abstract Neural language models often fail to generate diverse and informative texts, limiting their applicability in real-world problems. While previous approaches have proposed to address these issues by identifying and penalizing undesirable behaviors (e.g., repetition, overuse of frequent words) from language models, we propose an alternative approach based on an observation: models primarily learn attributes within examples that are likely to cause degeneration problems. Based on this observation, we propose a new approach to prevent degeneration problems by training two models. Specifically, we first train a model that is designed to amplify undesirable patterns. We then enhance the diversity of the second model by focusing on patterns that the first model fails to learn. Extensive experiments on two tasks, namely language modeling and dialogue generation, demonstrate the effectiveness of our approach.

Unlocking Model Insights: A Dataset for Automated Model Card Generation

  • paper_url: http://arxiv.org/abs/2309.12616
  • repo_url: None
  • paper_authors: Shruti Singh, Hitesh Lodwal, Husain Malwat, Rakesh Thakur, Mayank Singh
  • for: Improving the understanding of ML models' capabilities, intended usage, and development cycle by automating model card generation.
  • methods: Introduces a dataset of 500 question-answer pairs covering 25 ML models, addressing crucial aspects such as training configurations, datasets, biases, architecture details, and training resources, with answers extracted from the original papers by annotators.
  • results: Initial experiments with ChatGPT-3.5, LLaMa, and Galactica show a significant gap in these LMs' understanding of research papers and in generating factual textual responses; the dataset can be used to train models that automate model card generation from paper text and reduce human effort in model card curation.
    Abstract Language models (LMs) are no longer restricted to ML community, and instruction-tuned LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a popular practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 ML models that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. Our initial experiments with ChatGPT-3.5, LLaMa, and Galactica showcase a significant gap in the understanding of research papers by these aforementioned LMs as well as generating factual textual responses. We posit that our dataset can be used to train models to automate the generation of model cards from paper text and reduce human effort in the model card curation process. The complete dataset is available on https://osf.io/hqt7p/?view_only=3b9114e3904c4443bcd9f5c270158d37

Is it Possible to Modify Text to a Target Readability Level? An Initial Investigation Using Zero-Shot Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12551
  • repo_url: None
  • paper_authors: Asma Farajidizaji, Vatsal Raina, Mark Gales
  • for: Proposing a novel text modification task in which the readability of a text is adjusted to an absolute target level, independently of the source text.
  • methods: Baselines use ChatGPT and Llama-2, plus an extension with a two-step process (generating paraphrases by passing through the language model twice); novel readability-controlled text modification metrics are introduced.
  • results: The zero-shot approaches can push the readability of paraphrases in the desired direction, but the final readability remains correlated with that of the original text; greater shifts in readability also produce greater drops in semantic and lexical similarity between the source and target texts.
    Abstract Text simplification is a common task where the text is adapted to make it easier to understand. Similarly, text elaboration can make a passage more sophisticated, offering a method to control the complexity of reading comprehension tests. However, text simplification and elaboration tasks are limited to only relatively alter the readability of texts. It is useful to directly modify the readability of any text to an absolute target readability level to cater to a diverse audience. Ideally, the readability of readability-controlled generated text should be independent of the source text. Therefore, we propose a novel readability-controlled text modification task. The task requires the generation of 8 versions at various target readability levels for each input text. We introduce novel readability-controlled text modification metrics. The baselines for this task use ChatGPT and Llama-2, with an extension approach introducing a two-step process (generating paraphrases by passing through the language model twice). The zero-shot approaches are able to push the readability of the paraphrases in the desired direction but the final readability remains correlated with the original text's readability. We also find greater drops in semantic and lexical similarity between the source and target texts with greater shifts in the readability.
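A minimal sketch of checking whether a generated paraphrase moved toward a target readability level, using the Flesch-Kincaid grade via textstat as the readability measure; the metric choice and the example paraphrase are illustrative assumptions, and the paraphrase generation itself (prompting an LLM) is omitted.

```python
import textstat

def readability_shift(source, paraphrase, target_grade):
    """Report source/paraphrase grade levels and whether the paraphrase moved toward the target."""
    src_grade = textstat.flesch_kincaid_grade(source)
    par_grade = textstat.flesch_kincaid_grade(paraphrase)
    moved_toward_target = abs(par_grade - target_grade) < abs(src_grade - target_grade)
    return src_grade, par_grade, moved_toward_target

source = ("Photosynthesis is the biochemical process by which chlorophyll-containing "
          "organisms convert luminous energy into chemical energy.")
paraphrase = "Plants use sunlight to make their own food."  # hypothetical LLM output
print(readability_shift(source, paraphrase, target_grade=4))
```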

Automatic Answerability Evaluation for Question Generation

  • paper_url: http://arxiv.org/abs/2309.12546
  • repo_url: None
  • paper_authors: Zifan Wang, Kotaro Funakoshi, Manabu Okumura
  • for: Proposing a new automatic evaluation metric that assesses whether generated questions are answerable by the reference answers.
  • methods: A Prompting-based Metric on ANswerability (PMAN) that prompts an LLM to judge, given a generated question and the reference answer, whether the question is answerable.
  • results: Extensive experiments show that the metric's evaluations are reliable and align with human judgments; applied to question generation models, the metric complements conventional metrics, and a ChatGPT-based QG implementation achieves state-of-the-art performance in generating answerable questions.
    Abstract Conventional automatic evaluation metrics, such as BLEU and ROUGE, developed for natural language generation (NLG) tasks, are based on measuring the n-gram overlap between the generated and reference text. These simple metrics may be insufficient for more complex tasks, such as question generation (QG), which requires generating questions that are answerable by the reference answers. Developing a more sophisticated automatic evaluation metric, thus, remains as an urgent problem in QG research. This work proposes a Prompting-based Metric on ANswerability (PMAN), a novel automatic evaluation metric to assess whether the generated questions are answerable by the reference answers for the QG tasks. Extensive experiments demonstrate that its evaluation results are reliable and align with human evaluations. We further apply our metric to evaluate the performance of QG models, which shows our metric complements conventional metrics. Our implementation of a ChatGPT-based QG model achieves state-of-the-art (SOTA) performance in generating answerable questions.
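A minimal sketch of how such a prompting-based answerability check might be wired up; the prompt wording and the ask_llm helper (standing in for any chat-model API call) are illustrative assumptions, not the authors' exact prompt.

```python
PROMPT_TEMPLATE = """You are evaluating a generated question.
Reference answer: "{answer}"
Generated question: "{question}"
Is the generated question answerable by the reference answer? Reply with YES or NO."""

def answerability(question, answer, ask_llm):
    """Return 1 if the LLM judges the question answerable by the reference answer, else 0."""
    prompt = PROMPT_TEMPLATE.format(answer=answer, question=question)
    reply = ask_llm(prompt)  # ask_llm: any callable wrapping an LLM chat API
    return 1 if reply.strip().upper().startswith("YES") else 0

# Trivial stand-in "LLM" so the sketch runs end to end.
fake_llm = lambda prompt: "YES" if "1989" in prompt else "NO"
print(answerability("In which year did the Berlin Wall fall?", "1989", fake_llm))  # 1
```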