cs.CL - 2023-07-05

To be or not to be: a translation reception study of a literary text translated into Dutch and Catalan using machine translation

  • paper_url: http://arxiv.org/abs/2307.02358
  • repo_url: None
  • paper_authors: Ana Guerberof Arenas, Antonio Toral
  • for: This study examines the reception of a fictional story in translation, comparing reader responses under three conditions: machine translation (MT), post-editing (PE), and translation from scratch (HT).
  • methods: 223 participants rated the reading conditions on three scales: Narrative Engagement, Enjoyment, and Translation Reception.
  • results: In Catalan, HT scored higher than PE and MT on Narrative Engagement, Enjoyment, and Translation Reception. Dutch readers scored PE higher than both HT and MT, and reported the highest engagement and enjoyment when reading the original English version. The authors hypothesize that reception depends not only on the translation condition and quality but also on participants' reading patterns, reading language, and perhaps the status of the language in their own societies.
    Abstract This article presents the results of a study involving the reception of a fictional story by Kurt Vonnegut translated from English into Catalan and Dutch in three conditions: machine-translated (MT), post-edited (PE) and translated from scratch (HT). 223 participants were recruited who rated the reading conditions using three scales: Narrative Engagement, Enjoyment and Translation Reception. The results show that HT presented higher engagement, enjoyment and translation reception in Catalan compared to PE and MT. However, the Dutch readers show higher scores in PE than in both HT and MT, and the highest engagement and enjoyment scores are reported when reading the original English version. We hypothesize that when reading a fictional story in translation, not only the condition and the quality of the translations are key to understanding its reception, but also the participants' reading patterns, reading language, and, perhaps, language status in their own societies.

MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain

  • paper_url: http://arxiv.org/abs/2307.02340
  • repo_url: None
  • paper_authors: Timo Pierre Schrader, Teresa Bürkle, Sophie Henning, Sherry Tan, Matteo Finco, Stefan Grünewald, Maira Indrikova, Felix Hildebrand, Annemarie Friedrich
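  • for: This paper aims to improve the processing of materials science research literature.
  • methods: The authors adapt Argumentative Zoning (AZ) to materials science and release a dataset of 50 manually annotated research articles, labeled with a materials-science-focused multi-label annotation scheme.
  • results: Domain-specific pre-trained transformer-based text encoders are key to high classification performance, and AZ categories from existing datasets in other domains transfer to materials science to varying degrees.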
    Abstract Scientific publications follow conventionalized rhetorical structures. Classifying the Argumentative Zone (AZ), e.g., identifying whether a sentence states a Motivation, a Result or Background information, has been proposed to improve processing of scholarly documents. In this work, we adapt and extend this idea to the domain of materials science research. We present and release a new dataset of 50 manually annotated research articles. The dataset spans seven sub-topics and is annotated with a materials-science focused multi-label annotation scheme for AZ. We detail corpus statistics and demonstrate high inter-annotator agreement. Our computational experiments show that using domain-specific pre-trained transformer-based text encoders is key to high classification performance. We also find that AZ categories from existing datasets in other domains are transferable to varying degrees.

Utilizing ChatGPT Generated Data to Retrieve Depression Symptoms from Social Media

  • paper_url: http://arxiv.org/abs/2307.02313
  • repo_url: None
  • paper_authors: Ana-Maria Bucur
  • for: This work presents the BLUE team's contribution to the eRisk Lab task of retrieving and ranking Reddit sentences that convey symptoms of depression.
  • methods: ChatGPT is used to generate synthetic data for each BDI-II symptom, with a prompt designed so that the generated data is semantically more diverse and includes the emotional, anecdotal experiences typical of sharing on Reddit. Sentences are ranked for relevance by semantic search with cosine similarity.
  • results: Sentence embeddings from a model designed for semantic search outperform embeddings from a model pre-trained on mental health data. The generated synthetic data proved too specific for this task; the approach relying only on the original BDI-II responses performed best.
    Abstract In this work, we present the contribution of the BLUE team in the eRisk Lab task on searching for symptoms of depression. The task consists of retrieving and ranking Reddit social media sentences that convey symptoms of depression from the BDI-II questionnaire. Given that synthetic data provided by LLMs have been proven to be a reliable method for augmenting data and fine-tuning downstream models, we chose to generate synthetic data using ChatGPT for each of the symptoms of the BDI-II questionnaire. We designed a prompt such that the generated data contains more richness and semantic diversity than the BDI-II responses for each question and, at the same time, contains emotional and anecdotal experiences that are specific to the more intimate way of sharing experiences on Reddit. We perform semantic search and rank the sentences' relevance to the BDI-II symptoms by cosine similarity. We used two state-of-the-art transformer-based models (MentalRoBERTa and a variant of MPNet) for embedding the social media posts, the original and generated responses of the BDI-II. Our results show that using sentence embeddings from a model designed for semantic search outperforms the approach using embeddings from a model pre-trained on mental health data. Furthermore, the generated synthetic data proved too specific for this task; the approach simply relying on the BDI-II responses had the best performance.
    Summary This work presents the BLUE team's contribution to the eRisk Lab task on searching for symptoms of depression. The goal of the task is to retrieve and rank Reddit sentences that express the depression symptoms listed in the BDI-II questionnaire. Because LLM-generated synthetic data has been shown to be a reliable way to augment data and fine-tune downstream models, ChatGPT is used to generate data for each BDI-II symptom. The prompt is designed so that the generated data is richer and more semantically diverse than the BDI-II responses while carrying the emotional, personal tone typical of sharing on Reddit. Sentences are ranked by cosine similarity in a semantic search. Two transformer-based models (MentalRoBERTa and a variant of MPNet) embed the social media posts and the original and generated BDI-II responses. Embeddings from the model designed for semantic search outperform those from the model pre-trained on mental health data. The generated synthetic data turned out to be too specific for this task, and relying on the BDI-II responses alone gave the best performance.
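    The ranking step described in this entry (semantic search by cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the names `cosine_similarity` and `rank_sentences` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(query_vec, sentence_vecs):
    """Return (sentence_id, score) pairs sorted from most to least similar to the query."""
    scored = [(sid, cosine_similarity(query_vec, vec)) for sid, vec in sentence_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumption): one symptom embedding and three candidate sentence embeddings.
symptom_vec = [1.0, 0.0, 1.0]
sentences = {
    "s1": [1.0, 0.0, 1.0],  # same direction as the symptom vector
    "s2": [0.0, 1.0, 0.0],  # orthogonal to it
    "s3": [1.0, 1.0, 0.0],  # partially aligned
}
ranking = rank_sentences(symptom_vec, sentences)
```

    In a real system the vectors would come from an embedding model such as the ones named in the abstract; only the ranking logic is shown here.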

Sumformer: Universal Approximation for Efficient Transformers

  • paper_url: http://arxiv.org/abs/2307.02301
  • repo_url: None
  • paper_authors: Silas Alberti, Niclas Dern, Laura Thesing, Gitta Kutyniok
  • for: This paper studies universal approximation of sequence-to-sequence functions by efficient Transformer architectures.
  • methods: It introduces a new architecture, Sumformer, and uses it to analyze Linformer and Performer.
  • results: It gives the first universal approximation results for Linformer and Performer, and derives a new proof for Transformers showing that a single attention layer suffices for universal approximation.
    Abstract Natural language processing (NLP) made an impressive jump with the introduction of Transformers. ChatGPT is one of the most famous examples, changing the perception of the possibilities of AI even outside the research community. However, besides the impressive performance, the quadratic time and space complexity of Transformers with respect to sequence length pose significant limitations for handling long sequences. While efficient Transformer architectures like Linformer and Performer with linear complexity have emerged as promising solutions, their theoretical understanding remains limited. In this paper, we introduce Sumformer, a novel and simple architecture capable of universally approximating equivariant sequence-to-sequence functions. We use Sumformer to give the first universal approximation results for Linformer and Performer. Moreover, we derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
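    Summary Natural language processing (NLP) made an impressive leap with the introduction of Transformers. ChatGPT is one of the best-known examples and has changed perceptions of what AI can do even outside the research community. However, the time and space complexity of Transformers is quadratic in sequence length, which severely limits the handling of long sequences. Efficient architectures with linear complexity such as Linformer and Performer have emerged, but their theoretical understanding remains limited. This paper introduces Sumformer, a new and simple architecture that can universally approximate equivariant sequence-to-sequence functions. Sumformer is used to give the first universal approximation results for Linformer and Performer. The paper also derives a new proof for Transformers, showing that a single attention layer is sufficient for universal approximation.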
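    The abstract does not spell out the architecture, so the sketch below rests on an assumption suggested by the name: a token-wise map `phi` is summed over the sequence, and a second map `psi` is applied to each token together with that sum. Under this assumed form the cost is linear in sequence length and the function is permutation-equivariant. `phi` and `psi` here are toy integer functions chosen for illustration, not learned networks.

```python
def sumformer(xs, phi, psi):
    """Assumed sum-then-map form: aggregate phi over all tokens, then apply psi per token."""
    features = [phi(x) for x in xs]                    # one pass over the sequence
    aggregate = [sum(col) for col in zip(*features)]   # component-wise sum of the features
    return [psi(x, aggregate) for x in xs]             # second pass, one output per token


def phi(x):
    # Toy feature map (assumption): the token and its square.
    return (x, x * x)


def psi(x, aggregate):
    # Toy output map (assumption): mixes the token with the sequence-level sums.
    return x * aggregate[0] + aggregate[1]


seq = [1, 2, 3]
out = sumformer(seq, phi, psi)             # aggregate is [6, 14]
permuted_out = sumformer([3, 1, 2], phi, psi)
```

    Permuting the input permutes the output in the same way, which is the equivariance property the abstract refers to.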

Performance Comparison of Large Language Models on VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard

  • paper_url: http://arxiv.org/abs/2307.02288
  • repo_url: None
  • paper_authors: Xuan-Quy Dao
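  • for: This study compares the performance of three large language models (LLMs) on the VNHSGE English dataset.
  • methods: The comparison covers OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard.
  • results: BingChat, Bard, and ChatGPT (GPT-3.5) score 92.4%, 86%, and 79.2%, respectively, so BingChat outperforms both ChatGPT and Bard. BingChat and Bard can therefore stand in for ChatGPT while ChatGPT is not yet officially available in Vietnam. All three models also outperform Vietnamese students in English proficiency, which adds to the understanding of the potential of LLMs in English language education.
    Abstract This paper presents a performance comparison of three large language models (LLMs), namely OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard, on the VNHSGE English dataset. The performance of BingChat, Bard, and ChatGPT (GPT-3.5) is 92.4%, 86%, and 79.2%, respectively. The results show that BingChat is better than ChatGPT and Bard. Therefore, BingChat and Bard can replace ChatGPT while ChatGPT is not yet officially available in Vietnam. The results also indicate that BingChat, Bard and ChatGPT outperform Vietnamese students in English language proficiency. The findings of this study contribute to the understanding of the potential of LLMs in English language education. The remarkable performance of ChatGPT, BingChat, and Bard demonstrates their potential as effective tools for teaching and learning English at the high school level.
    Summary This paper compares the performance of three large language models (LLMs), OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard, on the VNHSGE English dataset. BingChat, Bard, and ChatGPT (GPT-3.5) score 92.4%, 86%, and 79.2%, respectively. BingChat is better than ChatGPT and Bard, so BingChat and Bard can replace ChatGPT while ChatGPT is not yet officially available in Vietnam. All three models also outperform Vietnamese students in English proficiency. The findings add to the understanding of the potential of LLMs in English language education, and the strong performance of ChatGPT, BingChat, and Bard shows their potential as effective tools for teaching and learning English at the high school level.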

SpaceNLI: Evaluating the Consistency of Predicting Inferences in Space

  • paper_url: http://arxiv.org/abs/2307.02269
  • repo_url: https://github.com/kovvalsky/spacenli
  • paper_authors: Lasha Abzianidze, Joost Zwarts, Yoad Winter
  • for: fill the gap of spatial expression and reasoning in natural language inference (NLI) datasets
  • methods: semi-automatically created an NLI dataset for spatial reasoning called SpaceNLI, using curated reasoning patterns and expert annotations
  • results: SOTA NLI systems obtain moderate results on spatial NLI problems but lack consistency per inference pattern, with non-projective spatial inferences (especially those using the “between” preposition) being the most challenging ones.
    Abstract While many natural language inference (NLI) datasets target certain semantic phenomena, e.g., negation, tense & aspect, monotonicity, and presupposition, to the best of our knowledge, there is no NLI dataset that involves diverse types of spatial expressions and reasoning. We fill this gap by semi-automatically creating an NLI dataset for spatial reasoning, called SpaceNLI. The data samples are automatically generated from a curated set of reasoning patterns, where the patterns are annotated with inference labels by experts. We test several SOTA NLI systems on SpaceNLI to gauge the complexity of the dataset and the system's capacity for spatial reasoning. Moreover, we introduce a Pattern Accuracy and argue that it is a more reliable and stricter measure than the accuracy for evaluating a system's performance on pattern-based generated data samples. Based on the evaluation results we find that the systems obtain moderate results on the spatial NLI problems but lack consistency per inference pattern. The results also reveal that non-projective spatial inferences (especially due to the "between" preposition) are the most challenging ones.
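    To make the contrast between sample-level accuracy and a pattern-level measure concrete, here is a minimal sketch. The abstract does not give the exact definition of Pattern Accuracy, so the reading below is an assumption: a pattern counts as solved only if the share of its samples predicted correctly reaches a threshold (1.0 means all of them). The records and the function names are hypothetical.

```python
from collections import defaultdict


def sample_accuracy(records):
    """Fraction of individual samples whose prediction matches the gold label."""
    return sum(1 for _, gold, pred in records if gold == pred) / len(records)


def pattern_accuracy(records, threshold=1.0):
    """Fraction of patterns whose per-pattern accuracy is at least `threshold` (assumed definition)."""
    hits_by_pattern = defaultdict(list)
    for pattern_id, gold, pred in records:
        hits_by_pattern[pattern_id].append(gold == pred)
    solved = sum(
        1 for hits in hits_by_pattern.values() if sum(hits) / len(hits) >= threshold
    )
    return solved / len(hits_by_pattern)


# Toy records (assumption): (pattern id, gold label, predicted label).
records = [
    ("p1", "entailment", "entailment"),
    ("p1", "entailment", "entailment"),
    ("p2", "contradiction", "contradiction"),
    ("p2", "contradiction", "neutral"),   # one miss makes p2 inconsistent
    ("p3", "neutral", "neutral"),
    ("p3", "neutral", "neutral"),
]
acc = sample_accuracy(records)
pat_acc = pattern_accuracy(records)
```

    Here five of six samples are correct, but only two of three patterns are solved consistently, which is why a pattern-level score is the stricter of the two.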

Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks

  • paper_url: http://arxiv.org/abs/2307.02179
  • repo_url: None
  • paper_authors: Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Juan Diego Bermeo, Maria Korobeynikova, Fabrizio Gilardi
  • for: This study aims to evaluate the performance of open-source Large Language Models (LLMs) in text annotation tasks and compare them with proprietary models like ChatGPT and human-based services such as MTurk.
  • methods: The study uses both zero-shot and few-shot approaches and different temperature parameters across a range of text annotation tasks to assess the performance of open-source LLMs.
  • results: The findings show that while ChatGPT achieves the best performance in most tasks, open-source LLMs not only outperform MTurk but also demonstrate competitive potential against ChatGPT in specific tasks.
    Abstract This study examines the performance of open-source Large Language Models (LLMs) in text annotation tasks and compares it with proprietary models like ChatGPT and human-based services such as MTurk. While prior research demonstrated the high performance of ChatGPT across numerous NLP tasks, open-source LLMs like HugginChat and FLAN are gaining attention for their cost-effectiveness, transparency, reproducibility, and superior data protection. We assess these models using both zero-shot and few-shot approaches and different temperature parameters across a range of text annotation tasks. Our findings show that while ChatGPT achieves the best performance in most tasks, open-source LLMs not only outperform MTurk but also demonstrate competitive potential against ChatGPT in specific tasks.
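    Summary This study examines the performance of open-source Large Language Models (LLMs) in text annotation tasks and compares it with proprietary models such as ChatGPT and human-based services such as MTurk. Prior research demonstrated the high performance of ChatGPT across many NLP tasks, but open-source LLMs such as HugginChat and FLAN are attracting attention for their cost-effectiveness, transparency, reproducibility, and data protection. The models are evaluated with zero-shot and few-shot approaches and different temperature parameters across a range of text annotation tasks. While ChatGPT performs best in most tasks, open-source LLMs not only outperform MTurk but are also competitive with ChatGPT in specific tasks.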
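    The difference between the zero-shot and few-shot setups mentioned in this entry comes down to whether labeled examples are placed in the prompt. The sketch below only builds prompt strings and a settings dictionary; it calls no model. The prompt layout, the label names, the example texts, and the temperature value are all illustrative assumptions, not the study's actual materials.

```python
def build_prompt(task_instruction, text, examples=()):
    """Assemble an annotation prompt; an empty `examples` gives the zero-shot variant."""
    parts = [task_instruction]
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nLabel: {example_label}")
    parts.append(f"Text: {text}\nLabel:")  # the model is expected to complete the label
    return "\n\n".join(parts)


def request_settings(prompt, temperature):
    """Bundle a prompt with a sampling temperature (lower values give less varied output)."""
    return {"prompt": prompt, "temperature": temperature}


# Hypothetical task and examples, for illustration only.
instruction = "Classify the text as 'relevant' or 'irrelevant'."
demo_examples = [
    ("The new policy was announced today.", "relevant"),
    ("I had pasta for lunch.", "irrelevant"),
]
zero_shot = build_prompt(instruction, "Parliament votes on the bill tomorrow.")
few_shot = build_prompt(instruction, "Parliament votes on the bill tomorrow.", demo_examples)
settings = request_settings(few_shot, temperature=0.2)
```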

Generative Job Recommendations with Large Language Model

  • paper_url: http://arxiv.org/abs/2307.02157
  • repo_url: None
  • paper_authors: Zhi Zheng, Zhaopeng Qiu, Xiao Hu, Likang Wu, Hengshu Zhu, Hui Xiong
  • for: Provide a personalized and comprehensive job-seeking experience by having a language model generate suitable job descriptions for a job seeker.
  • methods: Supervised Fine-Tuning (SFT) trains an LLM-based generator to write Job Descriptions (JDs) from a job seeker's Curriculum Vitae (CV). A model that scores the matching degree between CVs and JDs serves as the reward model for further fine-tuning with Proximal Policy Optimization (PPO)-based Reinforcement Learning (RL).
  • results: Extensive experiments on a large-scale real-world dataset show that the approach is effective and that the generated content can supplement existing job recommendation models and improve their performance.
    Abstract The rapid development of online recruitment services has encouraged the utilization of recommender systems to streamline the job seeking process. Predominantly, current job recommendations deploy either collaborative filtering or person-job matching strategies. However, these models tend to operate as "black-box" systems and lack the capacity to offer explainable guidance to job seekers. Moreover, conventional matching-based recommendation methods are limited to retrieving and ranking existing jobs in the database, restricting their potential as comprehensive career AI advisors. To this end, here we present GIRL (GeneratIve job Recommendation based on Large language models), a novel approach inspired by recent advancements in the field of Large Language Models (LLMs). We initially employ a Supervised Fine-Tuning (SFT) strategy to instruct the LLM-based generator in crafting suitable Job Descriptions (JDs) based on the Curriculum Vitae (CV) of a job seeker. Moreover, we propose to train a model which can evaluate the matching degree between CVs and JDs as a reward model, and we use Proximal Policy Optimization (PPO)-based Reinforcement Learning (RL) method to further fine-tune the generator. This aligns the generator with recruiter feedback, tailoring the output to better meet employer preferences. In particular, GIRL serves as a job seeker-centric generative model, providing job suggestions without the need of a candidate set. This capability also enhances the performance of existing job recommendation models by supplementing job seeking features with generated content. With extensive experiments on a large-scale real-world dataset, we demonstrate the substantial effectiveness of our approach. We believe that GIRL introduces a paradigm-shifting approach to job recommendation systems, fostering a more personalized and comprehensive job-seeking experience.
    Summary The rapid development of online recruitment services has encouraged the use of recommender systems for job seeking. Most current job recommendation approaches rely on collaborative filtering or person-job matching. These models tend to act as "black-box" systems that cannot offer explainable guidance to job seekers, and conventional matching-based methods can only retrieve and rank jobs that already exist in the database, which limits their potential as comprehensive career advisors. To address this, the paper proposes GIRL (GeneratIve job Recommendation based on Large language models). A Supervised Fine-Tuning (SFT) strategy first instructs the LLM-based generator to write suitable Job Descriptions (JDs) from a job seeker's Curriculum Vitae (CV). A model that evaluates the matching degree between CVs and JDs is then trained as a reward model, and Proximal Policy Optimization (PPO)-based Reinforcement Learning (RL) further fine-tunes the generator, aligning it with recruiter feedback so that its output better meets employer preferences. GIRL is a job seeker-centric generative model that provides job suggestions without needing a candidate set, and its generated content can also supplement the features of existing job recommendation models and improve their performance. Extensive experiments on a large-scale real-world dataset demonstrate the effectiveness of the approach. The authors believe GIRL introduces a new paradigm for job recommendation, offering a more personalized and comprehensive job-seeking experience.

LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

  • paper_url: http://arxiv.org/abs/2307.02146
  • repo_url: None
  • paper_authors: Longshen Ou, Xichu Ma, Ye Wang
  • for: bridges the singability gap between generated lyrics and melodies
  • methods: jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L)
  • results: achieves 3.75% and 21.44% absolute accuracy gains in the outputs’ number-of-line and syllable-per-line requirements, and demonstrates 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation.
    Abstract Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L). After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
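    Summary Despite earlier work on melody-to-lyric generation, a compatibility gap remains between generated lyrics and melodies, which hurts the singability of the outputs. This paper bridges the gap by jointly learning wording and formatting during melody-to-lyric training (LOAF-M2L). After general-domain pretraining, the proposed model first acquires length awareness from a large text-only lyric corpus. A new objective, informed by musicological research on the relationship between melody and lyrics, then lets the model learn the fine-grained format requirements of the melody. Compared with naive fine-tuning, the model gains 3.75% and 21.44% absolute accuracy on the number-of-line and syllable-per-line requirements without sacrificing text fluency. In the subjective evaluation, it shows 63.92% and 74.18% relative improvement in music-lyric compatibility and overall quality over the state-of-the-art melody-to-lyric generation model, which highlights the importance of learning formatting.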
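    This entry reports both absolute gains (accuracy) and relative improvements (subjective scores), and the two are easy to confuse. The sketch below shows the arithmetic with made-up baseline values; the paper's actual baseline numbers are not given in the abstract, so every number here is a placeholder.

```python
def absolute_gain(baseline, new):
    """Difference in the metric's own units, e.g. percentage points of accuracy."""
    return new - baseline


def relative_improvement(baseline, new):
    """Gain expressed as a fraction of the baseline value."""
    return (new - baseline) / baseline


# Hypothetical accuracy values in percent (assumption, not from the paper).
baseline_accuracy = 60.0
new_accuracy = 63.75
points_gained = absolute_gain(baseline_accuracy, new_accuracy)          # in percentage points
relative_to_base = relative_improvement(baseline_accuracy, new_accuracy)

# Hypothetical subjective ratings on an arbitrary scale (assumption).
baseline_rating = 2.0
new_rating = 3.0
rating_improvement = relative_improvement(baseline_rating, new_rating)
```

    With these placeholder numbers, a 3.75-point absolute gain corresponds to a 6.25% relative improvement, so the two kinds of figures are not directly comparable.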

Leveraging Denoised Abstract Meaning Representation for Grammatical Error Correction

  • paper_url: http://arxiv.org/abs/2307.02127
  • repo_url: None
  • paper_authors: Hejing Cao, Dongyan Zhao
  • for: This paper proposes a Grammatical Error Correction (GEC) model that uses Abstract Meaning Representation (AMR) as additional knowledge.
  • methods: The authors propose AMR-GEC, a seq-to-seq model, and explore denoising methods to make the AMRs of ungrammatical sentences more reliable.
  • results: On the BEA-2019 and CoNLL-2014 shared tasks, AMR-GEC performs comparably to strong baselines trained with large amounts of synthetic data. Compared with the T5 model with synthetic data, it reduces training time by 32% with comparable inference time.
    Abstract Grammatical Error Correction (GEC) is the task of correcting errorful sentences into grammatically correct, semantically consistent, and coherent sentences. Popular GEC models either use large-scale synthetic corpora or use a large number of human-designed rules. The former is costly to train, while the latter requires quite a lot of human expertise. In recent years, AMR, a semantic representation framework, has been widely used by many natural language tasks due to its completeness and flexibility. A non-negligible concern is that AMRs of grammatically incorrect sentences may not be exactly reliable. In this paper, we propose the AMR-GEC, a seq-to-seq model that incorporates denoised AMR as additional knowledge. Specifically, we design a semantic aggregated GEC model and explore denoising methods to get AMRs more reliable. Experiments on the BEA-2019 shared task and the CoNLL-2014 shared task have shown that AMR-GEC performs comparably to a set of strong baselines with a large number of synthetic data. Compared with the T5 model with synthetic data, AMR-GEC can reduce the training time by 32% while inference time is comparable. To the best of our knowledge, we are the first to incorporate AMR for grammatical error correction.
    Summary Grammatical Error Correction (GEC) is the task of correcting erroneous sentences into grammatically correct, semantically consistent, and coherent sentences. Popular GEC models either use large-scale synthetic corpora or a large number of human-designed rules. The former is costly to train, while the latter requires a great deal of human expertise. In recent years AMR, a semantic representation framework, has been widely used in natural language tasks because of its completeness and flexibility. A non-negligible concern is that AMRs of grammatically incorrect sentences may not be fully reliable. This paper proposes AMR-GEC, a seq-to-seq model that incorporates denoised AMR as additional knowledge: it designs a semantic aggregated GEC model and explores denoising methods to make AMRs more reliable. Experiments on the BEA-2019 and CoNLL-2014 shared tasks show that AMR-GEC performs comparably to a set of strong baselines trained with large amounts of synthetic data. Compared with the T5 model with synthetic data, AMR-GEC reduces training time by 32% while inference time is comparable. To the authors' knowledge, this is the first work to incorporate AMR for grammatical error correction.

Multilingual Controllable Transformer-Based Lexical Simplification

  • paper_url: http://arxiv.org/abs/2307.02120
  • repo_url: https://github.com/kimchengsheang/mtls
  • paper_authors: Kim Cheng Sheang, Horacio Saggion
  • for: This paper aims to make text easier to read and more accessible by suggesting simpler alternatives for complex words without compromising meaning.
  • methods: It proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fine-tuned with the T5 model, which uses language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models.
  • results: On three well-known LS datasets (LexMTurk, BenchLS, and NNSEval), the model outperforms previous state-of-the-art models such as LSBert and ConLS. On part of the TSAR-2022 multilingual LS shared-task dataset, it is competitive with the participating systems for English LS and outperforms GPT-3 on several metrics. It also obtains performance gains for Spanish and Portuguese.
    Abstract Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Therefore, suggesting simpler alternatives for complex words without compromising meaning would help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fine-tuned with the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. The evaluation results on three well-known LS datasets -- LexMTurk, BenchLS, and NNSEval -- show that our model outperforms the previous state-of-the-art models like LSBert and ConLS. Moreover, further evaluation of our approach on the part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively when compared with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Moreover, our model obtains performance gains also for Spanish and Portuguese.
    Summary Text is the most common source of knowledge and information and should be accessible to as many people as possible. However, texts often contain complex words that hinder reading comprehension and accessibility, so suggesting simpler alternatives for complex words without compromising meaning helps convey information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system that learns simpler alternatives for complex words using language-specific prefixes, control tokens, and candidates from pre-trained masked language models. On the three well-known LS datasets LexMTurk, BenchLS, and NNSEval, the model outperforms previous state-of-the-art models such as LSBert and ConLS. Further evaluation on part of the TSAR-2022 multilingual LS shared-task dataset shows that the model is competitive with the participating systems for English LS and outperforms GPT-3 on several metrics. The model also obtains performance gains for Spanish and Portuguese.

Do predictability factors towards signing avatars hold across cultures?

  • paper_url: http://arxiv.org/abs/2307.02103
  • repo_url: None
  • paper_authors: Abdelhadi Soudi, Manal El Hakkaoui, Kristof Van Laerhoven
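  • for: This study examines the extent to which intrinsic and extrinsic factors predict attitudes towards signing avatars across cultures among Deaf and Hard-of-Hearing sign language users.
  • methods: A questionnaire was designed to capture Moroccan Sign Language (MSL) users' attitudes towards avatars. The factors considered include avatar characteristics (appearance, movements, facial expressions) and users' technology experience, hearing status, age, and sign language fluency.
  • results: Three groups of participants were surveyed: Deaf (57), Hearing (20), and Hard-of-Hearing (3). The results were compared with those reported in other relevant studies, for example on ASL users.
    Abstract Avatar technology can offer accessibility possibilities and improve the Deaf and Hard-of-Hearing sign language users' access to communication, education and services, such as the healthcare system. However, sign language users' acceptance of signing avatars as well as their attitudes towards them vary and depend on many factors. Furthermore, research on avatar technology is mostly done by researchers who are not Deaf. The study examines the extent to which intrinsic or extrinsic factors contribute to predict the attitude towards avatars across cultures. Intrinsic factors include the characteristics of the avatar, such as appearance, movements and facial expressions. Extrinsic factors include users' technology experience, their hearing status, age and their sign language fluency. This work attempts to answer questions such as, if lower attitude ratings are related to poor technology experience with ASL users, for example, is that also true for Moroccan Sign Language (MSL) users? For the purposes of the study, we designed a questionnaire to understand MSL users' attitude towards avatars. Three groups of participants were surveyed: Deaf (57), Hearing (20) and Hard-of-Hearing (3). The results of our study were then compared with those reported in other relevant studies.
    Summary Avatar technology can offer accessibility possibilities and improve Deaf and Hard-of-Hearing sign language users' access to communication, education, and services such as the healthcare system. However, sign language users' acceptance of signing avatars and their attitudes towards them vary and depend on many factors. Moreover, most research on avatar technology is done by researchers who are not Deaf. This study examines the extent to which intrinsic or extrinsic factors predict attitudes towards avatars across cultures. Intrinsic factors include characteristics of the avatar such as appearance, movements, and facial expressions. Extrinsic factors include users' technology experience, hearing status, age, and sign language fluency. The study asks, for example, whether a link between lower attitude ratings and poor technology experience found among ASL users also holds for Moroccan Sign Language (MSL) users. A questionnaire was designed to understand MSL users' attitudes towards avatars. Three groups of participants were surveyed: Deaf (57), Hearing (20), and Hard-of-Hearing (3). The results were compared with those reported in other relevant studies.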

Different Games in Dialogue: Combining character and conversational types in strategic choice

  • paper_url: http://arxiv.org/abs/2307.02087
  • repo_url: None
  • paper_authors: Alafate Abulimiti
  • for: investigating the interaction of conversational type and character types of interlocutors
  • methods: combining decision making process for selecting dialogue moves with character type and conversational type, and presenting a mathematical model to illustrate the interactions
  • results: presenting a quantitative approach to understanding the interactions between conversational type and character types in dialogue moves
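    Abstract In this paper, we show that investigating the interaction of conversational type (often known as language game or speech genre) with the character types of the interlocutors is worthwhile. We present a method of calculating the decision making process for selecting dialogue moves that combines character type and conversational type. We also present a mathematical model that illustrates these factors' interactions in a quantitative way.
    Summary This paper shows that it is worthwhile to investigate how conversational type (often known as language game or speech genre) interacts with the character types of the interlocutors. It presents a method for calculating the decision-making process for selecting dialogue moves that combines character type and conversational type, along with a mathematical model that illustrates how these factors interact.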

Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

  • paper_url: http://arxiv.org/abs/2307.02083
  • repo_url: None
  • paper_authors: Christiaan Jacobs, Herman Kamper
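  • for: This paper studies semantic modelling of acoustic word embeddings (AWEs) when only untranscribed speech is available in the target language.
  • methods: The authors propose several strategies that leverage a pre-trained multilingual AWE model. The best one clusters word segments with the multilingual AWE model, derives soft pseudo-word labels from the cluster centroids, and trains a Skipgram-like model on the soft vectors.
  • results: In an intrinsic word similarity task measuring semantics, the approach outperforms all previous semantic AWE methods. The multilingual transfer approach can also be used for downstream semantic query-by-example search.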
    Abstract Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only have untranscribed speech in a target language. We introduce a number of strategies leveraging a pre-trained multilingual AWE model -- a phonetic AWE model trained on labelled data from multiple languages excluding the target. Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show -- for the first time -- that AWEs can be used for downstream semantic query-by-example search.
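    Summary Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content, so that different realisations of the same word have similar embeddings. This paper explores semantic AWE modelling: the AWEs should capture not only phonetics but also the meaning of a word, similar to textual word embeddings. The setting assumes only untranscribed speech in the target language. Several strategies are introduced that leverage a pre-trained multilingual AWE model, a phonetic AWE model trained on labelled data from multiple languages excluding the target. The best semantic AWE approach clusters word segments using the multilingual AWE model, derives soft pseudo-word labels from the cluster centroids, and then trains a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. The paper also shows, for the first time, that AWEs can be used for downstream semantic query-by-example search.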
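    The step of deriving soft pseudo-word labels from cluster centroids can be pictured with the sketch below. The abstract does not say how the soft labels are computed, so the choice here (a softmax over negative squared distances to each centroid) is an assumption made for illustration, as are the two-dimensional toy vectors and the function names.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_labels(embedding, centroids, temperature=1.0):
    """Soft assignment over clusters: closer centroids receive higher probability (assumed scheme)."""
    scores = [-squared_distance(embedding, c) / temperature for c in centroids]
    max_score = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (assumption): three cluster centroids and one word-segment embedding.
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
segment = [0.9, 0.1]
labels = soft_labels(segment, centroids)
```

    The resulting probability vector plays the role of a soft pseudo-word label; in the described approach such vectors would then feed a Skipgram-like model, which is not shown here.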

Graph Contrastive Topic Model

  • paper_url: http://arxiv.org/abs/2307.02078
  • repo_url: https://github.com/zhehengluok/gctm
  • paper_authors: Zheheng Luo, Lei Liu, Qianqian Xie, Sophia Ananiadou
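  • for: This work aims to improve topic coherence and document representations in NTMs and to address the sample bias problem.
  • methods: The authors propose a new sampling assumption: negative samples should contain words that are semantically irrelevant to the prototype. Building on it, they propose the graph contrastive topic model (GCTM), which performs graph contrastive learning (GCL) with informative positive and negative samples to improve the learning of document topic representations and latent topics.
  • results: Experiments on several benchmark datasets show that the method improves topic coherence and document representation learning compared with existing SOTA methods.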
    Abstract Existing NTMs with contrastive learning suffer from the sample bias problem owing to the word frequency-based sampling strategy, which may result in false negative samples with similar semantics to the prototypes. In this paper, we aim to explore the efficient sampling strategy and contrastive learning in NTMs to address the aforementioned issue. We propose a new sampling assumption that negative samples should contain words that are semantically irrelevant to the prototype. Based on it, we propose the graph contrastive topic model (GCTM), which conducts graph contrastive learning (GCL) using informative positive and negative samples that are generated by the graph-based sampling strategy leveraging in-depth correlation and irrelevance among documents and words. In GCTM, we first model the input document as the document word bipartite graph (DWBG), and construct positive and negative word co-occurrence graphs (WCGs), encoded by graph neural networks, to express in-depth semantic correlation and irrelevance among words. Based on the DWBG and WCGs, we design the document-word information propagation (DWIP) process to perform the edge perturbation of DWBG, based on multi-hop correlations/irrelevance among documents and words. This yields the desired negative and positive samples, which will be utilized for GCL together with the prototypes to improve learning document topic representations and latent topics. We further show that GCL can be interpreted as the structured variational graph auto-encoder which maximizes the mutual information of latent topic representations of different perspectives on DWBG. Experiments on several benchmark datasets demonstrate the effectiveness of our method for topic coherence and document representation learning compared with existing SOTA methods.

Flacuna: Unleashing the Problem Solving Power of Vicuna using FLAN Fine-Tuning

  • paper_url: http://arxiv.org/abs/2307.02053
  • repo_url: https://github.com/declare-lab/flacuna
  • paper_authors: Deepanway Ghosal, Yew Ken Chia, Navonil Majumder, Soujanya Poria
  • for: The paper investigates the impact of the third factor (instruction dataset) on the performance of large language models (LLMs) that utilize encoder-decoder or decoder-only architecture.
  • methods: The paper uses a customized instruction dataset collection called FLANMINI, which includes a subset of the large-scale instruction dataset known as FLAN, as well as various code-related datasets and conversational datasets derived from ChatGPT/GPT-4. The paper fine-tunes VICUNA, a large language model based on LLAMA, on the FLAN dataset to obtain enhanced problem-solving abilities.
  • results: Fine-tuning VICUNA on the FLAN dataset leads to significant improvements across numerous benchmark datasets in INSTRUCTEVAL. The paper also introduces FLACUNA, a publicly available model obtained by fine-tuning VICUNA on the FLAN dataset, which demonstrates improved problem-solving abilities compared to the latest decoder-based LLMs.
    Abstract Recently, the release of INSTRUCTEVAL has provided valuable insights into the performance of large language models (LLMs) that utilize encoder-decoder or decoder-only architecture. Interestingly, despite being introduced four years ago, T5-based LLMs, such as FLAN-T5, continue to outperform the latest decoder-based LLMs, such as LLAMA and VICUNA, on tasks that require general problem-solving skills. This performance discrepancy can be attributed to three key factors: (1) Pre-training data, (2) Backbone architecture, and (3) Instruction dataset. In this technical report, our main focus is on investigating the impact of the third factor by leveraging VICUNA, a large language model based on LLAMA, which has undergone fine-tuning on ChatGPT conversations. To achieve this objective, we fine-tuned VICUNA using a customized instruction dataset collection called FLANMINI. This collection includes a subset of the large-scale instruction dataset known as FLAN, as well as various code-related datasets and conversational datasets derived from ChatGPT/GPT-4. This dataset comprises a large number of tasks that demand problem-solving skills. Our experimental findings strongly indicate that the enhanced problem-solving abilities of our model, FLACUNA, are obtained through fine-tuning VICUNA on the FLAN dataset, leading to significant improvements across numerous benchmark datasets in INSTRUCTEVAL. FLACUNA is publicly available at https://huggingface.co/declare-lab/flacuna-13b-v1.0.

CAME: Confidence-guided Adaptive Memory Efficient Optimization

  • paper_url: http://arxiv.org/abs/2307.02047
  • repo_url: https://github.com/yangluo7/came
  • paper_authors: Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You
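  • for: This paper studies an adaptive optimizer that achieves both fast convergence and low memory usage, in order to make the training of large language models more efficient.
  • methods: The paper first studies a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, it proposes CAME, an optimizer that achieves fast convergence and low memory usage at the same time.
  • results: Extensive experiments show the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training with a large batch size of 32,768, the proposed optimizer attains faster convergence and higher accuracy than the Adam optimizer.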
    Abstract Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which entails a high cost of extra memory overheads. To solve this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training on the large batch size of 32,768, our proposed optimizer attains faster convergence and higher accuracy compared with the Adam optimizer. The implementation of CAME is publicly available.
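To make the memory argument in the abstract above concrete, here is a minimal sketch of the factored second-moment idea used by Adafactor, the memory-efficient optimizer the abstract names. It is not CAME's confidence-guided update, which the abstract does not spell out. For an n x m weight matrix, only n row sums and m column sums of the squared gradients are stored, and the full matrix is approximated from their outer product. Moving averages and epsilon terms are omitted, and the function names are hypothetical.

```python
def factored_second_moment(sq_grads):
    """Compress an n x m matrix of squared gradients into n + m numbers."""
    row_sums = [sum(row) for row in sq_grads]
    col_sums = [sum(col) for col in zip(*sq_grads)]
    total = sum(row_sums)
    return row_sums, col_sums, total

def reconstruct(row_sums, col_sums, total):
    """Rank-1 approximation: V[i][j] ~ row_sums[i] * col_sums[j] / total."""
    return [[r * c / total for c in col_sums] for r in row_sums]

# A rank-1 example, for which the factored estimate is exact.
sq = [[1.0, 2.0, 3.0],
      [2.0, 4.0, 6.0]]
rows, cols, total = factored_second_moment(sq)
approx = reconstruct(rows, cols, total)
```

Storing n + m values instead of n * m is the "drastic reduction in auxiliary memory usage" the abstract refers to. The approximation error on matrices that are not rank-1 is one plausible source of the performance penalty the abstract mentions.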

Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems

  • paper_url: http://arxiv.org/abs/2307.02009
  • repo_url: None
  • paper_authors: Tanvina Patel, Odette Scharenborg
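  • for: Reducing bias against different age groups and non-native speakers of Dutch in end-to-end (E2E) speech recognition.
  • methods: State-of-the-art speed perturbation and spectral augmentation are used as data augmentation techniques, and Vocal Tract Length Normalization (VTLN) is explored to normalise for spectral differences due to differences in anatomy.
  • results: Combining data augmentation and VTLN reduced the average WER and the bias across various diverse speaker groups by 6.9% and 3.9%, respectively. The VTLN model trained on Dutch also improved performance on Mandarin Chinese child speech, which indicates generalisability across languages.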
    Abstract Speech technology has improved greatly for norm speakers, i.e., adult native speakers of a language without speech impediments or strong accents. However, non-norm or diverse speaker groups show a distinct performance gap with norm speakers, which we refer to as bias. In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. For an end-to-end (E2E) ASR system, we use state-of-the-art speed perturbation and spectral augmentation as data augmentation techniques and explore Vocal Tract Length Normalization (VTLN) to normalise for spectral differences due to differences in anatomy. The combination of data augmentation and VTLN reduced the average WER and bias across various diverse speaker groups by 6.9% and 3.9%, respectively. The VTLN model trained on Dutch was also effective in improving performance of Mandarin Chinese child speech, thus, showing generalisability across languages.
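The two quantities reported above, WER and bias, can be illustrated with a short sketch. The WER function is the standard word-level edit distance divided by the reference length. The `bias_gap` helper is an assumption: the abstract defines bias as the performance gap between a diverse speaker group and norm speakers, but does not give the exact formula, so a plain difference of WERs is used here for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def bias_gap(group_wer, norm_wer):
    """Hypothetical bias measure: WER gap between a group and norm speakers."""
    return group_wer - norm_wer

# One substitution (zit -> zat) and one deletion (de) over six reference words.
score = wer("de kat zit op de mat", "de kat zat op mat")
```

In this toy case `score` is 2/6. With per-group WERs in hand, `bias_gap` would be evaluated once per diverse speaker group against the norm-speaker WER.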

PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records

  • paper_url: http://arxiv.org/abs/2307.02006
  • repo_url: https://github.com/yuping-wu/pulsar
  • paper_authors: Viktor Schlegel, Hao Li, Yuping Wu, Anand Subramanian, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Daniel Beck, Xiaojun Zeng, Riza Theresa Batista-Navarro, Stefan Winkler, Goran Nenadic
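  • for: This paper describes PULSAR, the authors' system submission to the ImageClef 2023 MediQA-Sum task on summarising patient-doctor dialogues into clinical records.
  • methods: The framework relies on domain-specific pre-training to produce a specialised language model, which is then trained on task-specific natural data augmented with synthetic data generated by a black-box LLM.
  • results: The authors find limited evidence for the efficacy of domain-specific pre-training and data augmentation, while scaling up the language model yields the best performance gains. The approach ranked second and third among 13 submissions on task B. Code is available at https://github.com/yuping-wu/PULSAR.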
    Abstract This paper describes PULSAR, our system submission at the ImageClef 2023 MediQA-Sum task on summarising patient-doctor dialogues into clinical records. The proposed framework relies on domain-specific pre-training, to produce a specialised language model which is trained on task-specific natural data augmented by synthetic data generated by a black-box LLM. We find limited evidence towards the efficacy of domain-specific pre-training and data augmentation, while scaling up the language model yields the best performance gains. Our approach was ranked second and third among 13 submissions on task B of the challenge. Our code is available at https://github.com/yuping-wu/PULSAR.

Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification

  • paper_url: http://arxiv.org/abs/2307.01972
  • repo_url: https://github.com/raspberryice/inc-schema
  • paper_authors: Sha Li, Ruining Zhao, Manling Li, Heng Ji, Chris Callison-Burch, Jiawei Han
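  • for: This work aims to induce open-domain hierarchical event schemas, with their graph structure, from large language models (LLMs).
  • methods: The authors treat event schemas as a form of commonsense knowledge and build the event graph with an incremental prompting and verification method in three stages: event skeleton construction, event expansion, and event-event relation verification.
  • results: Compared with directly using LLMs to generate a linearized graph, the method generates large and complex schemas with a 7.2% F1 improvement in temporal relations and a 31.0% F1 improvement in hierarchical relations. Compared with the previous state-of-the-art closed-domain schema induction model, human assessors covered $\sim$10% more events when translating the schemas into coherent stories and rated the schemas 1.3 points higher (on a 5-point scale) for readability.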
    Abstract Event schemas are a form of world knowledge about the typical progression of events. Recent methods for event schema induction use information extraction systems to construct a large number of event graph instances from documents, and then learn to generalize the schema from such instances. In contrast, we propose to treat event schemas as a form of commonsense knowledge that can be derived from large language models (LLMs). This new paradigm greatly simplifies the schema induction process and allows us to handle both hierarchical relations and temporal relations between events in a straightforward way. Since event schemas have complex graph structures, we design an incremental prompting and verification method to break down the construction of a complex event graph into three stages: event skeleton construction, event expansion, and event-event relation verification. Compared to directly using LLMs to generate a linearized graph, our method can generate large and complex schemas with 7.2% F1 improvement in temporal relations and 31.0% F1 improvement in hierarchical relations. In addition, compared to the previous state-of-the-art closed-domain schema induction model, human assessors were able to cover $\sim$10% more events when translating the schemas into coherent stories and rated our schemas 1.3 points higher (on a 5-point scale) in terms of readability.

Transformed Protoform Reconstruction

  • paper_url: http://arxiv.org/abs/2307.01896
  • repo_url: https://github.com/cmu-llab/acl-2023
  • paper_authors: Young Min Kim, Kalvin Chang, Chenxuan Cui, David Mortensen
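  • for: This paper addresses protoform reconstruction, the task of inferring what morphemes or words looked like in the ancestral language of a set of daughter languages (for example, Latin for the Romance languages).
  • methods: The authors replace the RNN-based encoder-decoder with attention of Meloni et al. (2021) with a Transformer, and probe the model for potential phylogenetic signal.
  • results: The model outperforms that of Meloni et al. (2021) on a suite of different metrics on two datasets: the Romance data of 8,000 cognates spanning 5 languages and a Chinese dataset (Hou 2004) of 800+ cognates spanning 39 varieties.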
    Abstract Protoform reconstruction is the task of inferring what morphemes or words appeared like in the ancestral languages of a set of daughter languages. Meloni et al. (2021) achieved the state-of-the-art on Latin protoform reconstruction with an RNN-based encoder-decoder with attention model. We update their model with the state-of-the-art seq2seq model: the Transformer. Our model outperforms their model on a suite of different metrics on two different datasets: their Romance data of 8,000 cognates spanning 5 languages and a Chinese dataset (Hou 2004) of 800+ cognates spanning 39 varieties. We also probe our model for potential phylogenetic signal contained in the model. Our code is publicly available at https://github.com/cmu-llab/acl-2023.

ProPILE: Probing Privacy Leakage in Large Language Models

  • paper_url: http://arxiv.org/abs/2307.01881
  • repo_url: None
  • paper_authors: Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, Seong Joon Oh
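  • for: This paper aims to make data subjects, i.e., the owners of personally identifiable information (PII), aware of potential PII leakage in LLM-based services.
  • methods: The paper presents ProPILE, a probing tool that lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs.
  • results: The tool is demonstrated on the OPT-1.3B model trained on the publicly available Pile dataset, showing how hypothetical data subjects may assess the likelihood that their PII in the Pile is revealed. LLM service providers can also use ProPILE to evaluate their own levels of PII leakage with more powerful prompts tuned for their in-house models.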
    Abstract The rapid advancement and widespread use of large language models (LLMs) have raised significant concerns regarding the potential leakage of personally identifiable information (PII). These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage in LLM-based services. ProPILE lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs. We demonstrate its application on the OPT-1.3B model trained on the publicly available Pile dataset. We show how hypothetical data subjects may assess the likelihood of their PII being included in the Pile dataset being revealed. ProPILE can also be leveraged by LLM service providers to effectively evaluate their own levels of PII leakage with more powerful prompts specifically tuned for their in-house models. This tool represents a pioneering step towards empowering the data subjects for their awareness and control over their own data on the web.

Decoding the Popularity of TV Series: A Network Analysis Perspective

  • paper_url: http://arxiv.org/abs/2307.05329
  • repo_url: None
  • paper_authors: Melody Yu
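  • for: This study explores the relationship between the character networks of TV series episodes and their IMDB reviews.
  • methods: Character networks are built from the interactions of characters in scenes of the plot, and network metrics such as node degree and graph density are computed for each episode.
  • results: Certain network metrics of character interactions in episodes show a strong correlation with the IMDB review scores of the TV series.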
    Abstract In this paper, we analyze the character networks extracted from three popular television series and explore the relationship between a TV show episode's character network metrics and its review from IMDB. Character networks are graphs created from the plot of a TV show that represents the interactions of characters in scenes, indicating the presence of a connection between them. We calculate various network metrics for each episode, such as node degree and graph density, and use these metrics to explore the potential relationship between network metrics and TV series reviews from IMDB. Our results show that certain network metrics of character interactions in episodes have a strong correlation with the review score of TV series. Our research aims to provide more quantitative information that can help TV producers understand how to adjust the character dynamics of future episodes to appeal to their audience. By understanding the impact of character interactions on audience engagement and enjoyment, producers can make informed decisions about the development of their shows.
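The two metrics named in the abstract above, node degree and graph density, can be sketched for a single episode as follows. The scene format (a list of character names per scene), the rule that any two characters sharing a scene are connected, and the toy names are assumptions for illustration; the paper may extract and weight interactions differently.

```python
from itertools import combinations

def build_network(scenes):
    """Build an undirected character graph: one edge per pair sharing a scene."""
    nodes, edges = set(), set()
    for scene in scenes:
        nodes.update(scene)
        # Sort names so each undirected edge has one canonical form.
        for a, b in combinations(sorted(set(scene)), 2):
            edges.add((a, b))
    return nodes, edges

def node_degrees(nodes, edges):
    """Number of distinct characters each character interacts with."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def graph_density(nodes, edges):
    """Fraction of possible character pairs that are connected: 2E / (N(N-1))."""
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))

# Hypothetical episode with three scenes.
scenes = [["Ann", "Bob"], ["Bob", "Cat", "Dan"], ["Ann", "Bob"]]
nodes, edges = build_network(scenes)
degrees = node_degrees(nodes, edges)
density = graph_density(nodes, edges)
```

Here four of the six possible pairs are connected, so the density is 4/6. Computing such values per episode yields the feature table that would then be correlated with review scores.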