cs.CL - 2023-10-13

SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

  • paper_url: http://arxiv.org/abs/2310.09424
  • repo_url: https://github.com/NVIDIA/NeMo
  • paper_authors: Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
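  • for: This work proposes a Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities.
  • methods: SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions.
  • results: The unified SALM performs on par with task-specific Conformer baselines for ASR and AST and exhibits zero-shot in-context learning, demonstrated through keyword boosting for both tasks. The authors also propose speech supervised in-context training to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models.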
    Abstract We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
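    Summary We introduce a new Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM consists of a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers that handle speech input and the associated task instructions. The unified SALM matches task-specific Conformer baselines and also shows zero-shot in-context learning, demonstrated on a keyword-boosting task. We further propose speech supervised in-context training to bridge the gap between LLM training and downstream speech tasks, which further improves the in-context learning ability of speech-to-text models. The model is open-sourced through the NeMo toolkit.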
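To make the role of the LoRA layers concrete, the following is a minimal pure-Python sketch of a low-rank adapter wrapped around a frozen linear layer. It illustrates the general LoRA idea only and is not the NeMo implementation; the class name `LoRALinear`, the helper `matvec`, and the initialization scale are assumptions made for illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Hypothetical frozen linear layer with a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during adaptation
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
print(layer.forward([2.0, 3.0]))
```

Only `A` and `B` would be trained in such a setup, which keeps the number of trainable parameters small relative to the frozen weight matrix.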

A Computational Approach to Style in American Poetry

  • paper_url: http://arxiv.org/abs/2310.09357
  • repo_url: None
  • paper_authors: David M. Kaplan, David M. Blei
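  • for: The paper develops a quantitative method to assess the style of American poems and to visualize a collection of poems in relation to one another.
  • methods: Qualitative poetry criticism guides the design of metrics that analyze orthographic, syntactic, and phonemic features of poems.
  • results: The method extracts comprehensive stylistic information from poems and computes distances between them, and visualizations provide ready access to the analytical components. Tested on several poetry collections, it delineates poetic style better than traditional word-occurrence features, with potential applications to academic research on texts, research on the intuitive personal response to poetry, and recommendations based on readers' favorite poems.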
    Abstract We develop a quantitative method to assess the style of American poems and to visualize a collection of poems in relation to one another. Qualitative poetry criticism helped guide our development of metrics that analyze various orthographic, syntactic, and phonemic features. These features are used to discover comprehensive stylistic information from a poem's multi-layered latent structure, and to compute distances between poems in this space. Visualizations provide ready access to the analytical components. We demonstrate our method on several collections of poetry, showing that it better delineates poetry style than the traditional word-occurrence features that are used in typical text analysis algorithms. Our method has potential applications to academic research of texts, to research of the intuitive personal response to poetry, and to making recommendations to readers based on their favorite poems.
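    Summary We develop a quantitative method for assessing the style of American poems and visualizing poetry collections. Qualitative poetry criticism helped us develop metrics over orthographic, syntactic, and phonemic features, which are used to explore the multi-layered latent structure of a poem and to compute distances between poems. We apply the method to several poetry collections and show that it delineates poetic style better than traditional word-occurrence features. The method has potential applications to text research, to research on the intuitive personal response to poetry, and to recommendations based on readers' favorite poems.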
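The idea of mapping each poem to a feature vector and comparing poems by distance can be sketched as follows. The four surface features here are simple stand-ins chosen for illustration and are not the paper's metrics; the function names `style_features` and `style_distance` are hypothetical.

```python
import math

END_PUNCTUATION = (",", ".", ";", ":", "!", "?")

def style_features(poem):
    """Toy orthographic features: line count, words per line, word length, end-stopped ratio."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    words = poem.split()
    return [
        float(len(lines)),
        sum(len(ln.split()) for ln in lines) / len(lines),
        sum(len(w) for w in words) / len(words),
        sum(ln.rstrip().endswith(END_PUNCTUATION) for ln in lines) / len(lines),
    ]

def style_distance(poem_a, poem_b):
    """Euclidean distance between the two poems' feature vectors."""
    return math.dist(style_features(poem_a), style_features(poem_b))

long_lined = "The river carried every summer evening home,\nand left the silver reeds to whisper on the shore."
short_lined = "rain\non tin\nall night"
print(style_distance(long_lined, short_lined))
```

In practice the features would need to be normalized to comparable scales before computing distances, since raw line counts and ratios live on very different ranges.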

User Inference Attacks on Large Language Models

  • paper_url: http://arxiv.org/abs/2310.09266
  • repo_url: None
  • paper_authors: Nikhil Kandpal, Krishna Pillutla, Alina Oprea, Peter Kairouz, Christopher A. Choquette-Choo, Zheng Xu
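  • for: This work studies the privacy implications of fine-tuning large language models (LLMs) on user data.
  • methods: The authors define a threat model called user inference, in which an attacker infers whether a user's data was used for fine-tuning. Their attacks require only a small set of samples from the user and black-box access to the fine-tuned LLM.
  • results: LLMs are susceptible to user inference across a variety of fine-tuning datasets, at times with near-perfect attack success rates. Outlier users (those whose data distributions differ sufficiently from other users) and users who contribute large quantities of data are most vulnerable. Among mitigations, training-time interventions such as batch or per-example gradient clipping and early stopping fail to prevent user inference, whereas limiting the number of fine-tuning samples from a single user reduces attack effectiveness at the cost of less total fine-tuning data.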
    Abstract Fine-tuning is a common and effective method for tailoring large language models (LLMs) to specialized tasks and applications. In this paper, we study the privacy implications of fine-tuning LLMs on user data. To this end, we define a realistic threat model, called user inference, wherein an attacker infers whether or not a user's data was used for fine-tuning. We implement attacks for this threat model that require only a small set of samples from a user (possibly different from the samples used for training) and black-box access to the fine-tuned LLM. We find that LLMs are susceptible to user inference attacks across a variety of fine-tuning datasets, at times with near perfect attack success rates. Further, we investigate which properties make users vulnerable to user inference, finding that outlier users (i.e. those with data distributions sufficiently different from other users) and users who contribute large quantities of data are most susceptible to attack. Finally, we explore several heuristics for mitigating privacy attacks. We find that interventions in the training algorithm, such as batch or per-example gradient clipping and early stopping fail to prevent user inference. However, limiting the number of fine-tuning samples from a single user can reduce attack effectiveness, albeit at the cost of reducing the total amount of fine-tuning data.
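    Summary Fine-tuning is a common and effective way to adapt large language models (LLMs) to specific tasks and applications. In this study, we examine the privacy implications of fine-tuning LLMs. To this end, we define a realistic threat model, called user inference, in which an attacker infers whether a user's data was used for fine-tuning. We implement attacks under this threat model that need only a small set of the user's samples (possibly different from the training samples) and black-box access to the fine-tuned LLM. We find that fine-tuned LLMs are susceptible to these attacks across different training datasets, at times with success rates close to 100%. We further study which properties make users vulnerable and find that outlier users (those whose data distributions differ from other users) and users who contribute large quantities of data are most susceptible. Finally, we explore several mitigations and find that interventions in the training algorithm, such as batch or per-example gradient clipping and early stopping, fail to prevent user inference. However, limiting the number of fine-tuning samples from a single user can reduce attack effectiveness, although it also reduces the amount of fine-tuning data.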
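One plausible way to instantiate such an attack is a likelihood-ratio statistic: average, over the user's samples, the difference between the log-likelihood under the fine-tuned model and under a reference model, then threshold it. The abstract does not spell out the statistic, so treat this as an assumption; the function names and the toy log-probability tables below are hypothetical stand-ins for black-box model queries.

```python
from statistics import mean

def user_inference_score(samples, logp_finetuned, logp_reference):
    """Mean log-likelihood ratio over a user's samples (higher = more suspicious)."""
    return mean(logp_finetuned(x) - logp_reference(x) for x in samples)

def was_user_in_finetuning(samples, logp_finetuned, logp_reference, threshold=0.0):
    """Flag the user as a likely fine-tuning participant if the score exceeds the threshold."""
    return user_inference_score(samples, logp_finetuned, logp_reference) > threshold

# Hypothetical log-probabilities returned by the two models for each text.
finetuned = {"alice email 1": -1.0, "alice email 2": -2.0, "bob note": -5.0}
reference = {"alice email 1": -4.0, "alice email 2": -4.0, "bob note": -5.0}

alice = ["alice email 1", "alice email 2"]
bob = ["bob note"]
print(was_user_in_finetuning(alice, finetuned.get, reference.get))
print(was_user_in_finetuning(bob, finetuned.get, reference.get))
```

Averaging over several samples is what distinguishes this user-level test from per-example membership inference: a user who contributed many samples shifts the mean more reliably.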

PromptRE: Weakly-Supervised Document-Level Relation Extraction via Prompting-Based Data Programming

  • paper_url: http://arxiv.org/abs/2310.09265
  • repo_url: None
  • paper_authors: Chufan Gao, Xulin Fan, Jimeng Sun, Xuan Wang
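  • for: The paper proposes a new weakly-supervised document-level relation extraction method that reduces the time and labor cost of the manual annotation that traditional methods rely on.
  • methods: The method combines prompting-based techniques with data programming and uses the label distribution and entity types as prior knowledge to improve performance.
  • results: Experiments on the ReDocRED benchmark show that PromptRE outperforms baseline approaches and effectively handles the "no relation" problem.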
    Abstract Relation extraction aims to classify the relationships between two entities into pre-defined categories. While previous research has mainly focused on sentence-level relation extraction, recent studies have expanded the scope to document-level relation extraction. Traditional relation extraction methods heavily rely on human-annotated training data, which is time-consuming and labor-intensive. To mitigate the need for manual annotation, recent weakly-supervised approaches have been developed for sentence-level relation extraction while limited work has been done on document-level relation extraction. Weakly-supervised document-level relation extraction faces significant challenges due to an imbalanced number of "no relation" instances and the failure of directly probing pretrained large language models for document relation extraction. To address these challenges, we propose PromptRE, a novel weakly-supervised document-level relation extraction method that combines prompting-based techniques with data programming. Furthermore, PromptRE incorporates the label distribution and entity types as prior knowledge to improve the performance. By leveraging the strengths of both prompting and data programming, PromptRE achieves improved performance in relation classification and effectively handles the "no relation" problem. Experimental results on ReDocRED, a benchmark dataset for document-level relation extraction, demonstrate the superiority of PromptRE over baseline approaches.
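    Summary Relation extraction aims to classify the relationship between two entities into predefined categories. Earlier research focused mainly on sentence-level relation extraction, while recent studies have extended it to the document level. Traditional relation extraction methods rely almost entirely on human-annotated training data, which is time-consuming and labor-intensive to produce. To reduce the need for manual annotation, recent weakly-supervised methods have been applied to sentence-level relation extraction. Weakly-supervised document-level relation extraction, however, suffers from a severe imbalance of "no relation" instances and from the failure of directly probing pretrained large language models for document relation extraction. To address these challenges, we propose PromptRE, a new weakly-supervised document-level relation extraction method that combines prompting with data programming. PromptRE also uses the label distribution and entity types as prior knowledge to improve performance. By drawing on the strengths of both prompting and data programming, PromptRE improves relation classification and handles the "no relation" problem effectively. Experiments show that PromptRE outperforms baseline methods on the ReDocRED test set.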
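As a rough illustration of how noisy prompt outputs, a label prior, and entity-type constraints could be combined, the sketch below aggregates votes with a naive-Bayes-style rule. This is not PromptRE's actual label model; the function name `aggregate_votes`, the shared `accuracy` parameter, and the example relation names are all assumptions.

```python
def aggregate_votes(votes, prior, accuracy=0.7, allowed=None):
    """Pick the label maximizing prior * likelihood of the observed prompt votes.

    votes:    labels proposed by different prompts (noisy labeling functions)
    prior:    dict label -> prior probability (the label distribution)
    accuracy: assumed probability that a single prompt votes for the true label
    allowed:  optional set of labels compatible with the entity types
    """
    candidates = [label for label in prior if allowed is None or label in allowed]
    wrong = (1.0 - accuracy) / (len(prior) - 1)  # mass spread over the other labels
    scores = {}
    for label in candidates:
        score = prior[label]
        for vote in votes:
            score *= accuracy if vote == label else wrong
        scores[label] = score
    return max(scores, key=scores.get)

prior = {"no_relation": 0.8, "founded_by": 0.1, "born_in": 0.1}
print(aggregate_votes(["founded_by", "founded_by", "no_relation"], prior))
print(aggregate_votes(["founded_by", "founded_by", "founded_by"], prior))
```

With a prior that heavily favors "no relation", a split vote falls back to "no_relation", while a unanimous vote overrides the prior. This mirrors why the label distribution matters when most entity pairs are unrelated.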

Political claim identification and categorization in a multilingual setting: First experiments

  • paper_url: http://arxiv.org/abs/2310.09256
  • repo_url: None
  • paper_authors: Urs Zaberer, Sebastian Padó, Gabriella Lapesa
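  • for: The paper explores strategies for the cross-lingual projection of political claims analysis.
  • methods: It compares machine translation and multilingual embeddings on two tasks, claim identification and categorization, across German, English, and French.
  • results: Experiments use the German dataset DebateNet2.0, which covers the policy debate sparked by the 2015 refugee crisis; machine translation was the best method in these experiments.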
    Abstract The identification and classification of political claims is an important step in the analysis of political newspaper reports; however, resources for this task are few and far between. This paper explores different strategies for the cross-lingual projection of political claims analysis. We conduct experiments on a German dataset, DebateNet2.0, covering the policy debate sparked by the 2015 refugee crisis. Our evaluation involves two tasks (claim identification and categorization), three languages (German, English, and French) and two methods (machine translation -- the best method in our experiments -- and multilingual embeddings).
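    Summary Identifying and classifying political claims is an important step in analyzing political news reports, but resources for the task are scarce. This paper explores different strategies for the cross-lingual projection of political claims analysis. We run experiments on the German dataset DebateNet2.0, which covers the policy debate sparked by the 2015 refugee crisis. Our evaluation covers two tasks (claim identification and categorization), three languages (German, English, and French), and two methods (machine translation, the best method in our experiments, and multilingual embeddings).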

Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy

  • paper_url: http://arxiv.org/abs/2310.09247
  • repo_url: https://github.com/yandex-research/text-to-img-hypernymy
  • paper_authors: Anton Baryshnikov, Max Ryabinin
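  • for: The study measures the language understanding capabilities of popular text-to-image models, specifically their grasp of hypernymy.
  • methods: Two automatic metrics are designed from the WordNet semantic hierarchy and existing image classifiers pretrained on ImageNet. They enable broad quantitative comparison of linguistic capabilities across text-to-image models and expose fine-grained qualitative differences, such as words that are unknown to a model.
  • results: A comprehensive evaluation of popular text-to-image models, including GLIDE, Latent Diffusion, and Stable Diffusion, shows that the metrics give a better understanding of each model's individual strengths and weaknesses.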
    Abstract Text-to-image synthesis has recently attracted widespread attention due to rapidly improving quality and numerous practical applications. However, the language understanding capabilities of text-to-image models are still poorly understood, which makes it difficult to reason about prompt formulations that a given model would understand well. In this work, we measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words. We design two automatic metrics based on the WordNet semantic hierarchy and existing image classifiers pretrained on ImageNet. These metrics both enable broad quantitative comparison of linguistic capabilities for text-to-image models and offer a way of finding fine-grained qualitative differences, such as words that are unknown to models and thus are difficult for them to draw. We comprehensively evaluate popular text-to-image models, including GLIDE, Latent Diffusion, and Stable Diffusion, showing how our metrics can provide a better understanding of the individual strengths and weaknesses of these models.
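One way a hierarchy-based metric of this kind could work is to generate an image for a hypernym prompt, classify it, and sum the classifier's probability mass over the classes that fall under that hypernym. The abstract does not define the two metrics, so the sketch below is an assumption about their general shape; the toy hierarchy and the names `leaves_under` and `in_subtree_probability` are hypothetical.

```python
def leaves_under(node, children):
    """Collect the leaf classes in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    leaves = set()
    for kid in kids:
        leaves |= leaves_under(kid, children)
    return leaves

def in_subtree_probability(class_probs, hypernym, children):
    """Probability mass the classifier assigns to classes that are hyponyms of `hypernym`."""
    subtree = leaves_under(hypernym, children)
    return sum(p for cls, p in class_probs.items() if cls in subtree)

# Toy stand-in for a WordNet fragment and for classifier output on one generated image.
children = {"animal": ["dog", "cat"], "dog": ["beagle", "pug"]}
class_probs = {"beagle": 0.5, "pug": 0.25, "cat": 0.125, "teapot": 0.125}

print(in_subtree_probability(class_probs, "dog", children))
print(in_subtree_probability(class_probs, "animal", children))
```

A low score for a given hypernym would suggest the generator does not associate the word with its hyponyms, which is the kind of fine-grained difference the abstract describes.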

Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration

  • paper_url: http://arxiv.org/abs/2310.09241
  • repo_url: https://github.com/wuyiquan/PLJP
  • paper_authors: Yiquan Wu, Siying Zhou, Yifei Liu, Weiming Lu, Xiaozhong Liu, Yating Zhang, Changlong Sun, Fei Wu, Kun Kuang
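  • for: Legal Judgment Prediction (LJP), the task of predicting the judgment of a case from its fact description, has become increasingly important in Legal AI.
  • methods: The authors propose a precedent-enhanced LJP framework (PLJP) that leverages the strengths of both large language models (LLMs) and domain-specific models in the context of precedents. The domain models efficiently provide candidate labels and find relevant precedents, while the LLM makes the final prediction through in-context comprehension of the precedents.
  • results: Experiments on a real-world dataset demonstrate the effectiveness of PLJP. The LLM and domain-model collaboration can also be generalized to other vertical domains.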
    Abstract Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI, i.e., predicting the judgment of the case in terms of case fact description. Precedents are the previous legal cases with similar facts, which are the basis for the judgment of the subsequent case in national legal systems. Thus, it is worthwhile to explore the utilization of precedents in the LJP. Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task. These can be broken down into two categories: large language models (LLMs) and domain-specific models. LLMs are capable of interpreting and generating complex natural language, while domain models are efficient in learning task-specific information. In this paper, we propose the precedent-enhanced LJP framework (PLJP), a system that leverages the strength of both LLM and domain models in the context of precedents. Specifically, the domain models are designed to provide candidate labels and find the proper precedents efficiently, and the large models will make the final prediction with an in-context precedents comprehension. Experiments on the real-world dataset demonstrate the effectiveness of our PLJP. Moreover, our work shows a promising direction for LLM and domain-model collaboration that can be generalized to other vertical domains.
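    Summary Legal Judgment Prediction (LJP), which predicts the judgment of a case from its fact description, is becoming increasingly important in Legal AI. Precedents are previous legal cases with similar facts, and in national legal systems they form the basis for judging subsequent cases, so exploring their use in LJP is worthwhile. Advances in deep learning have made a variety of techniques available for the LJP task. They fall into two categories: large language models (LLMs) and domain-specific models. LLMs can interpret and generate complex natural language, while domain-specific models learn task-specific information efficiently. In this paper, we propose the precedent-enhanced LJP framework (PLJP), a system that leverages the strengths of both LLMs and domain models. Specifically, the domain models provide candidate labels and quickly find relevant precedents, and the LLM makes the final prediction based on the precedents. Experiments show that PLJP is effective on a real-world dataset. Our work also points to a direction for collaboration between LLMs and domain models that can be generalized to other vertical domains.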

BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts

  • paper_url: http://arxiv.org/abs/2310.09238
  • repo_url: https://github.com/Saumajit/BanglaNLP/tree/main/Task_2
  • paper_authors: Saumajit Saha, Albert Nanda
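  • for: This work addresses sentiment analysis of Bangla social media posts, training and evaluating Transformer-based models in a low-resource language scenario.
  • methods: The authors experiment with various Transformer architectures, including a model already fine-tuned on Twitter data, as well as different hyperparameters and mixing strategies.
  • results: Transfer learning helps models learn better in this low-resource scenario. The best fine-tuned model obtains a micro-F1 of 67.02% on the test set, ranking 21st on the shared-task leaderboard. A detailed error analysis also finds instances where ground-truth labels need to be re-examined.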
    Abstract Bangla is the 7th most widely spoken language globally, with a staggering 234 million native speakers primarily hailing from India and Bangladesh. This morphologically rich language boasts a rich literary tradition, encompassing diverse dialects and language-specific challenges. Despite its linguistic richness and history, Bangla remains categorized as a low-resource language within the natural language processing (NLP) and speech community. This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop. We experiment with various Transformer-based architectures to solve this task. Our quantitative results show that transfer learning really helps in better learning of the models in this low-resource language scenario. This becomes evident when we further finetune a model which has already been finetuned on Twitter data for sentiment analysis task and that finetuned model performs the best among all other models. We also perform a detailed error analysis where we find some instances where ground truth labels need to be relooked at. We obtain a micro-F1 of 67.02% on the test set and our performance in this shared task is ranked at 21 in the leaderboard.
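    Summary Bangla is the seventh most widely spoken language in the world, with 234 million native speakers, mainly from India and Bangladesh. The language is morphologically rich, with diverse dialects and language-specific challenges. Despite its linguistic richness and history, Bangla is still regarded as a low-resource language in the natural language processing (NLP) and speech communities. This paper describes our participation in Task 2 (Sentiment Analysis of Bangla Social Media Posts). We experiment with different Transformer architectures to solve the task. Our quantitative results show that transfer learning does help models learn better in a low-resource setting. This becomes evident when we further fine-tune a model that was already fine-tuned on Twitter data for sentiment analysis: that model performs best among all models. We also carry out a detailed error analysis and find instances where the ground-truth labels need to be re-examined. We obtain a micro-F1 of 67.02% on the test set and rank 21st in the shared task.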
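For readers unfamiliar with the reported metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing F1. The sketch below shows the standard computation on made-up labels; the label names are illustrative and not taken from the shared-task data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            tp += pred == label and truth == label
            fp += pred == label and truth != label
            fn += pred != label and truth == label
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up gold labels and predictions for four posts.
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "neutral", "neutral", "negative"]
print(micro_f1(y_true, y_pred))
```

For single-label multiclass classification, every wrong prediction counts once as a false positive and once as a false negative, so micro-F1 coincides with plain accuracy.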

AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems

  • paper_url: http://arxiv.org/abs/2310.09233
  • repo_url: None
  • paper_authors: Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, Ji-Rong Wen
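  • for: The paper aims to simulate user behavior, in particular user-item interactions in recommender systems.
  • methods: Both users and items are treated as LLM-powered agents, and a collaborative learning approach optimizes the two kinds of agents together.
  • results: The optimized agents exhibit personalized behaviors akin to those of real-world individuals and can predict the behavior that users will show in future interactions.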
    Abstract Recently, there has been an emergence of employing LLM-powered agents as believable human proxies, based on their remarkable decision-making capability. However, existing studies mainly focus on simulating human dialogue. Human non-verbal behaviors, such as item clicking in recommender systems, although implicitly exhibiting user preferences and could enhance the modeling of users, have not been deeply explored. The main reasons lie in the gap between language modeling and behavior modeling, as well as the incomprehension of LLMs about user-item relations. To address this issue, we propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering. We creatively consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together. Specifically, at each time step, we first prompt the user and item agents to interact autonomously. Then, based on the disparities between the agents' decisions and real-world interaction records, user and item agents are prompted to reflect on and adjust the misleading simulations collaboratively, thereby modeling their two-sided relations. The optimized agents can also propagate their preferences to other agents in subsequent interactions, implicitly capturing the collaborative filtering idea. Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions. The results show that these agents can demonstrate personalized behaviors akin to those of real-world individuals, sparking the development of next-generation user behavior simulation.
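    Summary Recently there has been a trend toward using LLM-powered agents as believable human proxies, based on their strong decision-making ability. Existing studies, however, focus mainly on simulating human dialogue. Non-verbal user behaviors, such as item clicks in recommender systems, implicitly express user preferences but have not been explored in depth. The main reasons are the gap between language modeling and behavior modeling and the LLMs' lack of understanding of user-item relations. To address this, we propose AgentCF, which simulates user-item interactions through agent-based collaborative filtering. We treat both users and items as agents and develop a collaborative learning method that optimizes the two kinds of agents together. Specifically, at each time step we first let the user and item agents interact autonomously. Then, based on the disparities between the agents' decisions and real interaction records, the user and item agents are prompted to reflect on and adjust the misleading simulations, thereby modeling their two-sided relations. The optimized agents can also pass their preferences on to other agents in subsequent interactions, implicitly capturing the idea of collaborative filtering. Overall, the optimized agents in our framework exhibit diverse interaction behaviors, including user-item, user-user, item-item, and collective interactions. The results show that these agents can display personalized behaviors similar to those of real-world individuals, encouraging the development of next-generation user behavior simulation.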

Automated Claim Matching with Large Language Models: Empowering Fact-Checkers in the Fight Against Misinformation

  • paper_url: http://arxiv.org/abs/2310.09223
  • repo_url: None
  • paper_authors: Eun Cheol Choi, Emilio Ferrara
  • for: Enhancing the automation of fact-checking.
  • methods: Large language models (LLMs) generate simulated social media posts, which are used to fine-tune more specialized LLMs for claim matching.
  • results: The fine-tuned LLMs rival the performance of larger pre-trained LLMs in claim matching tasks, aligning closely with human annotations.
    Abstract In today's digital era, the rapid spread of misinformation poses threats to public well-being and societal trust. As online misinformation proliferates, manual verification by fact checkers becomes increasingly challenging. We introduce FACT-GPT (Fact-checking Augmentation with Claim matching Task-oriented Generative Pre-trained Transformer), a framework designed to automate the claim matching phase of fact-checking using Large Language Models (LLMs). This framework identifies new social media content that either supports or contradicts claims previously debunked by fact-checkers. Our approach employs GPT-4 to generate a labeled dataset consisting of simulated social media posts. This data set serves as a training ground for fine-tuning more specialized LLMs. We evaluated FACT-GPT on an extensive dataset of social media content related to public health. The results indicate that our fine-tuned LLMs rival the performance of larger pre-trained LLMs in claim matching tasks, aligning closely with human annotations. This study achieves three key milestones: it provides an automated framework for enhanced fact-checking; demonstrates the potential of LLMs to complement human expertise; offers public resources, including datasets and models, to further research and applications in the fact-checking domain.
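    Summary In today's digital era, rapidly spreading misinformation threatens public well-being and societal trust. As misinformation spreads online, manual verification becomes increasingly difficult. We introduce FACT-GPT (Fact-checking Augmentation with Claim matching Task-oriented Generative Pre-trained Transformer), a framework that automates the claim matching phase of fact-checking. The framework identifies new social media content that either supports or contradicts claims previously debunked by fact-checkers. Our method uses GPT-4 to generate a labeled dataset for training more specialized LLMs. We evaluate on a large set of social media content related to public health, and the results show that our fine-tuned LLMs rival larger pre-trained LLMs on claim matching tasks and align closely with human annotations. This study reaches three key milestones: it provides an automated framework for enhanced fact-checking, demonstrates that LLMs can complement human expertise, and offers public resources, including datasets and models, for further research and applications in the fact-checking domain.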

Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration

  • paper_url: http://arxiv.org/abs/2310.09168
  • repo_url: https://github.com/fanqiwan/explore-instruct
  • paper_authors: Fanqi Wan, Xinting Huang, Tao Yang, Xiaojun Quan, Wei Bi, Shuming Shi
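  • for: Preparing instruction-tuning data with broader domain and task coverage for domain-specific instruction-tuning.
  • methods: Active exploration with large language models (LLMs), driven by a search algorithm, yields diversified and domain-focused instruction-tuning data.
  • results: Domain-specific instruction coverage improves clearly, and the resulting model shows considerable gains over multiple baselines.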
    Abstract Instruction-tuning can be substantially optimized through enhanced diversity, resulting in models capable of handling a broader spectrum of tasks. However, existing data employed for such tuning often exhibit an inadequate coverage of individual domains, limiting the scope for nuanced comprehension and interactions within these areas. To address this deficiency, we propose Explore-Instruct, a novel approach to enhance the data coverage to be used in domain-specific instruction-tuning through active exploration via Large Language Models (LLMs). Built upon representative domain use cases, Explore-Instruct explores a multitude of variations or possibilities by implementing a search algorithm to obtain diversified and domain-focused instruction-tuning data. Our data-centric analysis validates the effectiveness of this proposed approach in improving domain-specific instruction coverage. Moreover, our model's performance demonstrates considerable advancements over multiple baselines, including those utilizing domain-specific data enhancement. Our findings offer a promising opportunity to improve instruction coverage, especially in domain-specific contexts, thereby advancing the development of adaptable language models. Our code, model weights, and data are public at https://github.com/fanqiwan/Explore-Instruct.
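    Summary Instruction-tuning can be optimized substantially through greater diversity, producing models that handle a broader range of tasks. However, the data currently used for such tuning often covers individual domains inadequately, which limits nuanced comprehension and interaction within those domains. To address this, we propose Explore-Instruct, which uses active exploration with large language models (LLMs) to obtain diversified and domain-focused instruction-tuning data. Starting from representative domain use cases, it applies a search algorithm to explore many variations and possibilities. Our data analysis shows that the proposed approach improves domain-specific instruction coverage. Our model also outperforms multiple baselines, including those that use domain-specific data enhancement. These findings offer a promising way to improve instruction coverage, especially in domain-specific contexts. Our code, model weights, and data are available at https://github.com/fanqiwan/Explore-Instruct.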

Developing a Natural Language Understanding Model to Characterize Cable News Bias

  • paper_url: http://arxiv.org/abs/2310.09166
  • repo_url: None
  • paper_authors: Seth P. Benson, Iain J. Cruickshank
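  • for: The study develops a media bias detection method that needs no human labeling, so that cable news programs can be assessed objectively.
  • methods: The method analyzes which topics are mentioned, through Named Entity Recognition, and how they are discussed, through Stance Analysis, and then clusters programs with similar biases together.
  • results: Applied to 2020 cable news transcripts, the method finds program clusters that are consistent over time and roughly correspond to the cable news network of each program. It points to future tools that could objectively assess media bias and characterize unfamiliar media environments.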
    Abstract Media bias has been extensively studied by both social and computational sciences. However, current work still has a large reliance on human input and subjective assessment to label biases. This is especially true for cable news research. To address these issues, we develop an unsupervised machine learning method to characterize the bias of cable news programs without any human input. This method relies on the analysis of what topics are mentioned through Named Entity Recognition and how those topics are discussed through Stance Analysis in order to cluster programs with similar biases together. Applying our method to 2020 cable news transcripts, we find that program clusters are consistent over time and roughly correspond to the cable news network of the program. This method reveals the potential for future tools to objectively assess media bias and characterize unfamiliar media environments.
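    Summary Media bias has been studied extensively in both the social and computational sciences. However, current work still relies heavily on human input and subjective assessment to label bias, especially in cable news research. To address these issues, we develop an unsupervised machine learning method that characterizes the bias of cable news programs without any human input. The method uses Named Entity Recognition and Stance Analysis to cluster programs with similar biases. Applying it to 2020 cable news transcripts, we find that program clusters are fairly stable over time and roughly correspond to the cable news networks. The method shows the potential for future tools to assess media bias objectively and to characterize unfamiliar media environments.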

BibRank: Automatic Keyphrase Extraction Platform Using Metadata

  • paper_url: http://arxiv.org/abs/2310.09151
  • repo_url: https://github.com/dallal9/bibrank
  • paper_authors: Abdelrhman Eldallal, Eduard Barbu
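  • for: The paper presents a platform that integrates keyphrase datasets and facilitates the evaluation of keyphrase extraction algorithms.
  • methods: The platform includes BibRank, an automatic keyphrase extraction algorithm that leverages a rich dataset obtained by parsing bibliographic data in BibTeX format. BibRank combines innovative weighting techniques with positional, statistical, and word co-occurrence information to extract keyphrases.
  • results: The platform is valuable for researchers and developers who want to improve keyphrase extraction algorithms and advance natural language processing.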
    Abstract Automatic Keyphrase Extraction involves identifying essential phrases in a document. These keyphrases are crucial in various tasks such as document classification, clustering, recommendation, indexing, searching, summarization, and text simplification. This paper introduces a platform that integrates keyphrase datasets and facilitates the evaluation of keyphrase extraction algorithms. The platform includes BibRank, an automatic keyphrase extraction algorithm that leverages a rich dataset obtained by parsing bibliographic data in BibTeX format. BibRank combines innovative weighting techniques with positional, statistical, and word co-occurrence information to extract keyphrases from documents. The platform proves valuable for researchers and developers seeking to enhance their keyphrase extraction algorithms and advance the field of natural language processing.
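    Summary Automatic keyphrase extraction identifies the important phrases in a document. These keyphrases are crucial in tasks such as document classification, clustering, recommendation, indexing, searching, summarization, and text simplification. This paper introduces a platform that integrates keyphrase datasets and supports the evaluation of keyphrase extraction algorithms. The platform includes BibRank, an automatic keyphrase extraction algorithm that draws on rich data from bibliographic records in BibTeX format. BibRank combines innovative weighting techniques with positional, statistical, and word co-occurrence information to extract keyphrases from documents. The platform is valuable to researchers and developers working to improve keyphrase extraction algorithms and to advance natural language processing.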

PuoBERTa: Training and evaluation of a curated language model for Setswana

  • paper_url: http://arxiv.org/abs/2310.09141
  • repo_url: https://github.com/dsfsi/puodata
  • paper_authors: Vukosi Marivate, Moseli Mots’Oehli, Valencia Wagner, Richard Lastrucci, Isheanesu Dzingirai
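  • for: The study aims to improve natural language processing (NLP) capabilities for low-resource languages such as Setswana.
  • methods: PuoBERTa, a customised masked language model, is trained on a high-quality corpus built from diverse, curated monolingual Setswana texts.
  • results: PuoBERTa proves effective on part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. The paper also introduces a new Setswana news categorisation dataset with initial benchmarks.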
    Abstract Natural language processing (NLP) has made significant progress for well-resourced languages such as English but lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpus for PuoBERTa's training. Building upon previous efforts in creating monolingual resources for Setswana, we evaluated PuoBERTa across several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduced a new Setswana news categorisation dataset and provided the initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.
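    Summary Natural language processing (NLP) has made significant progress for well-resourced languages such as English but lags behind for low-resource languages such as Setswana. This paper addresses the gap by presenting PuoBERTa, a masked language model trained specifically for Setswana. We describe in detail how we collected, curated, and prepared diverse monolingual texts to produce a high-quality training corpus for PuoBERTa. Building on earlier efforts to create monolingual resources for Setswana, we evaluate PuoBERTa on several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. We also provide a new Setswana news categorisation dataset with initial benchmarks using PuoBERTa. Our work shows that PuoBERTa can advance NLP capabilities for low-resource languages like Setswana and paves the way for future research.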

A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check

  • paper_url: http://arxiv.org/abs/2310.09119
  • repo_url: None
  • paper_authors: Haojing Huang, Jingheng Ye, Qingyu Zhou, Yinghui Li, Yangning Li, Feng Zhou, Hai-Tao Zheng
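  • for: Improving Chinese Spelling Check (CSC) by leveraging external knowledge about the Chinese language more directly and efficiently.
  • methods: The CSC workflow is decomposed into detection, reasoning, and searching subtasks, and a detection-and-reasoning module is designed to be compatible with existing SOTA non-autoregressive CSC models.
  • results: The plug-and-play detection-and-reasoning module boosts the performance of existing models, and a module trained for one model also benefits other models. The paper also studies the primary interpretability provided by the task decomposition. Extensive experiments and detailed analyses demonstrate the module's effectiveness and competitiveness.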
    Abstract In recent years, Chinese Spelling Check (CSC) has been greatly improved by designing task-specific pre-training methods or introducing auxiliary tasks, which mostly solve this task in an end-to-end fashion. In this paper, we propose to decompose the CSC workflow into detection, reasoning, and searching subtasks so that the rich external knowledge about the Chinese language can be leveraged more directly and efficiently. Specifically, we design a plug-and-play detection-and-reasoning module that is compatible with existing SOTA non-autoregressive CSC models to further boost their performance. We find that the detection-and-reasoning module trained for one model can also benefit other models. We also study the primary interpretability provided by the task decomposition. Extensive experiments and detailed analyses demonstrate the effectiveness and competitiveness of the proposed module.
    Note: "CSC" stands for Chinese Spelling Check and "SOTA" for state of the art. A non-autoregressive model predicts all output tokens in parallel rather than generating each token conditioned on previously generated ones.

Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

  • paper_url: http://arxiv.org/abs/2310.09089
  • repo_url: https://github.com/williamliujl/Qilin-Med
  • paper_authors: Qichen Ye, Junling Liu, Dading Chong, Peilin Zhou, Yining Hua, Andrew Liu
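  • for: The paper investigates how to improve the performance of large language models (LLMs) in the medical domain.
  • methods: A multi-stage training method combines Domain-specific Continued Pre-training (DCPT), Supervised Fine-tuning (SFT), and Direct Preference Optimization (DPO).
  • results: In the CPT and SFT phases, Qilin-Med achieves 38.4% and 40.0% accuracy on CMExam, surpassing Baichuan-7B's 33.5%. In the DPO phase, it scores 16.66 in BLEU-1 and 27.44 in ROUGE1 on the Huatuo-26M test set, outperforming the SFT model's 12.69 and 24.21.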
    Abstract Integrating large language models (LLMs) into healthcare presents potential but faces challenges. Directly pre-training LLMs for domains like medicine is resource-heavy and sometimes unfeasible. Sole reliance on Supervised Fine-tuning (SFT) can result in overconfident predictions and may not tap into domain specific insights. Addressing these challenges, we present a multi-stage training method combining Domain-specific Continued Pre-training (DCPT), SFT, and Direct Preference Optimization (DPO). A notable contribution of our study is the introduction of a 3Gb Chinese Medicine (ChiMed) dataset, encompassing medical question answering, plain texts, knowledge graphs, and dialogues, segmented into three training stages. The medical LLM trained with our pipeline, Qilin-Med, exhibits significant performance boosts. In the CPT and SFT phases, it achieves 38.4% and 40.0% accuracy on the CMExam, surpassing Baichuan-7B's 33.5%. In the DPO phase, on the Huatuo-26M test set, it scores 16.66 in BLEU-1 and 27.44 in ROUGE1, outperforming the SFT's 12.69 and 24.21. This highlights the strength of our training approach in refining LLMs for medical applications.
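    Summary Applying large language models (LLMs) to healthcare holds promise but also faces challenges. Directly pre-training LLMs for the medical domain can be too resource-heavy and is sometimes infeasible. Relying on Supervised Fine-tuning (SFT) alone may fail to capture domain-specific knowledge. To address these challenges, we propose a multi-stage training method that combines Domain-specific Continued Pre-training (DCPT), SFT, and Direct Preference Optimization (DPO). An important contribution of our study is a 3Gb Chinese Medicine (ChiMed) dataset, which includes medical question answering, plain texts, knowledge graphs, and dialogues, divided across three training stages. Qilin-Med, the medical LLM trained with our pipeline, reaches 38.4% and 40.0% accuracy in the CPT and SFT phases, surpassing Baichuan-7B's 33.5%. In the DPO phase, on the Huatuo-26M test set, it scores 16.66 in BLEU-1 and 27.44 in ROUGE1, exceeding the SFT model's 12.69 and 24.21. This shows the strength of our training method in refining LLMs for medical applications.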
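The third stage, Direct Preference Optimization, trains on pairs of a preferred and a rejected response. A minimal sketch of the standard per-pair DPO loss is shown below, taking sequence log-probabilities under the policy and the frozen reference model as inputs. The function name, argument names, and the default `beta` are illustrative assumptions and are not taken from the Qilin-Med code.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the total log-probability of a response under the policy
    or the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before any training the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

The loss depends only on log-probability ratios against the reference model, which is why DPO needs no separately trained reward model.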

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

  • paper_url: http://arxiv.org/abs/2310.09036
  • repo_url: https://github.com/declare-lab/mm-bigbench
  • paper_authors: Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria
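  • for: This work evaluates multimodal large language models (MLLMs) on multimodal (vision-language) content comprehension tasks, which existing evaluations, focused on unimodal (vision) content, largely neglect.
  • methods: The MM-BigBench framework evaluates models and instructions with four metrics: Best Performance (each model's upper bound on a dataset), Mean Relative Gain (overall performance), Stability (sensitivity), and Adaptability (how well models and instructions suit each other).
  • results: The study evaluates 20 language models (14 of them MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions per task, and derives novel insights.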
    Abstract The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of diverse multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.
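
The abstract names the metrics but does not give their formulas. The sketch below therefore assumes simple stand-in definitions: best score over instructions, mean percentage gain over the per-instruction average across models, and standard deviation over instructions. The model names and accuracy values are hypothetical, and the paper's actual definitions may differ.

```python
from statistics import mean, pstdev

# Hypothetical accuracies on one dataset: scores[model][instruction_index]
scores = {
    "model_a": [0.60, 0.70, 0.65],
    "model_b": [0.50, 0.52, 0.48],
}

def best_performance(scores):
    """Upper bound per model: its best score over all instructions."""
    return {m: max(v) for m, v in scores.items()}

def mean_relative_gain(scores):
    """Average percentage gain of each model over the cross-model mean."""
    n_instr = len(next(iter(scores.values())))
    col_means = [mean(scores[m][i] for m in scores) for i in range(n_instr)]
    return {
        m: mean((scores[m][i] - col_means[i]) / col_means[i] * 100
                for i in range(n_instr))
        for m in scores
    }

def stability(scores):
    """Sensitivity to the instruction: lower deviation means more stable."""
    return {m: pstdev(v) for m, v in scores.items()}
```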

Dont Add, dont Miss: Effective Content Preserving Generation from Pre-Selected Text Spans

  • paper_url: http://arxiv.org/abs/2310.09017
  • repo_url: https://github.com/lovodkin93/cdr_ctr
  • paper_authors: Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan
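  • for: This paper provides a reliable, open-source model for Controlled Text Reduction (CTR), a task whose existing baseline performs too poorly for practical use.
  • methods: The content-preservation constraint is reinforced during training via reinforcement learning (RL) and during inference via a controlled decoding strategy. The quality of the silver training data is improved through GPT-4 distillation.
  • results: Pairing the distilled dataset with the highlight-adherence strategies yields gains of up to 30 ROUGE-L points over the current baseline.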
    Abstract The recently introduced Controlled Text Reduction (CTR) task isolates the text generation step within typical summarization-style tasks. It does so by challenging models to generate coherent text conforming to pre-selected content within the input text ("highlights"). This framing enables increased modularity in summarization-like tasks, allowing to couple a single CTR model with various content-selection setups and modules. However, there are currently no reliable CTR models, while the performance of the existing baseline for the task is mediocre, falling short of practical utility. Here, we address this gap by introducing a high-quality, open-source CTR model that tackles two prior key limitations: inadequate enforcement of the content-preservation constraint, and suboptimal silver training data. Addressing these, we amplify the content-preservation constraint in both training, via RL, and inference, via a controlled decoding strategy. Further, we substantially improve the silver training data quality via GPT-4 distillation. Overall, pairing the distilled dataset with the highlight-adherence strategies yields marked gains over the current baseline, of up to 30 ROUGE-L points, providing a reliable CTR model for downstream use.

Towards Example-Based NMT with Multi-Levenshtein Transformers

  • paper_url: http://arxiv.org/abs/2310.08967
  • repo_url: https://github.com/maxwell1447/fairseq
  • paper_authors: Maxime Bouthors, Josep Crego, François Yvon
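  • for: This work studies how Retrieval-Augmented Machine Translation (RAMT) can make translation decisions more transparent by letting users go back to the examples that contributed to them.
  • methods: A retrieval-augmented version of the Levenshtein Transformer is adapted to edit multiple fuzzy matches found in memory simultaneously. Training and inference rely on multi-way alignment algorithms and imitation learning.
  • results: Experiments show that editing several examples improves translation scores and notably increases the number of target spans copied from existing instances.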
    Abstract Retrieval-Augmented Machine Translation (RAMT) is attracting growing attention. This is because RAMT not only improves translation metrics, but is also assumed to implement some form of domain adaptation. In this contribution, we study another salient trait of RAMT, its ability to make translation decisions more transparent by allowing users to go back to examples that contributed to these decisions. For this, we propose a novel architecture aiming to increase this transparency. This model adapts a retrieval-augmented version of the Levenshtein Transformer and makes it amenable to simultaneously edit multiple fuzzy matches found in memory. We discuss how to perform training and inference in this model, based on multi-way alignment algorithms and imitation learning. Our experiments show that editing several examples positively impacts translation scores, notably increasing the number of target spans that are copied from existing instances.

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

  • paper_url: http://arxiv.org/abs/2310.08958
  • repo_url: https://github.com/e0397123/xdial-eval
  • paper_authors: Chen Zhang, Luis Fernando D’Haro, Chengguang Tang, Ke Shi, Guohua Tang, Haizhou Li
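  • for: This work introduces xDial-Eval, a multilingual open-domain dialogue evaluation benchmark for testing how well reference-free learned metrics generalize beyond English.
  • methods: xDial-Eval is built on open-source English dialogue evaluation datasets (12 turn-level and 6 dialogue-level), which are extended to nine other languages with commercial machine translation systems. Previous BERT-based metrics and recent large language models are analyzed on the benchmark.
  • results: The newly established self-supervised, multilingual baselines perform strongly. In average Pearson correlation over all datasets and languages, the best baseline outperforms OpenAI's ChatGPT by absolute improvements of 6.5% and 4.6% at the turn and dialogue levels respectively, with far fewer parameters.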
    Abstract Recent advancements in reference-free learned metrics for open-domain dialogue evaluation have been driven by the progress in pre-trained language models and the availability of dialogue data with high-quality human annotations. However, current studies predominantly concentrate on English dialogues, and the generalization of these metrics to other languages has not been fully examined. This is largely due to the absence of a multilingual dialogue evaluation benchmark. To address the issue, we introduce xDial-Eval, built on top of open-source English dialogue evaluation datasets. xDial-Eval includes 12 turn-level and 6 dialogue-level English datasets, comprising 14930 annotated turns and 8691 annotated dialogues respectively. The English dialogue data are extended to nine other languages with commercial machine translation systems. On xDial-Eval, we conduct comprehensive analyses of previous BERT-based metrics and the recently-emerged large language models. Lastly, we establish strong self-supervised and multilingual baselines. In terms of average Pearson correlations over all datasets and languages, the best baseline outperforms OpenAI's ChatGPT by absolute improvements of 6.5% and 4.6% at the turn and dialogue levels respectively, albeit with much fewer parameters. The data and code are publicly available at https://github.com/e0397123/xDial-Eval.
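
The headline number above is an average Pearson correlation between metric scores and human annotations over datasets. A minimal sketch of that aggregation follows; the function names and data layout are assumptions for illustration and are not taken from the xDial-Eval code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between metric scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_correlation(per_dataset):
    """per_dataset maps a dataset name to (metric_scores, human_ratings)."""
    values = [pearson(m, h) for m, h in per_dataset.values()]
    return sum(values) / len(values)
```

Averaging per-dataset correlations, rather than pooling all turns, keeps a large dataset from dominating the summary.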

  • paper_url: http://arxiv.org/abs/2310.08954
  • repo_url: https://github.com/sulcantonin/text_icalepcs23
  • paper_authors: Antonin Sulc, Annika Eichler, Tim Wilksen
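  • for: This work applies textual analysis to past ICALEPCS and IPAC conference proceedings to gain insight into the research trends and topics of the field.
  • methods: Natural language processing techniques extract information from the abstracts and papers. Topics are extracted and visualized, their evolution is analyzed to identify emerging research directions, and interesting publications are highlighted based solely on their content.
  • results: The analysis provides a comprehensive overview of the research landscape, helping researchers and practitioners understand the state of the art and identify areas for future research. An advanced search tool for the existing papers is also planned.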
    Abstract In this paper, we show a textual analysis of past ICALEPCS and IPAC conference proceedings to gain insights into the research trends and topics discussed in the field. We use natural language processing techniques to extract meaningful information from the abstracts and papers of past conference proceedings. We extract topics to visualize and identify trends, analyze their evolution to identify emerging research directions, and highlight interesting publications based solely on their content with an analysis of their network. Additionally, we will provide an advanced search tool to better search the existing papers to prevent duplication and easier reference findings. Our analysis provides a comprehensive overview of the research landscape in the field and helps researchers and practitioners to better understand the state-of-the-art and identify areas for future research.

CAMELL: Confidence-based Acquisition Model for Efficient Self-supervised Active Learning with Label Validation

  • paper_url: http://arxiv.org/abs/2310.08944
  • repo_url: None
  • paper_authors: Carel van Niekerk, Christian Geishauser, Michael Heck, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Benjamin Ruppik, Renato Vukovic, Milica Gašić
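  • for: This paper proposes an active learning framework for sequential tasks that reduces the dependence on large, meticulously annotated datasets.
  • methods: CAMELL is a pool-based active learning framework with three core features: expert annotators label only a fraction of a chosen sequence, the remainder of the sequence is self-supervised, and a label validation mechanism prevents erroneous labels from contaminating the dataset and harming model performance.
  • results: In experiments on sequential tasks, with an emphasis on dialogue belief tracking, CAMELL outperforms the baselines in efficiency, and the data corrections it suggests improve the quality of the resulting datasets.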
    Abstract Supervised neural approaches are hindered by their dependence on large, meticulously annotated datasets, a requirement that is particularly cumbersome for sequential tasks. The quality of annotations tends to deteriorate with the transition from expert-based to crowd-sourced labelling. To address these challenges, we present CAMELL (Confidence-based Acquisition Model for Efficient self-supervised active Learning with Label validation), a pool-based active learning framework tailored for sequential multi-output problems. CAMELL possesses three core features: (1) it requires expert annotators to label only a fraction of a chosen sequence, (2) it facilitates self-supervision for the remainder of the sequence, and (3) it employs a label validation mechanism to prevent erroneous labels from contaminating the dataset and harming model performance. We evaluate CAMELL on sequential tasks, with a special emphasis on dialogue belief tracking, a task plagued by the constraints of limited and noisy datasets. Our experiments demonstrate that CAMELL outperforms the baselines in terms of efficiency. Furthermore, the data corrections suggested by our method contribute to an overall improvement in the quality of the resulting datasets.
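
The three features listed in the abstract can be sketched with a plain confidence threshold. This is only an illustration: the threshold value, the function names, and the rule for flagging labels are assumptions, and CAMELL's actual confidence estimation and validation models are not described in the abstract.

```python
def plan_annotation(predictions, threshold=0.9):
    """Split one sequence into expert-labelled and self-labelled positions.

    predictions: list of (predicted_label, confidence), one per position.
    """
    ask_expert = [i for i, (_, conf) in enumerate(predictions) if conf < threshold]
    self_labels = {i: label for i, (label, conf) in enumerate(predictions)
                   if conf >= threshold}
    return ask_expert, self_labels

def validate_labels(given_labels, predictions, threshold=0.9):
    """Return positions whose given label contradicts a confident prediction."""
    return [i for i, label in given_labels.items()
            if predictions[i][1] >= threshold and predictions[i][0] != label]
```

Flagged positions would go back to an annotator for a second look instead of being written to the dataset directly.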

Multi-level Adaptive Contrastive Learning for Knowledge Internalization in Dialogue Generation

  • paper_url: http://arxiv.org/abs/2310.08943
  • repo_url: None
  • paper_authors: Chenxu Yang, Zheng Lin, Lanrui Wang, Chong Tian, Liang Pang, Jiangnan Li, Qirong Ho, Yanan Cao, Weiping Wang
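  • for: This paper addresses copying-style degeneration in knowledge-grounded dialogue generation, where the model inserts segments of the provided knowledge into generic responses instead of internalizing the knowledge in a human-like way.
  • methods: The proposed Multi-level Adaptive Contrastive Learning (MACL) framework dynamically samples negative examples and penalizes degeneration behaviors at both the token level and the sequence level, so that the model cannot satisfy the objective by superficial pattern matching on overlap.
  • results: Extensive experiments on the WoW dataset demonstrate the effectiveness of the approach across various pre-trained models.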
    Abstract Knowledge-grounded dialogue generation aims to mitigate the issue of text degeneration by incorporating external knowledge to supplement the context. However, the model often fails to internalize this information into responses in a human-like manner. Instead, it simply inserts segments of the provided knowledge into generic responses. As a result, the generated responses tend to be tedious, incoherent, and in lack of interactivity which means the degeneration problem is still unsolved. In this work, we first find that such copying-style degeneration is primarily due to the weak likelihood objective, which allows the model to "cheat" the objective by merely duplicating knowledge segments in a superficial pattern matching based on overlap. To overcome this challenge, we then propose a Multi-level Adaptive Contrastive Learning (MACL) framework that dynamically samples negative examples and subsequently penalizes degeneration behaviors at both the token-level and sequence-level. Extensive experiments on the WoW dataset demonstrate the effectiveness of our approach across various pre-trained models.

Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning

  • paper_url: http://arxiv.org/abs/2310.08923
  • repo_url: None
  • paper_authors: Hongfu Liu, Ye Wang
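  • for: This study aims to reduce the instability of in-context learning (ICL) with large language models (LLMs), which shows high variance under random example selection even when the input distribution, ordering, and prompt format are held constant.
  • methods: The Information Gain (IG) obtained in prediction after observing a candidate example is quantified, and the candidates with maximum IG are sampled. A Calibration Before Sampling strategy mitigates the template bias that would otherwise distort the IG estimates.
  • results: The method yields an average relative improvement of 14.3% across six classification tasks using three LLMs.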
    Abstract Large Language models (LLMs) possess the capability to engage In-context Learning (ICL) by leveraging a few demonstrations pertaining to a new downstream task as conditions. However, this particular learning paradigm suffers from high instability stemming from substantial variances induced by factors such as the input distribution of selected examples, their ordering, and prompt formats. In this work, we demonstrate that even when all these factors are held constant, the random selection of examples still results in high variance. Consequently, we aim to explore the informative ability of data examples by quantifying the Information Gain (IG) obtained in prediction after observing a given example candidate. Then we propose to sample those with maximum IG. Additionally, we identify the presence of template bias, which can lead to unfair evaluations of IG during the sampling process. To mitigate this bias, we introduce Calibration Before Sampling strategy. The experimental results illustrate that our proposed method can yield an average relative improvement of 14.3% across six classification tasks using three LLMs.
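
The selection rule described in the abstract can be sketched as an entropy reduction. The sketch assumes that IG is the drop in entropy of the model's label distribution after seeing a candidate, and that calibration divides by the distribution obtained from a content-free input. Both formulas and all names are assumptions for illustration; the abstract does not state them.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def calibrate(probs, content_free_probs):
    """Remove template bias by rescaling with a content-free prediction."""
    scaled = [p / c for p, c in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]

def information_gain(prior, posterior):
    """Entropy reduction from the prior to the posterior label distribution."""
    return entropy(prior) - entropy(posterior)

def pick_max_ig(candidates, prior, content_free_probs):
    """candidates maps an example id to the model's label distribution for it."""
    return max(
        candidates,
        key=lambda name: information_gain(
            prior, calibrate(candidates[name], content_free_probs)),
    )
```

Without the calibration step, a template that already favors one label would make every candidate look informative for that label.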

Human-in-the-loop Machine Translation with Large Language Model

  • paper_url: http://arxiv.org/abs/2310.08908
  • repo_url: https://github.com/nlp2ct/hil-mt
  • paper_authors: Xinyi Yang, Runzhe Zhan, Derek F. Wong, Junchao Wu, Lidia S. Chao
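  • for: This study introduces human intervention into the inference process of large language models (LLMs) for machine translation, which previous work focused on the LLM itself has not explored.
  • methods: A human-in-the-loop pipeline first prompts the LLM for a draft translation, then uses automatic retrieval or human feedback as supervision signals to improve the translation through in-context learning with revision instructions. The human-machine interactions are stored in an external database to expand the in-context retrieval database, so that human supervision can also be used offline.
  • results: Evaluated with the GPT-3.5-turbo API on five domain-specific German-English benchmarks, the pipeline tailors in-domain translations and improves translation performance over direct translation. The analysis also covers different in-context retrieval methods, building a retrieval database in low-resource scenarios, domain differences, linguistic statistics, and translation cases.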
    Abstract The large language model (LLM) has garnered significant attention due to its in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies to apply LLMs to machine translation tasks and evaluate their performance from diverse perspectives. However, previous research has primarily focused on the LLM itself and has not explored human intervention in the inference process of LLM. The characteristics of LLM, such as in-context learning and prompt engineering, closely mirror human cognitive abilities in language tasks, offering an intuitive solution for human-in-the-loop generation. In this study, we propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs with revision instructions. The pipeline initiates by prompting the LLM to produce a draft translation, followed by the utilization of automatic retrieval or human feedback as supervision signals to enhance the LLM's translation through in-context learning. The human-machine interactions generated in this pipeline are also stored in an external database to expand the in-context retrieval database, enabling us to leverage human supervision in an offline setting. We evaluate the proposed pipeline using GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation. The results demonstrate the effectiveness of the pipeline in tailoring in-domain translations and improving translation performance compared to direct translation. Additionally, we discuss the results from the following perspectives: 1) the effectiveness of different in-context retrieval methods; 2) the construction of a retrieval database under low-resource scenarios; 3) the observed domains differences; 4) the quantitative analysis of linguistic statistics; and 5) the qualitative analysis of translation cases. The code and data are available at https://github.com/NLP2CT/HIL-MT/.
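
The retrieval side of such a pipeline can be sketched as a small feedback store that returns the most similar past revisions and formats them as in-context examples. The class, the prompt layout, and the use of `difflib` string similarity are illustrative assumptions; the paper compares several retrieval methods that are not reproduced here, and no LLM call is made.

```python
from difflib import SequenceMatcher

class FeedbackStore:
    """Keeps (source, draft, revised) triples from earlier interactions."""

    def __init__(self):
        self.records = []

    def add(self, source, draft, revised):
        self.records.append((source, draft, revised))

    def retrieve(self, source, k=1):
        # Rank stored records by surface similarity of their source sentence.
        ranked = sorted(
            self.records,
            key=lambda rec: SequenceMatcher(None, source, rec[0]).ratio(),
            reverse=True,
        )
        return ranked[:k]

def build_revision_prompt(source, draft, examples):
    """Format retrieved revisions as in-context examples for the next revision."""
    lines = []
    for ex_source, ex_draft, ex_revised in examples:
        lines += [f"Source: {ex_source}", f"Draft: {ex_draft}",
                  f"Revised: {ex_revised}", ""]
    lines += [f"Source: {source}", f"Draft: {draft}", "Revised:"]
    return "\n".join(lines)
```

Each accepted revision can be written back with `add`, which is how the store grows and lets earlier human feedback be reused later.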

SeqXGPT: Sentence-Level AI-Generated Text Detection

  • paper_url: http://arxiv.org/abs/2310.08903
  • repo_url: https://github.com/jihuai-wpy/seqxgpt
  • paper_authors: Pengyu Wang, Linyang Li, Ke Ren, Botian Jiang, Dong Zhang, Xipeng Qiu
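  • for: This study targets sentence-level detection of AI-generated text (AIGT), since existing detectors only operate at the document level.
  • methods: A dataset of documents polished with LLMs is synthesized, containing both human-written and LLM-modified sentences. The proposed SeqXGPT uses log probability lists from white-box LLMs as features and processes them with convolution and self-attention networks.
  • results: SeqXGPT significantly surpasses baseline methods on both sentence-level and document-level detection and shows strong generalization capabilities.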
    Abstract Widely applied large language models (LLMs) can generate human-like content, raising concerns about the abuse of LLMs. Therefore, it is important to build strong AI-generated text (AIGT) detectors. Current works only consider document-level AIGT detection, therefore, in this paper, we first introduce a sentence-level detection challenge by synthesizing a dataset that contains documents that are polished with LLMs, that is, the documents contain sentences written by humans and sentences modified by LLMs. Then we propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection. These features are composed like waves in speech processing and cannot be studied by LLMs. Therefore, we build SeqXGPT based on convolution and self-attention networks. We test it in both sentence and document-level detection challenges. Experimental results show that previous methods struggle in solving sentence-level AIGT detection, while our method not only significantly surpasses baseline methods in both sentence and document-level detection challenges but also exhibits strong generalization capabilities.

Exploration with Principles for Diverse AI Supervision

  • paper_url: http://arxiv.org/abs/2310.08899
  • repo_url: None
  • paper_authors: Hao Liu, Matei Zaharia, Pieter Abbeel
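  • for: This work aims to reduce the heavy reliance of AI training on human supervision by autonomously generating high-quality training data.
  • methods: Exploratory AI (EAI), inspired by unsupervised reinforcement learning (RL) pretraining, explores the natural language space with two components: an actor that generates novel content following exploration principles, and a critic that uses large language models to assess the novelty of the generated content and offers critiques to guide the actor.
  • results: Empirical evaluations show that EAI significantly boosts model performance on complex reasoning tasks.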
    Abstract Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it heavily leans on human supervision. Even state-of-the-art AI models like ChatGPT depend on fine-tuning through human demonstrations, demanding extensive human input and domain expertise. This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation. To address this limitation, we propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data. Drawing inspiration from unsupervised reinforcement learning (RL) pretraining, EAI achieves exploration within the natural language space. We accomplish this by harnessing large language models to assess the novelty of generated content. Our approach employs two key components: an actor that generates novel content following exploration principles and a critic that evaluates the generated content, offering critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision.

PerturbScore: Connecting Discrete and Continuous Perturbations in NLP

  • paper_url: http://arxiv.org/abs/2310.08889
  • repo_url: https://github.com/renke999/perturbscore
  • paper_authors: Linyang Li, Ke Ren, Yunfan Shao, Pengyu Wang, Xipeng Qiu
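  • for: This paper studies the robustness of NLP models by connecting discrete perturbations with continuous perturbations, using the connection as a bridge to understand discrete perturbations.
  • methods: The authors first explore how to connect and measure the correlation between discrete and continuous perturbations, then design a regression task, PerturbScore, to learn the correlation automatically.
  • results: Experiments show that such a connection can be built and that PerturbScore learns the correlation better than previous methods for measuring discrete perturbations. PerturbScore also generalizes well across datasets and perturbation methods, which suggests it can serve as a tool for studying model robustness in NLP.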
    Abstract With the rapid development of neural network applications in NLP, model robustness problem is gaining more attention. Different from computer vision, the discrete nature of texts makes it more challenging to explore robustness in NLP. Therefore, in this paper, we aim to connect discrete perturbations with continuous perturbations, therefore we can use such connections as a bridge to help understand discrete perturbations in NLP models. Specifically, we first explore how to connect and measure the correlation between discrete perturbations and continuous perturbations. Then we design a regression task as a PerturbScore to learn the correlation automatically. Through experimental results, we find that we can build a connection between discrete and continuous perturbations and use the proposed PerturbScore to learn such correlation, surpassing previous methods used in discrete perturbation measuring. Further, the proposed PerturbScore can be well generalized to different datasets, perturbation methods, indicating that we can use it as a powerful tool to study model robustness in NLP.

InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems

  • paper_url: http://arxiv.org/abs/2310.08885
  • repo_url: None
  • paper_authors: Willy Chung, Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, Pascale Fung
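  • for: This paper presents a zero-shot, end-to-end task-oriented dialogue system (TODS) framework that adapts to diverse domains without task-specific data or fine-tuning.
  • methods: The framework uses large language models (LLMs) to generate a proxy belief state that translates user intentions into dynamic queries for efficient interaction with any knowledge base (KB).
  • results: InstructTODS performs comparably to fully fine-tuned TODS in guiding dialogues to successful completion, without prior knowledge or task-specific data. In human evaluation, its responses notably outperform both the gold responses and state-of-the-art TODS in helpfulness, informativeness, and humanness.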
    Abstract Large language models (LLMs) have been used for diverse tasks in natural language processing (NLP), yet remain under-explored for task-oriented dialogue systems (TODS), especially for end-to-end TODS. We present InstructTODS, a novel off-the-shelf framework for zero-shot end-to-end task-oriented dialogue systems that can adapt to diverse domains without fine-tuning. By leveraging LLMs, InstructTODS generates a proxy belief state that seamlessly translates user intentions into dynamic queries for efficient interaction with any KB. Our extensive experiments demonstrate that InstructTODS achieves comparable performance to fully fine-tuned TODS in guiding dialogues to successful completion without prior knowledge or task-specific data. Furthermore, a rigorous human evaluation of end-to-end TODS shows that InstructTODS produces dialogue responses that notably outperform both the gold responses and the state-of-the-art TODS in terms of helpfulness, informativeness, and humanness. Moreover, the effectiveness of LLMs in TODS is further supported by our comprehensive evaluations on TODS subtasks: dialogue state tracking, intent classification, and response generation. Code and implementations could be found here https://github.com/WillyHC22/InstructTODS/

Retrieval-Generation Alignment for End-to-End Task-Oriented Dialogue System

  • paper_url: http://arxiv.org/abs/2310.08877
  • repo_url: https://github.com/shenwzh3/mk-tod
  • paper_authors: Weizhou Shen, Yingqi Gao, Canbin Huang, Fanqi Wan, Xiaojun Quan, Wei Bi
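  • for: This study aims to train a retriever for a large-scale knowledge base (KB) whose results the response generator of a task-oriented dialogue system can actually exploit.
  • methods: A perceptive retriever is trained with maximal marginal likelihood, using signals from response generation for supervision. Beyond the retrieved entities, various kinds of meta knowledge are passed to the generator to improve knowledge utilization.
  • results: On three task-oriented dialogue datasets with T5 and ChatGPT as backbone models, the response generator combined with meta knowledge makes effective use of high-quality knowledge records from the retriever and produces better responses.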
    Abstract Developing an efficient retriever to retrieve knowledge from a large-scale knowledge base (KB) is critical for task-oriented dialogue systems to effectively handle localized and specialized tasks. However, widely used generative models such as T5 and ChatGPT often struggle to differentiate subtle differences among the retrieved KB records when generating responses, resulting in suboptimal quality of generated responses. In this paper, we propose the application of maximal marginal likelihood to train a perceptive retriever by utilizing signals from response generation for supervision. In addition, our approach goes beyond considering solely retrieved entities and incorporates various meta knowledge to guide the generator, thus improving the utilization of knowledge. We evaluate our approach on three task-oriented dialogue datasets using T5 and ChatGPT as the backbone models. The results demonstrate that when combined with meta knowledge, the response generator can effectively leverage high-quality knowledge records from the retriever and enhance the quality of generated responses. The codes and models of this paper are available at https://github.com/shenwzh3/MK-TOD.
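
Training a retriever with maximal marginal likelihood can be sketched as follows. The sketch assumes the usual formulation in which the retrieved KB record is a latent variable and the loss is the negative log of the response likelihood marginalized over the top-k records. The function names and inputs are illustrative and are not taken from the MK-TOD code.

```python
import math

def log_softmax(scores):
    """Turn retriever scores over k records into log-probabilities."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def mml_loss(retriever_scores, generator_logps):
    """-log sum_k p(record_k | context) * p(response | context, record_k).

    retriever_scores: unnormalized retriever scores for the top-k records.
    generator_logps: log-likelihood of the gold response given each record.
    """
    log_prior = log_softmax(retriever_scores)
    joint = [lp + lg for lp, lg in zip(log_prior, generator_logps)]
    m = max(joint)
    return -(m + math.log(sum(math.exp(j - m) for j in joint)))
```

The loss drops when the retriever puts more probability on records under which the generator finds the gold response likely, which is the supervision signal from generation that the abstract refers to.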

Guiding AMR Parsing with Reverse Graph Linearization

  • paper_url: http://arxiv.org/abs/2310.08860
  • repo_url: https://github.com/pkunlp-icler/amr_reverse_graph_linearization
  • paper_authors: Bofei Gao, Liang Chen, Peiyi Wang, Zhifang Sui, Baobao Chang
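  • for: This paper tackles structure loss accumulation in sequence-to-sequence AMR parsing, where nodes and edges decoded later receive a much lower F1-score than those decoded earlier.
  • methods: The Reverse Graph Linearization (RGL) framework defines both a default and a reverse linearization order of an AMR graph and incorporates the reversed linearization into the original AMR parser through a two-pass self-distillation mechanism.
  • results: The method significantly mitigates structure loss accumulation and outperforms the previous best AMR parsing model by 0.8 and 0.5 Smatch on the AMR 2.0 and AMR 3.0 datasets, respectively.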
    Abstract Abstract Meaning Representation (AMR) parsing aims to extract an abstract semantic graph from a given sentence. The sequence-to-sequence approaches, which linearize the semantic graph into a sequence of nodes and edges and generate the linearized graph directly, have achieved good performance. However, we observed that these approaches suffer from structure loss accumulation during the decoding process, leading to a much lower F1-score for nodes and edges decoded later compared to those decoded earlier. To address this issue, we propose a novel Reverse Graph Linearization (RGL) enhanced framework. RGL defines both default and reverse linearization orders of an AMR graph, where most structures at the back part of the default order appear at the front part of the reversed order and vice versa. RGL incorporates the reversed linearization to the original AMR parser through a two-pass self-distillation mechanism, which guides the model when generating the default linearizations. Our analysis shows that our proposed method significantly mitigates the problem of structure loss accumulation, outperforming the previously best AMR parsing model by 0.8 and 0.5 Smatch scores on the AMR 2.0 and AMR 3.0 dataset, respectively. The code are available at https://github.com/pkunlp-icler/AMR_reverse_graph_linearization.
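
The idea of a default and a reverse linearization order can be sketched on a toy tree-shaped graph: a depth-first traversal that visits children left to right gives one order, and visiting them right to left moves structures from the back to the front. The graph encoding and bracket format are assumptions for illustration; the paper's exact linearization, including its handling of re-entrant nodes, may differ.

```python
def linearize(node, graph, reverse=False):
    """Depth-first linearization of a tree-shaped AMR-like graph.

    graph maps a variable to (concept, [(role, child_variable), ...]).
    """
    concept, edges = graph[node]
    if reverse:
        edges = list(reversed(edges))
    parts = [f"( {node} / {concept}"]
    for role, child in edges:
        parts.append(f"{role} {linearize(child, graph, reverse)}")
    return " ".join(parts) + " )"

# Hypothetical toy graph with a root and two arguments
toy_graph = {
    "w": ("want-01", [(":ARG0", "b"), (":ARG1", "g")]),
    "b": ("boy", []),
    "g": ("go-02", []),
}
```

In the default order `:ARG1` is generated last, while in the reversed order it is generated first, so each structure is decoded early in one of the two passes.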

End-to-end Story Plot Generator

  • paper_url: http://arxiv.org/abs/2310.08796
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Hanlin Zhu, Andrew Cohen, Danqing Wang, Kevin Yang, Xiaomeng Yang, Jiantao Jiao, Yuandong Tian
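  • for: This work targets the automatic generation of story plots, including the premise, character descriptions, and plot outlines.
  • methods: The authors propose three models: $\texttt{OpenPlot}$, $\texttt{E2EPlot}$, and $\texttt{RLPlot}$. $\texttt{OpenPlot}$ replaces expensive OpenAI API calls with LLaMA2 calls via careful prompt design, which allows inexpensive generation of a high-quality training set of story plots. $\texttt{E2EPlot}$ is an end-to-end generator trained by supervised fine-tuning (SFT) on approximately 13000 story plots generated by $\texttt{OpenPlot}$. $\texttt{RLPlot}$ is further fine-tuned with RLHF on several reward models, each targeting a different aspect of story quality.
  • results: $\texttt{E2EPlot}$ produces plots of comparable quality to $\texttt{OpenPlot}$ while being more than 10$\times$ faster, and $\texttt{RLPlot}$ achieves a 60.0% winning rate against $\texttt{E2EPlot}$ on the aspect of suspense and surprise.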
    Abstract Story plots, while short, carry most of the essential information of a full story that may contain tens of thousands of words. We study the problem of automatic generation of story plots, which includes story premise, character descriptions, plot outlines, etc. To generate a single engaging plot, existing plot generators (e.g., DOC (Yang et al., 2022a)) require hundreds to thousands of calls to LLMs (e.g., OpenAI API) in the planning stage of the story plot, which is costly and takes at least several minutes. Moreover, the hard-wired nature of the method makes the pipeline non-differentiable, blocking fast specialization and personalization of the plot generator. In this paper, we propose three models, $\texttt{OpenPlot}$, $\texttt{E2EPlot}$ and $\texttt{RLPlot}$, to address these challenges. $\texttt{OpenPlot}$ replaces expensive OpenAI API calls with LLaMA2 (Touvron et al., 2023) calls via careful prompt designs, which leads to inexpensive generation of high-quality training datasets of story plots. We then train an end-to-end story plot generator, $\texttt{E2EPlot}$, by supervised fine-tuning (SFT) using approximately 13000 story plots generated by $\texttt{OpenPlot}$. $\texttt{E2EPlot}$ generates story plots of comparable quality to $\texttt{OpenPlot}$, and is > 10$\times$ faster (1k tokens in only 30 seconds on average). Finally, we obtain $\texttt{RLPlot}$ that is further fine-tuned with RLHF on several different reward models for different aspects of story quality, which yields 60.0$\%$ winning rate against $\texttt{E2EPlot}$ along the aspect of suspense and surprise.
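    Summary Story plots, though short, carry most of the essential information of a full story. The paper studies automatic generation of story plots, including the story premise, character descriptions, and plot outlines. Existing plot generators (e.g., DOC (Yang et al., 2022a)) need hundreds to thousands of LLM calls (e.g., to the OpenAI API) in the planning stage to produce one engaging plot, which is costly and takes at least several minutes. Their hard-wired design also makes the pipeline non-differentiable, which blocks fast specialization and personalization of the generator. The paper proposes three models to address these challenges: $\texttt{OpenPlot}$, $\texttt{E2EPlot}$ and $\texttt{RLPlot}$. $\texttt{OpenPlot}$ replaces expensive OpenAI API calls with LLaMA2 (Touvron et al., 2023) calls through careful prompt design, so that high-quality training datasets of story plots can be generated cheaply. $\texttt{E2EPlot}$ is an end-to-end story plot generator trained by supervised fine-tuning (SFT) on approximately 13000 story plots generated by $\texttt{OpenPlot}$. It produces plots of comparable quality to $\texttt{OpenPlot}$ and is more than 10$\times$ faster (1k tokens in about 30 seconds on average). Finally, $\texttt{RLPlot}$ is further fine-tuned with RLHF on several reward models for different aspects of story quality, and reaches a 60.0$\%$ winning rate against $\texttt{E2EPlot}$ on the aspect of suspense and surprise.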