cs.CL - 2023-07-30

A Private Watermark for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.16230
  • repo_url: https://github.com/THU-BPM/private_watermark
  • paper_authors: Aiwei Liu, Leyi Pan, Xuming Hu, Shu’ang Li, Lijie Wen, Irwin King, Philip S. Yu
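  • for: protecting text generated by large language models against forgery and copyright infringement
  • methods: uses two different neural networks, one for watermark generation and one for watermark detection, with part of their parameters shared
  • results: achieves high detection accuracy without large parameter counts or heavy computation, while making it difficult to recover the watermark generation rules from the detection network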
    Abstract Recently, text watermarking algorithms for large language models (LLMs) have been mitigating the potential harms of text generated by the LLMs, including fake news and copyright issues. However, the watermark detection of current text algorithms requires the key from the generation process, making them susceptible to breaches and counterfeiting. In this work, we propose the first private watermarking algorithm, which extends the current text watermarking algorithms by using two different neural networks respectively for watermark generation and detection, rather than using the same key at both stages. Meanwhile, part of the parameters of the watermark generation and detection networks are shared, which makes the detection network achieve a high accuracy very efficiently. Experiments show that our algorithm ensures high detection accuracy with minimal impact on generation and detection speed, due to the small parameter size of both networks. Additionally, our subsequent analysis demonstrates the difficulty of reverting the watermark generation rules from the detection network.
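To make the key-sharing problem in the abstract concrete, the following is a minimal sketch of a key-based "green list" watermark, the kind of scheme the paper is described as extending. It is not the paper's two-network method. The hashing scheme, `SECRET_KEY`, `GAMMA`, `VOCAB_SIZE`, and all function names are assumptions made for illustration. The point it shows is that the detector only works when it holds the same key as the generator.

```python
import hashlib
import math
import random

SECRET_KEY = b"demo-key"   # hypothetical key shared by generator and detector
GAMMA = 0.5                # assumed fraction of the vocabulary marked "green"
VOCAB_SIZE = 1000          # toy vocabulary of integer token ids


def is_green(prev_token: int, token: int, key: bytes = SECRET_KEY) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by key and previous token."""
    digest = hashlib.sha256(key + f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def generate_watermarked(length: int, seed: int = 0) -> list[int]:
    """Toy 'generator': sample tokens uniformly but accept only green ones."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    while len(tokens) < length:
        candidate = rng.randrange(VOCAB_SIZE)
        if is_green(tokens[-1], candidate):
            tokens.append(candidate)
    return tokens


def detection_z_score(tokens: list[int], key: bytes = SECRET_KEY) -> float:
    """z-score of the green-token count against the no-watermark expectation."""
    n = len(tokens) - 1
    green = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Calling `detection_z_score(generate_watermarked(200))` gives a large positive score, while passing a different `key` gives a score near zero. In this toy scheme, anyone who can detect can also forge, which is the weakness a private watermark is meant to avoid.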

Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts

  • paper_url: http://arxiv.org/abs/2307.16220
  • repo_url: https://github.com/smartinternz02/SI-GuidedProject-2307-1622049182
  • paper_authors: Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet
  • for: improving the accuracy of Optical Character Recognition (OCR) post-correction for historical documents, specifically Hebrew texts
  • methods: trains a light-weight neural network using significantly less manually created data, by generating language- and task-specific training data for OCR post-correction
  • results: the proposed method outperforms other state-of-the-art neural networks for OCR post-correction and complex spellcheckers; performance depends on the genre and area of the training data
    Abstract Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition. This technology is error-prone, especially for historical documents. To correct OCR errors, post-processing algorithms have been proposed based on natural language analysis and machine learning techniques such as neural networks. Neural network's disadvantage is the vast amount of manually labeled data required for training, which is often unavailable. This paper proposes an innovative method for training a light-weight neural network for Hebrew OCR post-correction using significantly less manually created data. The main research goal is to develop a method for automatically generating language and task-specific training data to improve the neural network results for OCR post-correction, and to investigate which type of dataset is the most effective for OCR post-correction of historical documents. To this end, a series of experiments using several datasets was conducted. The evaluation corpus was based on Hebrew newspapers from the JPress project. An analysis of historical OCRed newspapers was done to learn common language and corpus-specific OCR errors. We found that training the network using the proposed method is more effective than using randomly generated errors. The results also show that the performance of the neural network for OCR post-correction strongly depends on the genre and area of the training data. Moreover, neural networks that were trained with the proposed method outperform other state-of-the-art neural networks for OCR post-correction and complex spellcheckers. These results may have practical implications for many digital humanities projects.
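    Summary Over the past few decades, large archives of paper documents such as books and newspapers have been digitized using Optical Character Recognition (OCR). The technology is error-prone, especially for historical documents. Post-processing algorithms based on natural language analysis and machine learning have been proposed to correct OCR errors, but neural networks require large amounts of manually labeled training data, which is often unavailable. This paper proposes an innovative method for training a light-weight neural network for Hebrew OCR post-correction. The main goal is to develop a way to automatically generate language- and task-specific training data that improves post-correction results, and to investigate which type of dataset is most effective for historical documents. To this end, a series of experiments was run on several datasets, with an evaluation corpus based on Hebrew newspapers from the JPress project. Historical OCRed newspapers were analyzed to learn common language- and corpus-specific OCR errors. Training the network with the proposed method is more effective than training on randomly generated errors. The results also show that post-correction performance depends strongly on the genre and area of the training data. Networks trained with the proposed method outperform other state-of-the-art neural networks and complex spellcheckers. These results may have practical implications for digital humanities projects.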
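The idea of generating training data from corpus-specific OCR errors, rather than from uniformly random errors, can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline. The `CONFUSIONS` table, its weights, and the function names are hypothetical; in practice such a table would be estimated from aligned OCR output and ground truth.

```python
import random

# Hypothetical confusion table: source char -> [(replacement, weight), ...].
# The letter pairs below are visually similar Hebrew letters chosen for illustration.
CONFUSIONS = {
    "ד": [("ר", 0.7), ("ך", 0.3)],
    "ה": [("ח", 0.6), ("ת", 0.4)],
    "ו": [("ז", 0.5), ("ן", 0.5)],
}


def corrupt(text: str, error_rate: float, rng: random.Random) -> str:
    """Replace confusable characters with a substitute drawn from the table."""
    out = []
    for ch in text:
        options = CONFUSIONS.get(ch)
        if options and rng.random() < error_rate:
            replacements, weights = zip(*options)
            out.append(rng.choices(replacements, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


def make_training_pairs(clean_sentences, error_rate=0.1, seed=0):
    """Return (noisy, clean) pairs for training a post-correction model."""
    rng = random.Random(seed)
    return [(corrupt(s, error_rate, rng), s) for s in clean_sentences]
```

A post-correction network would then be trained to map the noisy string back to the clean one. Only substitutions are modeled here; insertions, deletions, and word splits would need additional rules.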

Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

  • paper_url: http://arxiv.org/abs/2307.16213
  • repo_url: None
  • paper_authors: Omri Suissa, Maayan Zhitomirsky-Geffet, Avshalom Elmalech
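  • for: OCR post-correction of historical Hebrew documents
  • methods: a multi-phase method that generates artificial training datasets with OCR errors and optimizes neural network hyperparameters
  • results: an effective neural network for OCR post-correction in Hebrew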
    Abstract Over the past few decades, large archives of paper-based historical documents, such as books and newspapers, have been digitized using the Optical Character Recognition (OCR) technology. Unfortunately, this broadly used technology is error-prone, especially when an OCRed document was written hundreds of years ago. Neural networks have shown great success in solving various text processing tasks, including OCR post-correction. The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets they require to learn from, especially for morphologically-rich languages like Hebrew. Moreover, it is not clear what are the optimal structure and values of hyperparameters (predefined parameters) of neural networks for OCR error correction in Hebrew due to its unique features. Furthermore, languages change across genres and periods. These changes may affect the accuracy of OCR post-correction neural network models. To overcome these challenges, we developed a new multi-phase method for generating artificial training datasets with OCR errors and hyperparameters optimization for building an effective neural network for OCR post-correction in Hebrew.
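    Summary Over the past few decades, large archives of paper-based historical documents such as books and newspapers have been digitized using Optical Character Recognition (OCR). This widely used technology is error-prone, especially when the OCRed document was written hundreds of years ago. Neural networks perform well on many text processing tasks, including OCR post-correction. Their main drawback for historical corpora is the lack of sufficiently large training datasets, especially for morphologically rich languages such as Hebrew. In addition, Hebrew's unique features make it unclear what the optimal structure and hyperparameter values are. Languages also change across genres and periods, which may affect the accuracy of OCR post-correction models. To address these challenges, we developed a new multi-phase method for generating artificial training datasets with OCR errors and optimizing hyperparameters, in order to build an effective neural network for OCR post-correction in Hebrew.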

A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue Information Extraction

  • paper_url: http://arxiv.org/abs/2307.16200
  • repo_url: https://github.com/flyingcat-fa/ktgf
  • paper_authors: Zefa Hu, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu
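  • for: term-status pair extraction from medical dialogues (MD-TSPE), which underpins diagnosis dialogue systems and the automatic scribing of electronic medical records. Work on MD-TSPE has attracted growing attention in recent years, especially after the progress of generative methods. However, these methods output the whole sequence of term-status pairs in one stage and do not integrate prior knowledge, although the task demands a deeper understanding of the relationships between terms and inference of each term's status.
  • methods: proposes a knowledge-enhanced two-stage generative framework (KTGF). With task-specific prompts, a single model completes MD-TSPE in two phases: it first generates all terms, then generates the status of each generated term. The relationships between terms are learned more effectively from a sequence containing only terms, and the knowledge-enhanced prompt in the second phase uses the category and status candidates of each generated term. A special status "not mentioned" makes more terms available and enriches the training data.
  • results: on the Chunyu and CMDD datasets, the proposed method achieves superior results compared with previous state-of-the-art models in both full-training and low-resource settings.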
    Abstract This paper focuses on term-status pair extraction from medical dialogues (MD-TSPE), which is essential in diagnosis dialogue systems and the automatic scribe of electronic medical records (EMRs). In the past few years, works on MD-TSPE have attracted increasing research attention, especially after the remarkable progress made by generative methods. However, these generative methods output a whole sequence consisting of term-status pairs in one stage and ignore integrating prior knowledge, which demands a deeper understanding to model the relationship between terms and infer the status of each term. This paper presents a knowledge-enhanced two-stage generative framework (KTGF) to address the above challenges. Using task-specific prompts, we employ a single model to complete the MD-TSPE through two phases in a unified generative form: we first generate all terms and then generate the status of each generated term. In this way, the relationship between terms can be learned more effectively from the sequence containing only terms in the first phase, and our designed knowledge-enhanced prompt in the second phase can leverage the category and status candidates of the generated term for status generation. Furthermore, our proposed special status "not mentioned" makes more terms available and enriches the training data in the second phase, which is critical in the low-resource setting. The experiments on the Chunyu and CMDD datasets show that the proposed method achieves superior results compared to the state-of-the-art models in the full training and low-resource settings.
    Summary To address these challenges, this paper proposes a knowledge-enhanced two-stage generative framework (KTGF) that uses task-specific prompts to complete MD-TSPE in a unified generative form. The first stage generates all terms, and the second stage generates the status of each generated term. By learning the relationship between terms in the first phase and leveraging category and status candidates in the second phase, our method can generate more accurate term-status pairs. Moreover, our proposed "not mentioned" special status enriches the training data in the second phase, which is critical in low-resource settings. Experiments on the Chunyu and CMDD datasets show that our proposed method outperforms state-of-the-art models in both full training and low-resource settings.
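The two-phase flow described above (terms first, then a status per term using category and status candidates) can be sketched as below. This is a rough sketch under stated assumptions, not the authors' implementation. The prompt wording, the `KNOWLEDGE` table, the `;` separator, the fallback to "not mentioned" for invalid outputs, and `toy_model` are all hypothetical; `generate` stands in for any text-generation callable.

```python
from typing import Callable

# Hypothetical prior knowledge: term -> (category, allowed statuses).
KNOWLEDGE = {
    "fever": ("symptom", ["positive", "negative", "unknown", "not mentioned"]),
    "aspirin": ("medicine", ["taken", "not taken", "not mentioned"]),
}


def extract_pairs(dialogue: str, generate: Callable[[str], str]) -> list[tuple[str, str]]:
    """Run both phases with the same `generate` function (one model, two prompts)."""
    # Phase 1: generate all terms as one sequence.
    term_output = generate(f"[TERMS] Dialogue: {dialogue}\nList all medical terms:")
    terms = [t.strip() for t in term_output.split(";") if t.strip()]

    pairs = []
    for term in terms:
        category, candidates = KNOWLEDGE.get(term, ("unknown", ["not mentioned"]))
        # Phase 2: knowledge-enhanced prompt with category and status candidates.
        prompt = (f"[STATUS] Dialogue: {dialogue}\nTerm: {term} (category: {category})\n"
                  f"Choose one status from: {', '.join(candidates)}:")
        status = generate(prompt).strip()
        # Assumption: outputs outside the candidate set fall back to "not mentioned".
        pairs.append((term, status if status in candidates else "not mentioned"))
    return pairs


def toy_model(prompt: str) -> str:
    """Stand-in for a real generative model; returns canned answers."""
    if prompt.startswith("[TERMS]"):
        return "fever; aspirin"
    return "positive" if "Term: fever" in prompt else "taken"
```

With the stub, `extract_pairs("Patient: I have a fever and took aspirin.", toy_model)` yields one (term, status) pair per generated term. Restricting phase 2 to a candidate list is one simple way to inject the category knowledge the abstract mentions.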

Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation

  • paper_url: http://arxiv.org/abs/2307.16199
  • repo_url: https://github.com/edward-martyr/shanghainese-tts
  • paper_authors: Yuanhao Chen
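  • for: improving how Shanghainese TTS models handle tone sandhi
  • methods: applies word segmentation during text preprocessing to improve tone sandhi in the TTS model; for left-dominant sandhi in particular, a special symbol marks syllables that belong to the same word
  • results: word segmentation improves tone sandhi production in TTS models and captures the tonal features of Shanghainese better; the author expects the work to be a starting point for computational projects on Shanghainese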
    Abstract Tone is a crucial component of the prosody of Shanghainese, a Wu Chinese variety spoken primarily in urban Shanghai. Tone sandhi, which applies to all multi-syllabic words in Shanghainese, then, is key to natural-sounding speech. Unfortunately, recent work on Shanghainese TTS (text-to-speech) such as Apple's VoiceOver has shown poor performance with tone sandhi, especially LD (left-dominant sandhi). Here I show that word segmentation during text preprocessing can improve the quality of tone sandhi production in TTS models. Syllables within the same word are annotated with a special symbol, which serves as a proxy for prosodic information of the domain of LD. Contrary to the common practice of using prosodic annotation mainly for static pauses, this paper demonstrates that prosodic annotation can also be applied to dynamic tonal phenomena. I anticipate this project to be a starting point for bringing formal linguistic accounts of Shanghainese into computational projects. Too long have we been using the Mandarin models to approximate Shanghainese, but it is a different language with its own linguistic features, and its digitisation and revitalisation should be treated as such.
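    Summary Tone is a crucial component of the prosody of Shanghainese, a Wu Chinese variety spoken mainly in urban Shanghai. Tone sandhi is key to natural-sounding speech, yet recent Shanghainese text-to-speech (TTS) systems such as Apple's VoiceOver perform poorly on it, especially left-dominant sandhi (LD). The author shows that word segmentation during text preprocessing improves the quality of tone sandhi in TTS models. Syllables within the same word are annotated with a special symbol that serves as a proxy for prosodic information about the domain of LD. Whereas prosodic annotation is commonly used mainly for static pauses, the paper shows that it can also be applied to dynamic tonal phenomena. The author expects the project to be a starting point for bringing formal linguistic accounts of Shanghainese into computational projects. Mandarin models have long been used to approximate Shanghainese, but it is a different language with its own linguistic features, and its digitisation and revitalisation should be treated accordingly.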
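The preprocessing step described above, marking syllables that belong to the same word so the model can see the sandhi domain, can be sketched in a few lines. The `WORD_JOINER` symbol and the input format (a list of words, each a list of syllables, produced by some upstream segmenter) are assumptions for illustration; the paper's actual symbol and tooling are not specified in this entry.

```python
WORD_JOINER = "-"   # hypothetical symbol meaning "same word as the previous syllable"


def annotate(segmented: list[list[str]]) -> str:
    """Join syllables of one word with WORD_JOINER and separate words with spaces."""
    return " ".join(WORD_JOINER.join(syllables) for syllables in segmented)


def sandhi_domains(annotated: str) -> list[int]:
    """Recover the syllable count of each word (the assumed sandhi domain) from annotated text."""
    return [len(word.split(WORD_JOINER)) for word in annotated.split()]


print(annotate([["上", "海"], ["闲", "话"]]))  # prints: 上-海 闲-话
```

The annotated string is what would be fed to the TTS front end in place of the raw character sequence, so that multi-syllable words are visibly grouped.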

Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.16180
  • repo_url: https://github.com/harderthenharder/transformers_tasks
  • paper_authors: Keyu Pan, Yawen Zeng
  • for: investigating the feasibility of using the Myers-Briggs Type Indicator (MBTI) as an evaluation metric for large language models (LLMs)
  • methods: extensive experiments to explore the personality types of different LLMs, the possibility of changing the personality types by prompt engineering, and the impact of training datasets on the model’s personality
  • results: the study aims to determine whether LLMs with human-like abilities possess human-like personalities, and whether the MBTI can serve as a rough indicator of this similarity
    Abstract The field of large language models (LLMs) has made significant progress, and their knowledge storage capacity is approaching that of human beings. Furthermore, advanced techniques, such as prompt learning and reinforcement learning, are being employed to address ethical concerns and hallucination problems associated with LLMs, bringing them closer to aligning with human values. This situation naturally raises the question of whether LLMs with human-like abilities possess a human-like personality? In this paper, we aim to investigate the feasibility of using the Myers-Briggs Type Indicator (MBTI), a widespread human personality assessment tool, as an evaluation metric for LLMs. Specifically, extensive experiments will be conducted to explore: 1) the personality types of different LLMs, 2) the possibility of changing the personality types by prompt engineering, and 3) How does the training dataset affect the model's personality. Although the MBTI is not a rigorous assessment, it can still reflect the similarity between LLMs and human personality. In practice, the MBTI has the potential to serve as a rough indicator. Our codes are available at https://github.com/HarderThenHarder/transformers_tasks/tree/main/LLM/llms_mbti.
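    Summary The field of large language models (LLMs) has made significant progress, and their knowledge storage capacity is approaching that of humans. Advanced techniques such as prompt learning and reinforcement learning are also being used to address ethical concerns and hallucination, bringing LLMs closer to human values. This naturally raises the question of whether LLMs possess a human-like personality. The paper investigates the feasibility of using the Myers-Briggs Type Indicator (MBTI), a widely used human personality assessment tool, to evaluate LLMs. Specifically, extensive experiments explore: 1) the personality types of different LLMs, 2) whether prompt engineering can change those types, and 3) how the training dataset affects a model's personality. Although the MBTI is not a rigorous assessment, it can still reflect the similarity between LLMs and human personality, and in practice it may serve as a rough indicator. The code is available at https://github.com/HarderThenHarder/transformers_tasks/tree/main/LLM/llms_mbti.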
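As a rough illustration of how questionnaire answers could be turned into a four-letter type, the sketch below tallies answers along the four MBTI dimensions. It assumes each model answer has already been mapped to a pole letter; that mapping, the tie-breaking rule, and the function name are hypothetical and not taken from the paper's code.

```python
from collections import Counter

# Each dimension is a pair of opposing poles.
DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def mbti_type(answers: list[str]) -> str:
    """Map a list of per-question pole letters (e.g. 'E', 'N') to a four-letter type."""
    counts = Counter(answers)
    # Ties resolve to the first pole of each pair; this is an arbitrary choice.
    return "".join(a if counts[a] >= counts[b] else b for a, b in DIMENSIONS)


print(mbti_type(["E", "I", "I", "N", "N", "S", "T", "F", "F", "P"]))  # prints: INFP
```

Comparing the result across prompts or across fine-tuned checkpoints is one simple way to probe the paper's second and third questions.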

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

  • paper_url: http://arxiv.org/abs/2307.16125
  • repo_url: https://github.com/ailab-cvc/seed-bench
  • paper_authors: Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
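  • for: evaluating the generative comprehension of multimodal large language models (MLLMs), as a preliminary step toward a comprehensive assessment of generative models, through a benchmark named SEED-Bench
  • methods: an advanced pipeline, with automatic filtering and manual verification, generates multiple-choice questions covering 12 evaluation dimensions that include comprehension of both the image and video modalities
  • results: 18 models are evaluated across all dimensions, revealing the limitations of existing MLLMs and how they perform on each dimension; the results can guide future research and give the community a platform for assessing and investigating model capability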
    Abstract Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple choice questions with accurate human annotations (6 times larger than existing benchmarks), spanning 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
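Because every question has a single ground-truth option, scoring reduces to per-dimension accuracy with no human or GPT judge in the loop. A minimal sketch of that scoring step follows; the record layout (`dimension`, `answer`, `prediction` keys) and the dimension names in the example are assumptions, not the benchmark's actual file format.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """records: iterable of dicts with 'dimension', 'answer', 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["prediction"] == r["answer"])
    return {dim: correct[dim] / total[dim] for dim in total}


# Hypothetical records; dimension names are placeholders.
example = [
    {"dimension": "scene", "answer": "A", "prediction": "A"},
    {"dimension": "scene", "answer": "C", "prediction": "B"},
    {"dimension": "action", "answer": "D", "prediction": "D"},
]
print(accuracy_by_dimension(example))  # prints: {'scene': 0.5, 'action': 1.0}
```

How a model's free-form output is mapped to an option letter is outside this sketch and would need to follow the benchmark's own protocol.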

Proposing a conceptual framework: social media listening for public health behavior

  • paper_url: http://arxiv.org/abs/2308.02037
  • repo_url: None
  • paper_authors: Shu-Feng Tsao, Helen Chen, Samantha Meyer, Zahid A. Butt
  • for: This study aims to propose a novel conceptual framework for misinformation research using social media data and natural language processing techniques, with a focus on understanding public discourse on social media and its impact on public health behavior.
  • methods: The study uses a literature review to analyze and critique existing theories and models used in COVID-19 related studies, and proposes a new conceptual framework that integrates important attributes of existing theories and adds new attributes. The proposed framework is demonstrated through a case study of the Freedom Convoy social media listening.
  • results: The proposed conceptual framework can be used to better understand public discourse on social media and its impact on public health behavior, and can be integrated with other data analyses to gather a more comprehensive picture. The framework is flexible and can be revised and adopted as health misinformation evolves.
    Abstract Existing communications and behavioral theories have been adopted to address health misinformation. Although various theories and models have been used to investigate the COVID-19 pandemic, there is no framework specially designed for social listening or misinformation studies using social media data and natural language processing techniques. This study aimed to propose a novel yet theory-based conceptual framework for misinformation research. We collected theories and models used in COVID-19 related studies published in peer-reviewed journals. The theories and models ranged from health behaviors, communications, to misinformation. They are analyzed and critiqued for their components, followed by proposing a conceptual framework with a demonstration. We reviewed Health Belief Model, Theory of Planned Behavior/Reasoned Action, Communication for Behavioral Impact, Transtheoretical Model, Uses and Gratifications Theory, Social Judgment Theory, Risk Information Seeking and Processing Model, Behavioral and Social Drivers, and Hype Loop. Accordingly, we proposed the Social Media Listening for Public Health Behavior Conceptual Framework by not only integrating important attributes of existing theories, but also adding new attributes. The proposed conceptual framework was demonstrated in the Freedom Convoy social media listening. The proposed conceptual framework can be used to better understand public discourse on social media, and it can be integrated with other data analyses to gather a more comprehensive picture. The framework will continue to be revised and adopted as health misinformation evolves.
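    Summary Existing communication and behavioral theories have been applied to health misinformation, but no framework has been designed specifically for social listening or misinformation studies that use social media data and natural language processing techniques. This study proposes a novel yet theory-based conceptual framework for misinformation research. We collected the theories and models used in COVID-19 related studies published in peer-reviewed journals, ranging from health behavior and communication to misinformation. We analyzed and critiqued their components, then proposed a conceptual framework with a demonstration. We reviewed the Health Belief Model, Theory of Planned Behavior/Reasoned Action, Communication for Behavioral Impact, Transtheoretical Model, Uses and Gratifications Theory, Social Judgment Theory, Risk Information Seeking and Processing Model, Behavioral and Social Drivers, and Hype Loop. Accordingly, we proposed the Social Media Listening for Public Health Behavior Conceptual Framework, which integrates important attributes of existing theories and adds new ones. The framework was demonstrated on social media listening for the Freedom Convoy. It can be used to better understand public discourse on social media and can be combined with other data analyses for a more comprehensive picture. The framework will continue to be revised and adopted as health misinformation evolves.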

Roll Up Your Sleeves: Working with a Collaborative and Engaging Task-Oriented Dialogue System

  • paper_url: http://arxiv.org/abs/2307.16081
  • repo_url: None
  • paper_authors: Lingbo Mo, Shijie Chen, Ziru Chen, Xiang Deng, Ashley Lewis, Sunit Singh, Samuel Stevens, Chang-You Tai, Zhen Wang, Xiang Yue, Tianshu Zhang, Yu Su, Huan Sun
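  • for: developing a user-centered task-oriented dialogue system that helps users complete complex real-world tasks
  • methods: combines language understanding, dialogue management, and response generation components with a robust search engine to provide efficient task assistance; to enhance the dialogue experience, explores a series of data augmentation strategies that use LLMs to train advanced neural models
  • results: TacoBot builds on the team's successful participation in the Alexa Prize TaskBot Challenge, covering cooking and how-to tasks, and is offered as an open-source framework for deploying task-oriented dialogue systems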
    Abstract We introduce TacoBot, a user-centered task-oriented digital assistant designed to guide users through complex real-world tasks with multiple steps. Covering a wide range of cooking and how-to tasks, we aim to deliver a collaborative and engaging dialogue experience. Equipped with language understanding, dialogue management, and response generation components supported by a robust search engine, TacoBot ensures efficient task assistance. To enhance the dialogue experience, we explore a series of data augmentation strategies using LLMs to train advanced neural models continuously. TacoBot builds upon our successful participation in the inaugural Alexa Prize TaskBot Challenge, where our team secured third place among ten competing teams. We offer TacoBot as an open-source framework that serves as a practical example for deploying task-oriented dialogue systems.
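    Summary We introduce TacoBot, a user-centered task-oriented digital assistant designed to guide users through complex, multi-step real-world tasks. It covers a wide range of cooking and how-to tasks, and we aim to deliver a collaborative and engaging dialogue experience. TacoBot combines language understanding, dialogue management, and response generation components, supported by a robust search engine. To enhance the dialogue experience, we explore a series of data augmentation strategies that use LLMs to train advanced neural models. TacoBot builds on our successful participation in the Alexa Prize TaskBot Challenge, where our team placed third among ten teams. We offer TacoBot as an open-source framework that serves as a practical example of deploying task-oriented dialogue systems.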

ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus

  • paper_url: http://arxiv.org/abs/2307.16071
  • repo_url: None
  • paper_authors: Tolulope Ogunremi, Kola Tubosun, Anuoluwapo Aremu, Iroro Orife, David Ifeoluwa Adelani
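  • for: a dataset that increases the amount of high-quality, freely available contemporary Yorùbá speech
  • methods: curates text sentences from the news and creative writing domains and has multiple speakers record each sentence
  • results: a dataset of 38.5 hours in total, recorded by 80 volunteers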
    Abstract We introduce the ÌròyìnSpeech corpus -- a new dataset influenced by a desire to increase the amount of high quality, freely available, contemporary Yorùbá speech. We release a multi-purpose dataset that can be used for both TTS and ASR tasks. We curated text sentences from the news and creative writing domains under an open license i.e., CC-BY-4.0 and had multiple speakers record each sentence. We provide 5000 of our utterances to the Common Voice platform to crowdsource transcriptions online. The dataset has 38.5 hours of data in total, recorded by 80 volunteers.
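    Summary We introduce the ÌròyìnSpeech corpus, a new dataset motivated by the desire to increase the amount of high-quality, freely available contemporary Yorùbá speech. We release a multi-purpose dataset that can be used for both TTS and ASR tasks. We curated text sentences from the news and creative writing domains under the open CC-BY-4.0 license and had multiple speakers record each sentence. We provide 5000 utterances to the Common Voice platform to crowdsource transcriptions online. The dataset contains 38.5 hours of data in total, recorded by 80 volunteers.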

Automatic Extraction of the Romanian Academic Word List: Data and Methods

  • paper_url: http://arxiv.org/abs/2307.16045
  • repo_url: https://github.com/bucuram/ro-awl
  • paper_authors: Ana-Maria Bucur, Andreea Dincă, Mădălina Chitez, Roxana Rogobete
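  • for: presenting the methodology and data for automatically extracting the Romanian Academic Word List (Ro-AWL)
  • methods: combines methods from corpus and computational linguistics with L2 academic writing approaches to generate Ro-AWL
  • results: by combining two types of data, the existing Romanian Frequency List and the self-compiled expert academic writing corpus EXPRES, the authors generate Ro-AWL; the distribution of its features (general distribution, part-of-speech distribution) is in line with previous research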
    Abstract This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic Word Lists are useful in both L2 and L1 teaching contexts. For the Romanian language, no such resource exists so far. Ro-AWL has been generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. We use two types of data: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. For constructing the academic word list, we follow the methodology for building the Academic Vocabulary List for the English language. The distribution of Ro-AWL features (general distribution, POS distribution) into four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for teaching, research and NLP applications.
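    Summary This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic word lists are useful in both L2 and L1 teaching contexts, and no such resource has existed for Romanian so far. Ro-AWL was generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. Two types of data are used: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. The list is built following the methodology of the Academic Vocabulary List for English. The distribution of Ro-AWL features (general distribution, POS distribution) across four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for teaching, research, and NLP applications.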
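One common way to build an academic word list from a general frequency list and a discipline-split academic corpus is to keep words that are over-represented in academic text and that occur across several disciplines. The sketch below illustrates that idea only; the thresholds `min_ratio` and `min_disciplines`, the add-one smoothing, and the function name are assumptions, and the actual Ro-AWL criteria may differ.

```python
from collections import Counter


def academic_word_list(academic_by_discipline, general_counts, min_ratio=1.5, min_disciplines=3):
    """Keep words over-represented in academic text and spread across disciplines."""
    academic_total = Counter()
    for counts in academic_by_discipline.values():
        academic_total.update(counts)
    n_academic = sum(academic_total.values())
    n_general = sum(general_counts.values())

    selected = []
    for word, freq in academic_total.items():
        rel_academic = freq / n_academic
        # Add-one smoothing so words absent from the general list do not divide by zero.
        rel_general = (general_counts.get(word, 0) + 1) / (n_general + 1)
        spread = sum(1 for counts in academic_by_discipline.values() if counts.get(word, 0) > 0)
        if rel_academic / rel_general >= min_ratio and spread >= min_disciplines:
            selected.append(word)
    return sorted(selected)
```

Here `academic_by_discipline` maps a discipline name to a word-count dictionary and `general_counts` is a word-count dictionary for the general corpus. A function word such as "și" fails the ratio test, while a word confined to one discipline fails the spread test.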

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2307.16039
  • repo_url: https://github.com/nlp-uoregon/okapi
  • paper_authors: Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
  • for: The paper is written to explore instruction tuning for large language models (LLMs) in multiple languages, with a focus on reinforcement learning from human feedback (RLHF) as an alternative approach to supervised fine-tuning (SFT).
  • methods: The paper uses RLHF to instruction-tune LLMs for multiple languages, introducing instruction and response-ranked data in 26 diverse languages to facilitate the experiments.
  • results: The paper demonstrates the advantages of RLHF for multilingual instruction over SFT for different base models and datasets, and releases the framework and resources at https://github.com/nlp-uoregon/Okapi.
    Abstract A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning characterize supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca, Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, thus hindering their impacts and accessibility to many other languages in the world. Among a few very recent work to explore instruction tuning for LLMs in multiple languages, SFT has been used as the only approach to instruction-tune LLMs for multiple languages. This has left a significant gap for fine-tuned LLMs based on RLHF in diverse languages and raised important questions on how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction over SFT for different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi.
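    Summary Instruction tuning is a key technology in the development of large language models (LLMs): it helps align model responses with human expectations and yields impressive learning abilities. The two major approaches are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). To make LLMs more accessible, various instruction-tuned open-source LLMs, such as Alpaca and Vicuna, have been released. However, existing open-source LLMs have been instruction-tuned only for English and a few popular languages, which limits their impact and accessibility for many other languages. The few very recent studies on instruction tuning in multiple languages use SFT as the only approach. This leaves a significant gap for RLHF-based LLMs in diverse languages and raises the question of how RLHF can boost multilingual instruction tuning. To address this, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to support experiments and future multilingual LLM research. We also present benchmark datasets for evaluating generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF over SFT for multilingual instruction tuning across different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi.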
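Response-ranked data is typically consumed by a reward model trained on pairwise preferences. The sketch below shows how one ranking could be expanded into preference pairs and scored with the negative log-sigmoid loss that is widely used for RLHF reward models. Whether Okapi uses exactly this loss is an assumption; the function names are hypothetical and the rewards are plain floats rather than model outputs.

```python
import math
from itertools import combinations


def ranked_to_pairs(responses_best_first):
    """Turn one ranking (best first) into (preferred, rejected) pairs."""
    return list(combinations(responses_best_first, 2))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-sigmoid of the reward margin; small when the preferred response scores higher."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))


print(ranked_to_pairs(["a", "b", "c"]))  # prints: [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

A ranking of k responses yields k(k-1)/2 pairs, so ranked data in 26 languages provides many training comparisons per prompt. In a real setup the two rewards would come from a reward model and the loss would be averaged over a batch.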