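results: The authors compiled safety meta-labels for 30,207 question-answer pairs and 30,144 pairs of expert comparison data, and applied BeaverTails to content moderation and reinforcement learning with human feedback (RLHF), demonstrating its potential value for practical safety measures in LLMs.
Abstract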
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails.
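Summary
In this paper, we introduce the BeaverTails dataset, intended to advance research on the safety alignment of large language models (LLMs). The dataset uniquely separates the helpfulness and harmlessness annotations of question-answer pairs, and so offers distinct perspectives on the two attributes. In total, we compiled safety meta-labels for 30,207 question-answer pairs and collected 30,144 pairs of expert comparison data for evaluation on the helpfulness and harmlessness metrics. We also demonstrate applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), underlining its potential value for practical safety measures in LLMs. We believe the dataset offers researchers an important resource for the safe development and deployment of LLMs. Project page: https://sites.google.com/view/pku-beavertails.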
Measuring Lexical Diversity in Texts: The Twofold Length Problem
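results: An analysis of three datasets of English language learners' texts shows that indices which reduce all texts to the same length solve the length-dependency problem, but none of them solves the second problem, their sensitivity to the parameter that sets that length.
Abstract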
The impact of text length on the estimation of lexical diversity has captured the attention of the scientific community for more than a century. Numerous indices have been proposed, and many studies have been conducted to evaluate them, but the problem remains. This methodological review provides a critical analysis not only of the most commonly used indices in language learning studies, but also of the length problem itself, as well as of the methodology for evaluating the proposed solutions. The analysis of three datasets of English language-learners' texts revealed that indices that reduce all texts to the same length using a probabilistic or an algorithmic approach solve the length dependency problem; however, all these indices failed to address the second problem, which is their sensitivity to the parameter that determines the length to which the texts are reduced. The paper concludes with recommendations for optimizing lexical diversity analysis.
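Summary
The effect of text length on estimates of lexical diversity has held the attention of the scientific community for more than a century. Many indices have been proposed, yet the problem remains. This methodological review critically analyzes not only the indices most commonly used in language learning studies, but also the length problem itself and the methodology used to evaluate proposed solutions. An analysis of three datasets of English learners' texts shows that reducing all texts to the same length with a probabilistic or algorithmic approach solves the length-dependency problem, but none of these indices solves the second problem, namely their sensitivity to the parameter that determines the target length. The paper closes with recommendations for optimizing lexical diversity analysis.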
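The twofold length problem can be illustrated with a minimal Python sketch. It is not one of the indices evaluated in the paper: `ttr` is the plain type-token ratio, and `mean_sampled_ttr` is a hypothetical helper that approximates a fixed-length index by averaging the ratio over random samples of `sample_size` tokens. The toy texts and the parameter values are assumptions chosen only for illustration.

```python
import random


def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mean_sampled_ttr(tokens, sample_size, draws=200, seed=0):
    """Average TTR over random samples of a fixed size, drawn without replacement.

    Illustrative stand-in for an index that reduces every text to one length.
    """
    if sample_size > len(tokens):
        raise ValueError("sample_size exceeds text length")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += ttr(rng.sample(tokens, sample_size))
    return total / draws


# Toy data (assumption): the long text reuses the same vocabulary five times.
short_text = "the cat sat on the mat and the dog sat too".split()
long_text = short_text * 5

# First problem: the raw ratio falls as the text gets longer.
print(round(ttr(short_text), 3))  # 0.727
print(round(ttr(long_text), 3))   # 0.145

# Second problem: the fixed-length score still depends on the chosen length.
print(mean_sampled_ttr(long_text, sample_size=5))
print(mean_sampled_ttr(long_text, sample_size=10))
```

Reducing both texts to the same sample size makes their scores comparable, but the two calls on `long_text` return different values, which is the parameter sensitivity the abstract describes.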
On the Computational Modeling of Meaning: Embodied Cognition Intertwined with Emotion
results: The paper sets out requirements that a language-learning agent needs to meet, along with recommendations for future language models.
Abstract
This document chronicles this author's attempt to explore how words come to mean what they do, with a particular focus on child language acquisition and what that means for models of language understanding.\footnote{I say \emph{historical} because I synthesize the ideas based on when I discovered them and how those ideas influenced my later thinking.} I explain the setting for child language learning, how embodiment -- being able to perceive and enact in the world, including knowledge of concrete and abstract concepts -- is crucial, and how emotion and cognition relate to each other and the language learning process. I end with what I think are some of the requirements for a language-learning agent that learns language in a setting similar to that of children. This paper can act as a potential guide for ongoing and future work in modeling language.
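Summary
This document records the author's exploration of how words acquire meaning, with a focus on child language acquisition and its implications for models of language understanding. It explains the setting in which children learn language and why embodiment matters, and discusses how emotion and cognition relate to each other and to the language learning process. The author closes with requirements for an agent that learns language in a child-like setting. The document can serve as a guide for future work on modeling language.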
Detecting LLM-Generated Text in Computing Education: A Comparative Study for ChatGPT Cases
paper_authors: Michael Sheinman Orenstrakh, Oscar Karnalim, Carlos Anibal Suarez, Michael Liut
for: This paper aims to evaluate the effectiveness of eight publicly-available LLM-generated text detectors in detecting LLM-generated text in computer science submissions.
methods: The authors collected 124 submissions from computer science students and generated 40 ChatGPT submissions to evaluate the eight LLM-generated text detectors using accuracy, false positives, and resilience measures.
results: The results show that CopyLeaks is the most accurate LLM-generated text detector, GPTKit is the best LLM-generated text detector to reduce false positives, and GLTR is the most resilient LLM-generated text detector. However, the authors also note that all LLM-generated text detectors are less accurate with code, other languages, and after the use of paraphrasing tools (such as QuillBot).
Abstract
Due to the recent improvements and wide availability of Large Language Models (LLMs), they have posed a serious threat to academic integrity in education. Modern LLM-generated text detectors attempt to combat the problem by offering educators services to assess whether some text is LLM-generated. In this work, we have collected 124 submissions from computer science students before the creation of ChatGPT. We then generated 40 ChatGPT submissions. We used this data to evaluate eight publicly-available LLM-generated text detectors through the measures of accuracy, false positives, and resilience. The purpose of this work is to inform the community of which LLM-generated text detectors work and which do not, but also to provide insights for educators to better maintain academic integrity in their courses. Our results find that CopyLeaks is the most accurate LLM-generated text detector, GPTKit is the best LLM-generated text detector to reduce false positives, and GLTR is the most resilient LLM-generated text detector. We also express concerns over 52 false positives (of 114 human-written submissions) generated by GPTZero. Finally, we note that all LLM-generated text detectors are less accurate with code, other languages (aside from English), and after the use of paraphrasing tools (like QuillBot). Modern detectors are still in need of improvements so that they can offer a foolproof solution to help maintain academic integrity. Further, their usability can be improved by facilitating a smooth API integration, providing clear documentation of their features and the understandability of their model(s), and supporting more commonly used languages.
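Summary
Recent improvements in large language models (LLMs) and their wide availability pose a serious threat to academic integrity in education. Modern LLM-generated text detectors try to address this by offering educators services that assess whether a text is LLM-generated. In this work, we collected 124 submissions from computer science students and then generated 40 ChatGPT submissions. We used this data to evaluate eight publicly available LLM-generated text detectors on three measures: accuracy, false positives, and resilience. The aim is to show which detectors work well and which need improvement, and so help educators maintain academic integrity. Our results show that CopyLeaks is the most accurate detector, GPTKit is the best choice for reducing false positives, and GLTR is the most resilient. We also found that GPTZero produced 52 false positives on 114 human-written submissions. Finally, all detectors are less accurate on code, on languages other than English, and on text processed with paraphrasing tools (such as QuillBot). Modern detectors still need improvement before they can offer a foolproof solution, and their usability could be improved through simple API integration, clear documentation of features and models, and support for more commonly used languages.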
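The GPTZero figure quoted above (52 false positives among 114 human-written submissions) can be turned into a rate with a short Python sketch. The function name `false_positive_rate` is an assumption introduced here for illustration; only the two counts come from the abstract.

```python
def false_positive_rate(false_positives, human_written_total):
    """Share of human-written submissions wrongly flagged as LLM-generated."""
    if human_written_total <= 0:
        raise ValueError("human_written_total must be positive")
    return false_positives / human_written_total


# Counts taken from the abstract: 52 false positives of 114 human-written texts.
gptzero_fpr = false_positive_rate(52, 114)
print(f"{gptzero_fpr:.1%}")  # 45.6%
```

Under this reading, close to half of the human-written submissions were flagged, which is why the authors single the result out.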
Enhancing Biomedical Text Summarization and Question-Answering: On the Utility of Domain-Specific Pre-Training
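results: The results indicate that a large language model without domain-specific pre-training can have a significant edge in some domain-specific biomedical text generation tasks.
Abstract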
Biomedical summarization requires large datasets to train for text generation. We show that while transfer learning offers a viable option for addressing this challenge, an in-domain pre-training does not always offer advantages in a BioASQ summarization task. We identify a suitable model architecture and use it to show a benefit of a general-domain pre-training followed by a task-specific fine-tuning in the context of a BioASQ summarization task, leading to a novel three-step fine-tuning approach that works with only a thousand in-domain examples. Our results indicate that a Large Language Model without domain-specific pre-training can have a significant edge in some domain-specific biomedical text generation tasks.
TIM: Teaching Large Language Models to Translate with Comparison
paper_authors: Jiali Zeng, Fandong Meng, Yongjing Yin, Jie Zhou
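for: Improve the performance of large language models (LLMs) on translation tasks.
methods: Use comparison examples to teach LLMs to translate.
results: Learning to translate from comparison examples outperforms existing methods and improves LLM performance on translation tasks.
Abstract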
Open-sourced large language models (LLMs) have demonstrated remarkable efficacy in various tasks with instruction tuning. However, these models can sometimes struggle with tasks that require more specialized knowledge such as translation. One possible reason for such deficiency is that instruction tuning aims to generate fluent and coherent text that continues from a given instruction without being constrained by any task-specific requirements. Moreover, it can be more challenging for tuning smaller LLMs with lower-quality training data. To address this issue, we propose a novel framework using examples in comparison to teach LLMs to learn translation. Our approach involves presenting the model with examples of correct and incorrect translations and using a preference loss to guide the model's learning. We evaluate our method on WMT2022 test sets and show that it outperforms existing methods. Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations. Please refer to Github for more details: https://github.com/lemon0830/TIM.
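The abstract mentions a preference loss over correct and incorrect translations without giving its form. As a rough illustration only, the Python sketch below uses a generic pairwise (Bradley-Terry style) loss, the negative log-sigmoid of the score margin. This is an assumption, not necessarily the loss used in TIM, and the scalar scores stand in for whatever the model assigns to each translation.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Generic pairwise preference loss: -log(sigmoid(s_pref - s_rej)).

    Written as log(1 + exp(-margin)) and split by sign so that large
    margins in either direction do not overflow.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for a correct and an incorrect translation.
print(pairwise_preference_loss(2.0, 0.0))  # small loss: correct one scored higher
print(pairwise_preference_loss(0.0, 2.0))  # larger loss: ranking is wrong
```

The loss shrinks as the correct translation is scored further above the incorrect one, which is the behavior a preference objective is meant to encourage.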
Enhancing Cross-lingual Transfer via Phonemic Transcription Integration
results: Experiments show that PhoneXL improves cross-lingual transfer, particularly among the CJKV languages. On two token-level tasks, Named Entity Recognition and Part-of-Speech Tagging, PhoneXL achieves consistent improvements over orthographic-based multilingual PLMs.
Abstract
Previous cross-lingual transfer methods are restricted to orthographic representation learning via textual scripts. This limitation hampers cross-lingual transfer and is biased towards languages sharing similar well-known scripts. To alleviate the gap between languages from different writing scripts, we propose PhoneXL, a framework incorporating phonemic transcriptions as an additional linguistic modality beyond the traditional orthographic transcriptions for cross-lingual transfer. Particularly, we propose unsupervised alignment objectives to capture (1) local one-to-one alignment between the two different modalities, (2) alignment via multi-modality contexts to leverage information from additional modalities, and (3) alignment via multilingual contexts where additional bilingual dictionaries are incorporated. We also release the first phonemic-orthographic alignment dataset on two token-level tasks (Named Entity Recognition and Part-of-Speech Tagging) among the understudied but interconnected Chinese-Japanese-Korean-Vietnamese (CJKV) languages. Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer and bridge the gap among CJKV languages, leading to consistent improvements on cross-lingual token-level tasks over orthographic-based multilingual PLMs.
Summary
The proposed unsupervised alignment objectives capture:
1. Local one-to-one alignment between the two modalities
2. Alignment via multi-modality contexts to leverage information from additional modalities
3. Alignment via multilingual contexts using bilingual dictionaries
We also release the first phonemic-orthographic alignment dataset on two token-level tasks (Named Entity Recognition and Part-of-Speech Tagging) among the understudied but interconnected Chinese-Japanese-Korean-Vietnamese (CJKV) languages. Our pilot study shows that phonemic transcriptions provide essential information beyond the orthography to enhance cross-lingual transfer and bridge the gap among CJKV languages, leading to consistent improvements on cross-lingual token-level tasks over orthographic-based multilingual PLMs.
Event Extraction as Question Generation and Answering
results: Experiments show that QGA-EE outperforms prior single-task-based models on the ACE05 English dataset, suggesting that the proposed method improves event extraction.
Abstract
Recent work on Event Extraction has reframed the task as Question Answering (QA), with promising results. The advantage of this approach is that it addresses the error propagation issue found in traditional token-based classification approaches by directly predicting event arguments without extracting candidates first. However, the questions are typically based on fixed templates and they rarely leverage contextual information such as relevant arguments. In addition, prior QA-based approaches have difficulty handling cases where there are multiple arguments for the same role. In this paper, we propose QGA-EE, which enables a Question Generation (QG) model to generate questions that incorporate rich contextual information instead of using fixed templates. We also propose dynamic templates to assist the training of the QG model. Experiments show that QGA-EE outperforms all prior single-task-based models on the ACE05 English dataset.
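Summary
Recent work on Event Extraction has reframed the task as Question Answering (QA), with good results. This approach predicts event arguments directly instead of first extracting candidates. However, the questions are usually based on fixed templates and rarely use contextual information such as relevant arguments. Prior QA-based approaches also struggle when one role has multiple arguments. In this paper, we propose QGA-EE, which uses a Question Generation (QG) model to generate questions with rich contextual information instead of fixed templates. We also propose dynamic templates to assist the training of the QG model. Experiments show that QGA-EE outperforms all prior single-task-based models on the ACE05 English dataset.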
HistRED: A Historical Document-Level Relation Extraction Dataset
for: This study aims to promote historical relation extraction (RE) research and to explore potential applications in historical data.
methods: The study uses the HistRED dataset, constructed from Yeonhaengnok, which provides bilingual annotations on Hanja and Korean texts to support historical RE.
results: The study proposes a bilingual RE model that uses both Korean and Hanja contexts to predict relations between entities. The model outperforms monolingual baselines on HistRED, showing that multiple language contexts supplement RE predictions.
Abstract
Despite the extensive applications of relation extraction (RE) tasks in various domains, little has been explored in the historical context, which contains promising data across hundreds and thousands of years. To promote the historical RE research, we present HistRED constructed from Yeonhaengnok. Yeonhaengnok is a collection of records originally written in Hanja, the classical Chinese writing, which has later been translated into Korean. HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts. In addition, HistRED supports various self-contained subtexts with different lengths, from a sentence level to a document level, supporting diverse context settings for researchers to evaluate the robustness of their RE models. To demonstrate the usefulness of our dataset, we propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities. Our model outperforms monolingual baselines on HistRED, showing that employing multiple language contexts supplements the RE predictions. The dataset is publicly available at: https://huggingface.co/datasets/Soyoung/HistRED under CC BY-NC-ND 4.0 license.
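Summary
Although relation extraction (RE) is widely applied across domains, it has been little explored in the historical context. To promote historical RE research, we present HistRED, constructed from Yeonhaengnok, a collection of records originally written in Hanja and later translated into Korean. HistRED provides bilingual annotations so that RE can be performed on both Korean and Hanja texts. It also supports self-contained subtexts of different lengths, from sentence level to document level, giving researchers diverse context settings for evaluating the robustness of their RE models. To demonstrate the dataset's usefulness, we propose a bilingual RE model that uses both Korean and Hanja contexts to predict relations between entities. Our model outperforms monolingual baselines on HistRED, showing that multiple language contexts supplement RE predictions. The dataset is available at https://huggingface.co/datasets/Soyoung/HistRED under the CC BY-NC-ND 4.0 license.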
Automated Essay Scoring in Argumentative Writing: DeBERTeachingAssistant
paper_authors: Yann Hicke, Tonghua Tian, Karan Jha, Choong Hee Kim
for: This paper aims to improve the assessment of argumentative writing by developing a transformer-based architecture that can annotate discourse elements for their persuasiveness quality.
methods: The proposed method uses a transformer-based architecture to analyze argumentative writing and provide annotations for the persuasiveness quality of various discourse elements.
results: The proposed method achieved above-human accuracy in annotating argumentative writing discourse elements for their persuasiveness quality.
Abstract
Automated Essay scoring has been explored as a research and industry problem for over 50 years. It has drawn a lot of attention from the NLP community because of its clear educational value as a research area that can engender the creation of valuable time-saving tools for educators around the world. Yet, these tools are generally focused on detecting good grammar, spelling mistakes, and organization quality but tend to fail at incorporating persuasiveness features in their final assessment. The responsibility to give actionable feedback to the student to improve the strength of their arguments is left solely on the teacher's shoulders. In this work, we present a transformer-based architecture capable of achieving above-human accuracy in annotating argumentative writing discourse elements for their persuasiveness quality and we expand on planned future work investigating the explainability of our model so that actionable feedback can be offered to the student and thus potentially enable a partnership between the teacher's advice and the machine's advice.
Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt Augmentation and Text-To-Image Diffusion
results: Experiments indicate that Augment-CLIP and SD Sampling can improve image-text matching and reduce the many-to-many problem.
Abstract
This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation (VWSD) Task in English. Our preliminary study shows that the simple approach of matching candidate images with the phrase using CLIP suffers from the many-to-many nature of image-text pairs. We find that the CLIP text encoder may have limited abilities in capturing the compositionality in natural language. Conversely, the descriptive focus of the phrase varies from instance to instance. We address these issues in our two systems, Augment-CLIP and Stable Diffusion Sampling (SD Sampling). Augment-CLIP augments the text prompt by generating sentences that contain the context phrase with the help of large language models (LLMs). We further explore CLIP models in other languages, as an ambiguous word may be translated into an unambiguous one in the other language. SD Sampling uses text-to-image Stable Diffusion to generate multiple images from the given phrase, increasing the likelihood that a subset of images match the one paired with the text.
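The idea of comparing each candidate image against several references (augmented prompts or sampled images) can be sketched in Python as follows. Everything here is an assumption for illustration: the embeddings are toy two-dimensional vectors rather than CLIP outputs, `rank_candidates` is a hypothetical helper, and taking the maximum similarity is only one possible way to aggregate over references. The abstract does not state which aggregation the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(candidate_embs, reference_embs):
    """Return candidate indices, best first.

    Each candidate is scored by its highest cosine similarity to any
    reference embedding (assumed aggregation, chosen for illustration).
    """
    scores = [max(cosine(c, r) for r in reference_embs) for c in candidate_embs]
    return sorted(range(len(candidate_embs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumption): three candidate images, two references.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
references = [[0.9, 0.1], [0.8, 0.3]]
print(rank_candidates(candidates, references))  # [0, 2, 1]
```

Using several references gives a candidate more than one chance to match, which mirrors the motivation given for generating multiple sentences or images.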
Assessing the efficacy of large language models in generating accurate teacher responses
results: The study finds that GPT-4 outperforms the fine-tuned models on the Teacher-Student Chatroom Corpus subset, as measured by BERTScore and DialogRPT. It also finds that dataset characteristics such as sampling, representativeness, and dialog completeness can pose challenges to fine-tuning and contribute to the poor generalizability of the fine-tuned models.
Abstract
(Tack et al., 2023) organized the shared task hosted by the 18th Workshop on Innovative Use of NLP for Building Educational Applications on generation of teacher language in educational dialogues. Following the structure of the shared task, in this study, we attempt to assess the generative abilities of large language models in providing informative and helpful insights to students, thereby simulating the role of a knowledgeable teacher. To this end, we present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we fine-tuned the Flan-T5 model using reinforcement learning. Our experimental findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT. We hypothesize that several dataset characteristics, including sampling, representativeness, and dialog completeness, pose significant challenges to fine-tuning, thus contributing to the poor generalizability of the fine-tuned models. Finally, we note the need for these generative models to be evaluated with a metric that relies not only on dialog coherence and matched language modeling distribution but also on the model's ability to showcase pedagogical skills.
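Summary
We hypothesize that dataset characteristics, including sampling, representativeness, and dialog completeness, pose significant challenges to fine-tuning and hurt the generalizability of the fine-tuned models. Finally, we note that evaluating these generative models requires a metric that considers not only dialog coherence and matched language modeling distribution but also the model's ability to show pedagogical skills.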
Automatic Coding at Scale: Design and Deployment of a Nationwide System for Normalizing Referrals in the Chilean Public Healthcare System
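methods: The paper proposes a two-step approach: a state-of-the-art NER model first recognizes disease mentions, and an Elasticsearch-based search engine then assigns the most relevant disease codes to those mentions.
results: Experiments show that the system assigns disease codes automatically, with MAP scores of 0.63 at the subcategory level and 0.83 at the category level.
Abstract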
The disease coding task involves assigning a unique identifier from a controlled vocabulary to each disease mentioned in a clinical document. This task is relevant since it allows information extraction from unstructured data to perform, for example, epidemiological studies about the incidence and prevalence of diseases in a determined context. However, the manual coding process is subject to errors as it requires medical personnel to be competent in coding rules and terminology. In addition, this process consumes a lot of time and energy, which could be allocated to more clinically relevant tasks. These difficulties can be addressed by developing computational systems that automatically assign codes to diseases. In this way, we propose a two-step system for automatically coding diseases in referrals from the Chilean public healthcare system. Specifically, our model uses a state-of-the-art NER model for recognizing disease mentions and a search engine system based on Elasticsearch for assigning the most relevant codes associated with these disease mentions. The system's performance was evaluated on referrals manually coded by clinical experts. Our system obtained a MAP score of 0.63 for the subcategory level and 0.83 for the category level, close to the best-performing models in the literature. This system could be a support tool for health professionals, optimizing the coding and management process. Finally, to guarantee reproducibility, we publicly release the code of our models and experiments.
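Summary
The disease coding task assigns each disease mentioned in a clinical document a unique identifier from a controlled vocabulary. The task matters because it enables information extraction from unstructured data, for example for epidemiological studies of disease incidence and prevalence. However, manual coding is error-prone, since it requires medical personnel to be competent in coding rules and terminology. It also consumes time and energy that could go to more clinically relevant tasks. To address these difficulties, we propose a two-step system for automatically coding diseases in referrals from the Chilean public healthcare system. Specifically, our model uses a state-of-the-art NER model to recognize disease mentions and an Elasticsearch-based search engine to assign the most relevant codes to them. Evaluated on referrals manually coded by clinical experts, the system obtained a MAP score of 0.63 at the subcategory level and 0.83 at the category level, close to the best-performing models in the literature. The system could serve as a support tool for health professionals, optimizing the coding and management process. Finally, to guarantee reproducibility, we publicly release the code of our models and experiments.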
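The reported metric, mean average precision (MAP), can be computed as in the Python sketch below. This is the standard textbook definition and may differ in detail from the authors' evaluation script. The code strings are placeholders, not real disease codes, and each ranked list is assumed to contain no duplicates.

```python
def average_precision(ranked_codes, relevant_codes):
    """Average precision for one disease mention.

    ranked_codes: codes returned by the search step, best first.
    relevant_codes: codes that the expert annotators assigned.
    """
    relevant = set(relevant_codes)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of average precision over (ranked_codes, relevant_codes) pairs."""
    if not queries:
        raise ValueError("queries must not be empty")
    return sum(average_precision(r, g) for r, g in queries) / len(queries)


# Placeholder data (assumption): two mentions with one gold code each.
queries = [
    (["code_a", "code_b", "code_c"], ["code_a"]),  # gold code ranked first
    (["code_d", "code_e"], ["code_e"]),            # gold code ranked second
]
print(mean_average_precision(queries))  # 0.75
```

A mention whose correct code is ranked first contributes 1.0 and one whose correct code is ranked second contributes 0.5, so the score rewards placing the right code near the top of the search results.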