results: We evaluate the pipeline's effectiveness using standard metrics.
Abstract
With the increase in video-sharing platforms across the internet, it is difficult for humans to moderate the data for explicit content. Hence, an automated pipeline to scan through video data for explicit content has become the need of the hour. We propose a novel pipeline that uses multi-modal deep learning to first extract the explicit segments of input videos and then summarize their content using text to determine its age appropriateness and age rating. Finally, we evaluate our pipeline's effectiveness using standard metrics.
Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models
results: We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset. We compare our labeling process with human annotations and show improved performance on downstream tasks such as object goal navigation and part discovery.
Abstract
The image annotation stage is a critical and often the most time-consuming part required for training and evaluating object detection and semantic segmentation models. Deployment of the existing models in novel environments often requires detecting novel semantic classes not present in the training data. Furthermore, indoor scenes contain significant viewpoint variations, which need to be handled properly by trained perception models. We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage, which considers the setting where multiple views of the scenes are available and can be used to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset. We evaluate the quality of our labeling process by comparing it with human annotations. Also, we demonstrate the effectiveness of the obtained labels in downstream tasks such as object goal navigation and part discovery. In the context of object goal navigation, we depict enhanced performance using this fusion approach compared to a zero-shot baseline that utilizes large monolithic vision-language pre-trained models.
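As an illustration of the multi-view fusion idea, the sketch below shows only the reconciliation step: given per-view class predictions for points that have already been matched across views, a majority vote overwrites inconsistent single-view labels. The cross-view matching itself and the names used here are assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_multiview_labels(per_view_labels):
    """Majority-vote fusion of semantic labels for points matched across views.

    per_view_labels: (n_views, n_points) integer class ids; -1 marks 'not visible'.
    Returns fused (n_points,) labels; ties keep the smallest class id.
    """
    n_views, n_points = per_view_labels.shape
    fused = np.full(n_points, -1)
    for p in range(n_points):
        votes = per_view_labels[:, p]
        votes = votes[votes >= 0]              # ignore views where the point is unseen
        if votes.size:
            vals, counts = np.unique(votes, return_counts=True)
            fused[p] = vals[counts.argmax()]   # majority class wins
    return fused
```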
Token-level Adaptation of LoRA Adapters for Downstream Task Generalization
results: The results show that token-level adaptation of LoRA adapters outperforms the base Llama-2-7b model on mathematical (GSM8K), scientific (ARC-Challenge), reading comprehension (SQuAD), and coding (CodeAlpaca-20k) tasks.
Abstract
This paper introduces a method for adapting LoRA adapters in smaller-sized language models to arbitrary downstream tasks. Unlike standard mixture-of-expert architectures, our method employs a gradient-free routing function to choose a weighted combination of experts without increasing the compute requirements for training or inference. The results show that token-level adaptation of LoRA adapters outperforms the base Llama-2-7b model across mathematical (GSM8K), scientific (ARC-Challenge), reading comprehension (SQuAD), and coding (CodeAlpaca-20k) tasks. Further evaluations also show that the average performance of token-level adaptation outperforms individual models fine-tuned for each of the tasks with the best performance observed in adaptation of every-other token during inference. The code for this study is made available through a public repository.
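The abstract describes a gradient-free routing function that mixes several LoRA experts per token without extra training cost. The paper's routing details are not given here, so the sketch below is only an illustration under the assumption that each token's hidden state is compared by cosine similarity to one key vector per expert; the names `expert_keys`, `lora_A`, and `lora_B` are hypothetical.

```python
import torch
import torch.nn.functional as F

def token_level_lora_mix(hidden, expert_keys, lora_A, lora_B, scaling=1.0):
    """Mix several LoRA experts per token with a gradient-free routing function.

    hidden:      (batch, seq, d_model) token hidden states
    expert_keys: (n_experts, d_model)  one key vector per expert (assumed)
    lora_A:      (n_experts, d_model, r)  LoRA down-projections
    lora_B:      (n_experts, r, d_model)  LoRA up-projections
    """
    # Gradient-free routing: cosine similarity between each token and each expert key.
    with torch.no_grad():
        weights = F.softmax(
            F.normalize(hidden, dim=-1) @ F.normalize(expert_keys, dim=-1).T,
            dim=-1,
        )                                              # (batch, seq, n_experts)

    # Per-expert LoRA update h -> h A_e B_e, then weight and sum over experts.
    updates = torch.einsum("bsd,edr,erk->bsek", hidden, lora_A, lora_B)
    delta = torch.einsum("bse,bsek->bsk", weights, updates)
    return hidden + scaling * delta
```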
Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
paper_authors: Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi
for: The paper aims to improve the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences.
methods: The authors use a number of advances in open resources for instruction tuning, including better base models and new finetuning techniques, to improve the TÜLU models. They release a suite of improved models, including TÜLU-V2-mix, TÜLU 2, TÜLU 2+DPO, and CODE TÜLU 2.
results: The authors evaluate the TÜLU 2 suite on multiple benchmarks and show that it achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks.
Abstract
Since the release of TÜLU [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. Concretely, we release: (1) TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2) TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple perspectives shows that the TÜLU 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data, training and evaluation code to facilitate future open efforts on adapting large language models.
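The TÜLU 2+DPO models are trained with direct preference optimization. As a reference for what that objective looks like, here is a minimal sketch of the standard DPO loss on summed log-probabilities of chosen and rejected responses; the variable names are illustrative and this is not the authors' training code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: widen the policy's preference margin over a frozen reference."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected responses, scaled by beta.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```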
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
results: Our experimental results show that these "attentionless Transformers" can rival the performance of the original architecture. Through rigorous ablation studies and experiments with various replacement network types and sizes, we demonstrate the viability of our approach.
Abstract
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
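The abstract replaces attention blocks with shallow feed-forward networks trained by knowledge distillation from the original attention outputs. The exact replacement architectures are only summarized here, so the following is a hedged sketch: a one-hidden-layer MLP that maps the flattened, fixed-length input sequence to the attention block's output, trained with an MSE distillation loss.

```python
import torch
import torch.nn as nn

class AttentionlessBlock(nn.Module):
    """Shallow FFN that imitates an attention layer on fixed-length (padded) inputs."""
    def __init__(self, d_model=512, max_len=128, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(max_len * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, max_len * d_model),
        )

    def forward(self, x):                    # x: (batch, max_len, d_model)
        b, t, d = x.shape
        return self.net(x.reshape(b, -1)).reshape(b, t, d)

def distill_step(student, teacher_attn, x, optimizer):
    """One distillation step: match the frozen teacher attention output with MSE."""
    with torch.no_grad():
        target = teacher_attn(x)             # original attention layer's output
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```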
Countering Misinformation via Emotional Response Generation
results: We present the first large-scale claim-response dataset (roughly 12 thousand claim-response pairs linked to debunking articles), covering both SMP style and basic emotions, two factors that play a significant role in misinformation credibility and spread. Large-scale experiments show that models trained on the dataset achieve significant improvements in output quality and generalization capabilities.
Abstract
The proliferation of misinformation on social media platforms (SMPs) poses a significant danger to public health, social cohesion and ultimately democracy. Previous research has shown how social correction can be an effective way to curb misinformation, by engaging directly in a constructive dialogue with users who spread -- often in good faith -- misleading messages. Although professional fact-checkers are crucial to debunking viral claims, they usually do not engage in conversations on social media. Thereby, significant effort has been made to automate the use of fact-checker material in social correction; however, no previous work has tried to integrate it with the style and pragmatics that are commonly employed in social media communication. To fill this gap, we present VerMouth, the first large-scale dataset comprising roughly 12 thousand claim-response pairs (linked to debunking articles), accounting for both SMP-style and basic emotions, two factors which have a significant role in misinformation credibility and spreading. To collect this dataset we used a technique based on an author-reviewer pipeline, which efficiently combines LLMs and human annotators to obtain high-quality data. We also provide comprehensive experiments showing how models trained on our proposed dataset have significant improvements in terms of output quality and generalization capabilities.
Detection of Offensive and Threatening Online Content in a Low Resource Language
paper_authors: Fatima Muhammad Adam, Abubakar Yakubu Zandam, Isa Inuwa-Dutse
for: This study aimed to address the lack of detection systems for offensive and threatening language in Hausa, a low-resource language spoken by over 100 million people in Africa.
methods: The study consisted of two user studies (n=308) to investigate cyberbullying-related issues, collecting and annotating the first set of offensive and threatening datasets in Hausa, and developing a detection system to flag offensive and threatening content.
results: The detection system was able to detect more than 70% of offensive and threatening content, but many of these were mistranslated by Google's translation engine. The study highlights the need for a more effective detection system, which can be achieved by involving diverse stakeholders in understanding local conventions and demographics.
Abstract
Hausa is a major Chadic language, spoken by over 100 million people in Africa. However, from a computational linguistic perspective, it is considered a low-resource language, with limited resources to support Natural Language Processing (NLP) tasks. Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language, which can go undetected due to the lack of detection systems designed for Hausa. This study aimed to address this issue by (1) conducting two user studies (n=308) to investigate cyberbullying-related issues, (2) collecting and annotating the first set of offensive and threatening datasets to support relevant downstream tasks in Hausa, (3) developing a detection system to flag offensive and threatening content, and (4) evaluating the detection system and the efficacy of the Google-based translation engine in detecting offensive and threatening terms in Hausa. We found that offensive and threatening content is quite common, particularly when discussing religion and politics. Our detection system was able to detect more than 70% of offensive and threatening content, although many of these were mistranslated by Google's translation engine. We attribute this to the subtle relationship between offensive and threatening content and idiomatic expressions in the Hausa language. We recommend that diverse stakeholders participate in understanding local conventions and demographics in order to develop a more effective detection system. These insights are essential for implementing targeted moderation strategies to create a safe and inclusive online environment.
When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content
results: Experiments show that the pseudo-labeling approach can produce a high-quality dataset; we provide a fundamental statistical analysis of the data and an evaluation of the models used for pseudo-labelling.
Abstract
Many under-resourced languages require high-quality datasets for specific tasks such as offensive language detection, disinformation, or misinformation identification. However, the intricacies of the content may have a detrimental effect on the annotators. The article aims to revisit an approach of pseudo-labeling sensitive data on the example of Ukrainian tweets covering the Russian-Ukrainian war. Nowadays, this acute topic is in the spotlight of various language manipulations that cause numerous disinformation and profanity on social media platforms. The conducted experiment highlights three main stages of data annotation and underlines the main obstacles during machine annotation. Ultimately, we provide a fundamental statistical analysis of the obtained data, evaluation of models used for pseudo-labelling, and set further guidelines on how the scientists can leverage the corpus to execute more advanced research and extend the existing data samples without annotators' engagement.
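The pseudo-labeling procedure itself is described only at a high level. A minimal sketch of the generic approach (train a classifier on a small seed set, label the unlabeled tweets, keep only high-confidence predictions) might look like the following; the 0.9 threshold and the TF-IDF/logistic-regression model are assumptions rather than the authors' settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def pseudo_label(seed_texts, seed_labels, unlabeled_texts, threshold=0.9):
    """Return (texts, labels) for unlabeled examples the seed model is confident about."""
    vec = TfidfVectorizer(max_features=50_000)
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(seed_texts), seed_labels)

    probs = clf.predict_proba(vec.transform(unlabeled_texts))
    keep, labels = [], []
    for text, p in zip(unlabeled_texts, probs):
        if p.max() >= threshold:             # keep only confident pseudo-labels
            keep.append(text)
            labels.append(clf.classes_[p.argmax()])
    return keep, labels
```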
Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language
paper_authors: Kasun Wickramasinghe, Nisansa de Silva
for: This paper aims to align Sinhala and English word embedding spaces, addressing the lack of attention on low-resource languages in previous research.
methods: The authors use available alignment techniques and introduce a benchmark for Sinhala language embedding alignment, as well as an intermediate task of creating Sinhala-English alignment datasets.
results: While the results are not comparable to those of high-resource languages, the paper lays the groundwork for more specialized alignment between English and Sinhala embeddings.
Abstract
Since their inception, embeddings have become a primary ingredient in many flavours of Natural Language Processing (NLP) tasks supplanting earlier types of representation. Even though multilingual embeddings have been used for the increasing number of multilingual tasks, due to the scarcity of parallel training data, low-resource languages such as Sinhala, tend to focus more on monolingual embeddings. Then when it comes to the aforementioned multi-lingual tasks, it is challenging to utilize these monolingual embeddings given that even if the embedding spaces have a similar geometric arrangement due to an identical training process, the embeddings of the languages considered are not aligned. This is solved by the embedding alignment task. Even in this, high-resource language pairs are in the limelight while low-resource languages such as Sinhala which is in dire need of help seem to have fallen by the wayside. In this paper, we try to align Sinhala and English word embedding spaces based on available alignment techniques and introduce a benchmark for Sinhala language embedding alignment. In addition to that, to facilitate the supervised alignment, as an intermediate task, we also introduce Sinhala-English alignment datasets. These datasets serve as our anchor datasets for supervised word embedding alignment. Even though we do not obtain results comparable to the high-resource languages such as French, German, or Chinese, we believe our work lays the groundwork for more specialized alignment between English and Sinhala embeddings.
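The supervised alignment is anchored on Sinhala-English word pairs. The abstract does not single out one technique, but a common supervised baseline is orthogonal Procrustes, sketched below under the assumption that `src` and `tgt` are row-aligned embedding matrices for the anchor word pairs.

```python
import numpy as np

def procrustes_align(src, tgt):
    """Orthogonal map W minimizing ||src @ W - tgt||_F (orthogonal Procrustes).

    src: (n_pairs, d) source-language (e.g. Sinhala) embeddings of anchor words
    tgt: (n_pairs, d) target-language (e.g. English) embeddings, row-aligned
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt                    # apply as: aligned_src = src @ W

# Usage sketch: W = procrustes_align(sinhala_anchor_vecs, english_anchor_vecs)
```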
Causal Graph in Language Model Rediscovers Cortical Hierarchy in Human Narrative Processing
results: The study finds that the brain prediction accuracy maps for these two feature groups differ in a pattern consistent with the cortical hierarchy revealed by activity time constants, suggesting a commonality between how language models and the human brain process linguistic information.
Abstract
Understanding how humans process natural language has long been a vital research direction. The field of natural language processing (NLP) has recently experienced a surge in the development of powerful language models. These models have proven to be invaluable tools for studying another complex system known to process human language: the brain. Previous studies have demonstrated that the features of language models can be mapped to fMRI brain activity. This raises the question: is there a commonality between information processing in language models and the human brain? To estimate information flow patterns in a language model, we examined the causal relationships between different layers. Drawing inspiration from the workspace framework for consciousness, we hypothesized that features integrating more information would more accurately predict higher hierarchical brain activity. To validate this hypothesis, we classified language model features into two categories based on causal network measures: 'low in-degree' and 'high in-degree'. We subsequently compared the brain prediction accuracy maps for these two groups. Our results reveal that the difference in prediction accuracy follows a hierarchical pattern, consistent with the cortical hierarchy map revealed by activity time constants. This finding suggests a parallel between how language models and the human brain process linguistic information.
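The analysis splits language-model features by the in-degree of their nodes in an estimated causal graph and compares how well each group predicts brain activity. The causal-graph estimation itself is not detailed in this abstract, so the sketch below assumes a directed adjacency matrix is already available and uses ridge regression as an illustrative encoding model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def compare_in_degree_groups(adjacency, features, fmri, alpha=1.0):
    """Split features by causal-graph in-degree and compare brain prediction accuracy.

    adjacency: (n_feat, n_feat) binary matrix, adjacency[i, j] = 1 if i -> j
    features:  (n_samples, n_feat) language-model features per stimulus
    fmri:      (n_samples,) response of one voxel / region
    """
    in_degree = adjacency.sum(axis=0)                  # edges arriving at each feature
    high = in_degree >= np.median(in_degree)

    scores = {}
    for name, mask in [("high_in_degree", high), ("low_in_degree", ~high)]:
        scores[name] = cross_val_score(Ridge(alpha=alpha), features[:, mask], fmri, cv=5).mean()
    return scores
```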
Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads
results: The study finds that biased attention heads associated with gender and racial bias exist in both types of Transformer-based PLMs for the English language, and that these heads behave differently across models. The results offer new insight into and guidance for understanding bias behavior in PLMs.
Abstract
Transformer-based pretrained large language models (PLM) such as BERT and GPT have achieved remarkable success in NLP tasks. However, PLMs are prone to encoding stereotypical biases. Although a burgeoning literature has emerged on stereotypical bias mitigation in PLMs, such as work on debiasing gender and racial stereotyping, how such biases manifest and behave internally within PLMs remains largely unknown. Understanding the internal stereotyping mechanisms may allow better assessment of model fairness and guide the development of effective mitigation strategies. In this work, we focus on attention heads, a major component of the Transformer architecture, and propose a bias analysis framework to explore and identify a small set of biased heads that are found to contribute to a PLM's stereotypical bias. We conduct extensive experiments to validate the existence of these biased heads and to better understand how they behave. We investigate gender and racial bias in the English language in two types of Transformer-based PLMs: the encoder-based BERT model and the decoder-based autoregressive GPT model. Overall, the results shed light on understanding the bias behavior in pretrained language models.
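The proposed framework attributes a PLM's stereotypical bias to a small set of attention heads. The paper's exact scoring procedure is not reproduced here; the sketch below illustrates one simple way to inspect heads, by reading how much attention each head sends from an attribute word to a target-group word in a probe sentence (`probe`, `group_idx`, and `attr_idx` are illustrative names for token positions in the tokenized probe).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

def head_attention_scores(probe, attr_idx, group_idx):
    """Return a (num_layers, num_heads) matrix of attention weight from the
    attribute token position to the target-group token position in `probe`."""
    enc = tokenizer(probe, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**enc).attentions       # tuple of (1, heads, seq, seq)
    stacked = torch.stack(attentions).squeeze(1)   # (layers, heads, seq, seq)
    return stacked[:, :, attr_idx, group_idx]

# Heads whose scores differ most between stereotypical and anti-stereotypical
# probes would be candidates for "biased heads" under this illustrative metric.
```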
FOAL: Fine-grained Contrastive Learning for Cross-domain Aspect Sentiment Triplet Extraction
methods: The study proposes Fine-grained cOntrAstive Learning (FOAL) to reduce the domain discrepancy while preserving the discriminability of each category, thereby improving ASTE performance.
results: Experiments on six transfer pairs show that FOAL improves ASTE performance by 6% while significantly reducing the domain discrepancy.
Abstract
Aspect Sentiment Triplet Extraction (ASTE) has achieved promising results while relying on sufficient annotation data in a specific domain. However, it is infeasible to annotate data for each individual domain. We propose to explore ASTE in the cross-domain setting, which transfers knowledge from a resource-rich source domain to a resource-poor target domain, thereby alleviating the reliance on labeled data in the target domain. To effectively transfer the knowledge across domains and extract the sentiment triplets accurately, we propose a method named Fine-grained cOntrAstive Learning (FOAL) to reduce the domain discrepancy and preserve the discriminability of each category. Experiments on six transfer pairs show that FOAL achieves 6% performance gains and reduces the domain discrepancy significantly compared with strong baselines. Our code will be publicly available once accepted.
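FOAL reduces the domain discrepancy while keeping categories separable via fine-grained contrastive learning. The authors' code is not reproduced here; the sketch below is a generic supervised contrastive loss over pooled features from both domains, which captures the "pull same category together across domains, push different categories apart" idea under assumed inputs.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Pull together features sharing a category label (across domains), push apart the rest.

    features: (n, d) pooled representations from source and target batches
    labels:   (n,)   category ids (e.g. aspect / opinion / sentiment classes)
    """
    z = F.normalize(features, dim=-1)
    sim = z @ z.T / temperature                               # (n, n)
    mask_same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-category pairs
    mask_self = torch.eye(len(labels), dtype=torch.bool, device=z.device)

    # Log-softmax over all non-self pairs, then average over the positive pairs.
    log_prob = sim - torch.logsumexp(sim.masked_fill(mask_self, float("-inf")), dim=1, keepdim=True)
    pos = mask_same & ~mask_self
    per_anchor = (log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor[pos.any(1)].mean()
```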
Exploring the Relationship between In-Context Learning and Instruction Tuning
results: The study finds that both ICL and IT change the LLM's hidden states, and that ICL is an implicit form of IT. Moreover, the degree of convergence between ICL and IT depends on several factors related to the provided demonstrations. The work offers a new perspective on understanding LLM behavior.
Abstract
In-Context Learning (ICL) and Instruction Tuning (IT) are two primary paradigms of adopting Large Language Models (LLMs) to downstream applications. However, they are significantly different. In ICL, a set of demonstrations are provided at inference time but the LLM's parameters are not updated. In IT, a set of demonstrations are used to tune LLM's parameters in training time but no demonstrations are used at inference time. Although a growing body of literature has explored ICL and IT, studies on these topics have largely been conducted in isolation, leading to a disconnect between these two paradigms. In this work, we explore the relationship between ICL and IT by examining how the hidden states of LLMs change in these two paradigms. Through carefully designed experiments conducted with LLaMA-2 (7B and 13B), we find that ICL is implicit IT. In other words, ICL changes an LLM's hidden states as if the demonstrations were used to instructionally tune the model. Furthermore, the convergence between ICL and IT is largely contingent upon several factors related to the provided demonstrations. Overall, this work offers a unique perspective to explore the connection between ICL and IT and sheds light on understanding the behaviors of LLM.
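The study compares how hidden states move under ICL versus IT. As a hedged illustration of the kind of measurement involved (not the authors' exact protocol), the sketch below extracts the final-layer hidden state of a query with and without an in-context demonstration and computes their cosine similarity; the model name is only a stand-in.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"           # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True).eval()

def last_layer_state(text):
    """Hidden state of the final token at the last layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[-1][0, -1]      # (hidden_size,)

query = "Question: What is the capital of France? Answer:"
demos = "Question: What is the capital of Italy? Answer: Rome.\n"

plain = last_layer_state(query)              # zero-shot hidden state
icl = last_layer_state(demos + query)        # same query preceded by a demonstration
print("cosine similarity:", F.cosine_similarity(plain, icl, dim=0).item())
# Repeating this against a model fine-tuned on the demonstrations (IT) lets one ask
# whether ICL shifts the hidden states in a similar direction.
```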
Complementary Advantages of ChatGPTs and Human Readers in Reasoning: Evidence from English Text Reading Comprehension
paper_authors: Tongquan Zhou, Yao Zhang, Siyi Cao, Yulu Li, Tao Wang
for: To investigate how ChatGPTs and Chinese senior school students exhibit their reasoning ability when reading English narrative texts.
methods: Three reasoning tests were used: Test 1 for commonsense inference, Test 2 for emotional inference, and Test 3 for causal inference.
results: ChatGPTs outperformed the students in daily-life inferences and positive emotions, but the students showed superiority in negative emotions and logical analysis. ChatGPT Plus excelled under the updated-command condition.
Abstract
ChatGPT has shown great power in text processing, including reasoning over text it reads. However, there has been no direct comparison between human readers and ChatGPT in reasoning ability related to text reading. This study investigated how ChatGPTs (i.e., ChatGPT and ChatGPT Plus) and Chinese senior school students, as ESL learners, exhibited their reasoning ability from English narrative texts. Additionally, we compared the two ChatGPTs' reasoning performance when commands were elaborately updated. The study comprised three reasoning tests: Test 1 for commonsense inference, Test 2 for emotional inference, and Test 3 for causal inference. The results showed that in Test 1, the students outdid the two ChatGPT versions in local-culture-related inferences but performed worse than the chatbots in daily-life inferences. In Test 2, ChatGPT Plus excelled whereas ChatGPT lagged behind in accuracy; in terms of both accuracy and frequency of correct responses, the students were inferior to the two chatbots. While ChatGPTs performed better on positive emotions, the students showed superiority in inferring negative emotions. In Test 3, the students demonstrated better logical analysis, outdoing both chatbots. Under the updated command condition, ChatGPT Plus displayed good causal reasoning ability while ChatGPT remained unchanged. Our study reveals that human readers and ChatGPTs have their respective advantages and disadvantages in drawing inferences from text reading comprehension, revealing a complementary relationship in text-based reasoning.
Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking
paper_authors: Hong Liu, Yucheng Cai, Yuan Zhou, Zhijian Ou, Yi Huang, Junlan Feng
for: This study addresses continual learning of dialog state tracking (DST) in the class-incremental scenario, where task identities are unknown during testing.
methods: We propose a prompt pool method that maintains a pool of key-value paired prompts and selects prompts according to the distance between the dialog history and the prompt keys. The method can automatically identify tasks and select appropriate prompts during testing.
results: Experiments on the Schema-Guided Dialog dataset (SGD) and a dataset collected from a real-world dialog application show that the prompt pool method achieves much higher joint goal accuracy than the baseline; combining it with a rehearsal buffer further improves model performance.
Abstract
Continual learning is crucial for dialog state tracking (DST) in dialog systems, since requirements from users for new functionalities are often encountered. However, most of existing continual learning methods for DST require task identities during testing, which is a severe limit in real-world applications. In this paper, we aim to address continual learning of DST in the class-incremental scenario (namely the task identity is unknown in testing). Inspired by the recently emerging prompt tuning method that performs well on dialog systems, we propose to use the prompt pool method, where we maintain a pool of key-value paired prompts and select prompts from the pool according to the distance between the dialog history and the prompt keys. The proposed method can automatically identify tasks and select appropriate prompts during testing. We conduct experiments on Schema-Guided Dialog dataset (SGD) and another dataset collected from a real-world dialog application. Experiment results show that the prompt pool method achieves much higher joint goal accuracy than the baseline. After combining with a rehearsal buffer, the model performance can be further improved.
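The core mechanism is a pool of key-value paired prompts, with prompts selected by the distance between an encoding of the dialog history and the prompt keys. A minimal sketch of that selection step might look like the following; the history encoder, pool sizes, and top-k value are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    """Key-value prompt pool: select the prompts whose keys are closest to the
    encoded dialog history (illustrative of the selection mechanism only)."""
    def __init__(self, pool_size=20, prompt_len=8, d_model=768, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, d_model))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, d_model))
        self.top_k = top_k

    def forward(self, history_emb):                   # (batch, d_model)
        # Smaller distance = better match between dialog history and prompt key.
        dist = torch.cdist(history_emb, self.keys)    # (batch, pool_size)
        idx = dist.topk(self.top_k, largest=False).indices
        selected = self.prompts[idx]                  # (batch, top_k, prompt_len, d)
        # Concatenate the chosen prompts; they would then be prepended to the input.
        return selected.flatten(1, 2)                 # (batch, top_k * prompt_len, d)
```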
Energy and Carbon Considerations of Fine-Tuning BERT
paper_authors: Xiaorong Wang, Clara Na, Emma Strubell, Sorelle Friedler, Sasha Luccioni
for: This paper aims to provide a comprehensive understanding of the energy and carbon footprint of fine-tuning in NLP, in order to better characterize the role of fine-tuning in the landscape of energy and carbon emissions.
methods: The paper uses a careful empirical study of the computational costs of fine-tuning across tasks, datasets, hardware infrastructure, and measurement modalities to place fine-tuning energy and carbon costs into perspective with respect to pre-training and inference.
results: The paper outlines recommendations to NLP researchers and practitioners who wish to improve their fine-tuning energy efficiency.
Abstract
Despite the popularity of the `pre-train then fine-tune' paradigm in the NLP community, existing work quantifying energy costs and associated carbon emissions has largely focused on language model pre-training. Although a single pre-training run draws substantially more energy than fine-tuning, fine-tuning is performed more frequently by many more individual actors, and thus must be accounted for when considering the energy and carbon footprint of NLP. In order to better characterize the role of fine-tuning in the landscape of energy and carbon emissions in NLP, we perform a careful empirical study of the computational costs of fine-tuning across tasks, datasets, hardware infrastructure and measurement modalities. Our experimental results allow us to place fine-tuning energy and carbon costs into perspective with respect to pre-training and inference, and outline recommendations to NLP researchers and practitioners who wish to improve their fine-tuning energy efficiency.
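The paper measures energy across hardware setups and measurement modalities; the specific tooling is not named in this abstract, so the sketch below shows one common option, the codecarbon package, wrapped around a fine-tuning loop, purely as an illustration of how such measurements can be collected.

```python
from codecarbon import EmissionsTracker

def finetune_with_tracking(train_fn, run_name="bert-finetune"):
    """Run `train_fn()` while logging energy use and estimated CO2 emissions."""
    tracker = EmissionsTracker(project_name=run_name, output_dir="emissions_logs")
    tracker.start()
    try:
        train_fn()                        # your usual fine-tuning loop goes here
    finally:
        emissions_kg = tracker.stop()     # returns estimated kg CO2-equivalent
    print(f"{run_name}: ~{emissions_kg:.4f} kg CO2eq")
```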
Diagnosing and Debiasing Corpus-Based Political Bias and Insults in GPT2
for: This study aims to investigate the effectiveness of a decoding algorithm in mitigating insults and political bias in generated text, with the goal of contributing to the ongoing effort of examining the ethical and social implications of human-AI interaction.
methods: The study uses generative pretrained transformer (GPT) language models that have the ability to recognize and detect toxicity in generated content, and a decoding algorithm that allows the models to self-debias and reduce the likelihood of generating harmful text.
results: The study aims to evaluate the efficacy of the diagnosing-debiasing approach in mitigating insults and political bias in generated text, and contribute to the ongoing effort of understanding the ethical and social implications of human-AI interaction.
Abstract
The training of large language models (LLMs) on extensive, unfiltered corpora sourced from the internet is a common and advantageous practice. Consequently, LLMs have learned and inadvertently reproduced various types of biases, including violent, offensive, and toxic language. However, recent research shows that generative pretrained transformer (GPT) language models can recognize their own biases and detect toxicity in generated content, a process referred to as self-diagnosis. In response, researchers have developed a decoding algorithm that allows LLMs to self-debias, or reduce their likelihood of generating harmful text. This study investigates the efficacy of the diagnosing-debiasing approach in mitigating two additional types of biases: insults and political bias. These biases are often used interchangeably in discourse, despite exhibiting potentially dissimilar semantic and syntactic properties. We aim to contribute to the ongoing effort of investigating the ethical and social implications of human-AI interaction.
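Self-diagnosis, as described here, asks the model itself whether a piece of text exhibits a given attribute. The sketch below is a simplified version of that idea using GPT-2, comparing the probabilities of "Yes" versus "No" after a diagnosis question; the template wording is an assumption, not the authors' exact prompt.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def self_diagnose(text, attribute="an insult"):
    """Ask the model whether `text` contains `attribute`; return P(Yes) vs P(No)."""
    prompt = f'"{text}"\nQuestion: Does the above text contain {attribute}?\nAnswer:'
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]             # next-token distribution
    yes = logits[tok.encode(" Yes")[0]]
    no = logits[tok.encode(" No")[0]]
    probs = torch.softmax(torch.stack([yes, no]), dim=0)
    return {"yes": probs[0].item(), "no": probs[1].item()}
```

A self-debiasing decoder builds on this signal by down-weighting next tokens that become more likely under a bias-inducing prefix, but that step is omitted here for brevity.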