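results: Optimizing the control tokens with parameter-efficient tuning (e.g., prompt tuning and low-rank adaptation) before fine-tuning for controllable generation improves control-token quality and markedly improves controllable generation quality on two well-recognized datasets compared with prior work.
Abstract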
Aligning large language models (LLMs) with human preferences is essential for safe and useful LLMs. Previous works mainly adopt reinforcement learning (RLHF) and direct preference optimization (DPO) with human feedback for alignment. Nevertheless, they have certain drawbacks. One such limitation is that they can only align models with one preference at the training time (e.g., they cannot learn to generate concise responses when the preference data prefers detailed responses), or have certain constraints for the data format (e.g., DPO only supports pairwise preference data). To this end, prior works incorporate controllable generations for alignment to make language models learn multiple preferences and provide outputs with different preferences during inference if asked. Controllable generation also offers more flexibility with regard to data format (e.g., it supports pointwise preference data). Specifically, it uses different control tokens for different preferences during training and inference, making LLMs behave differently when required. Current controllable generation methods either use a special token or hand-crafted prompts as control tokens, and optimize them together with LLMs. As control tokens are typically much lighter than LLMs, this optimization strategy may not effectively optimize control tokens. To this end, we first use parameter-efficient tuning (e.g., prompting tuning and low-rank adaptation) to optimize control tokens and then fine-tune models for controllable generations, similar to prior works. Our approach, alignMEnt with parameter-Efficient Tuning (MEET), improves the quality of control tokens, thus improving controllable generation quality consistently by an apparent margin on two well-recognized datasets compared with prior works.
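Summary
Aligning large language models (LLMs) with human preferences is essential to keeping them safe and useful. Earlier work mainly relied on reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). These methods have drawbacks: they can align a model with only one preference at training time (for example, they cannot learn to produce concise responses when the preference data favors detailed ones), or they constrain the data format (DPO supports only pairwise preference data). To address this, prior work brought controllable generation into alignment so that a model learns several preferences and can produce responses matching the requested one at inference. Controllable generation is also more flexible about data format (it supports pointwise preference data). Concretely, it uses different control tokens for different preferences during training and inference. Existing methods use a special token or a hand-crafted prompt as the control token and optimize it jointly with the model; because control tokens are much lighter than the LLM, this joint optimization may not optimize them effectively. We therefore first optimize the control tokens with parameter-efficient tuning (e.g., prompt tuning and low-rank adaptation) and then fine-tune the model for controllable generation. Our method, alignMEnt with parameter-Efficient Tuning (MEET), improves control-token quality and consistently improves controllable generation quality by a clear margin over prior work on two well-recognized datasets.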
Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation
results: Experiments show that the method imparts the desired structural inductive bias to a Transformer, improving systematic generalization and few-shot learning on FST-like tasks.
Abstract
Strong inductive biases enable learning from little data and help generalization outside of the training distribution. Popular neural architectures such as Transformers lack strong structural inductive biases for seq2seq NLP tasks on their own. Consequently, they struggle with systematic generalization beyond the training distribution, e.g. with extrapolating to longer inputs, even when pre-trained on large amounts of text. We show how a structural inductive bias can be injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Specifically, we inject an inductive bias towards Finite State Transducers (FSTs) into a Transformer by pre-training it to simulate FSTs given their descriptions. Our experiments show that our method imparts the desired inductive bias, resulting in improved systematic generalization and better few-shot learning for FST-like tasks.
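Summary
Strong inductive biases make it possible to learn from little data and to generalize beyond the training distribution. Popular neural architectures such as Transformers lack strong structural inductive biases for seq2seq NLP tasks, so they struggle with generalization outside the training distribution, for example when extrapolating to longer inputs. We show how to inject a structural inductive bias into a seq2seq model: specifically, we pre-train a Transformer to simulate the structural transformations performed by Finite State Transducers (FSTs). Our experiments show that this improves systematic generalization and few-shot learning, particularly on FST-like tasks.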
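To make "simulating an FST" concrete, the following is a minimal sketch of what running a deterministic finite state transducer on an input string looks like. It is an illustration only, not the authors' pre-training setup: the transition-table encoding, the function name `run_fst`, and the toy transducer are all assumptions introduced here.

```python
from typing import Dict, Tuple

# Assumed encoding: (state, input_symbol) -> (next_state, output_string)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def run_fst(transitions: Transitions, start: int, text: str) -> str:
    """Feed `text` through a deterministic FST and return the emitted output."""
    state = start
    emitted = []
    for symbol in text:
        # A missing (state, symbol) pair raises KeyError: the input is rejected.
        state, output = transitions[(state, symbol)]
        emitted.append(output)
    return "".join(emitted)


# Hypothetical toy FST over {a, b}: rewrite a "b" as "x" when it directly follows an "a".
toy_fst: Transitions = {
    (0, "a"): (1, "a"),
    (0, "b"): (0, "b"),
    (1, "a"): (1, "a"),
    (1, "b"): (0, "x"),
}

print(run_fst(toy_fst, 0, "abba"))
```

In a setup like the one the abstract describes, the model would presumably receive a description of the transducer together with an input string and be trained to produce the string this function returns.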
Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
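results: Multiple ablation studies show that pretraining with self-supervised objectives substantially improves performance on table-specific tasks. For example, instruction-finetuned public models that are specialized for text question answering (QA) and were trained on table data still have considerable room for improvement on table-specific QA. The work is the first to study table-specific pretraining scaled from 770M to 11B sequence-to-sequence models, alongside a comparison with the instruction-finetuned variants.
Abstract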
Tables stored in databases and tables which are present in web pages and articles account for a large part of semi-structured data that is available on the internet. It then becomes pertinent to develop a modeling approach with large language models (LLMs) that can be used to solve diverse table tasks such as semantic parsing, question answering as well as classification problems. Traditionally, there existed separate models specialized for each task individually. It raises the question of how far can we go to build a unified model that works well on some table tasks without significant degradation on others. To that end, we attempt at creating a shared modeling approach in the pretraining stage with encoder-decoder style LLMs that can cater to diverse tasks. We evaluate our approach that continually pretrains and finetunes different model families of T5 with data from tables and surrounding context, on these downstream tasks at different model scales. Through multiple ablation studies, we observe that our pretraining with self-supervised objectives can significantly boost the performance of the models on these tasks. As an example of one improvement, we observe that the instruction finetuned public models which come specialized on text question answering (QA) and have been trained on table data still have room for improvement when it comes to table specific QA. Our work is the first attempt at studying the advantages of a unified approach to table specific pretraining when scaled from 770M to 11B sequence to sequence models while also comparing the instruction finetuned variants of the models.
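Summary
Tables stored in databases and tables found in web pages and articles make up a large share of the semi-structured data on the internet. This calls for a modeling approach based on large language models (LLMs) that can handle diverse table tasks such as semantic parsing, question answering, and classification. Traditionally, a separate specialized model was built for each task, which raises the question of how far a unified model can go in performing well on some table tasks without degrading significantly on others. To that end, we attempt a shared modeling approach at the pretraining stage with encoder-decoder LLMs. Through multiple ablation studies, we find that pretraining with self-supervised objectives significantly improves performance on these tasks. For example, public models specialized for text question answering (QA) and trained on table data can still improve on table-specific QA. Our work is the first to study the benefits of table-specific pretraining when scaling from 770M to 11B sequence-to-sequence models, while also comparing the instruction-finetuned variants.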
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
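results: The study evaluates LURE on six open-source LVLMs and achieves a 23% improvement in general object hallucination evaluation metrics over the previous best approach. LURE consistently ranks at the top in both GPT and human evaluations. Data and code are available at https://github.com/YiyangZhou/LURE.
Abstract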
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.
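Summary
Large vision-language models (LVLMs) have shown a strong ability to understand visual information through human language. However, they still suffer from object hallucination: generating descriptions that include objects not present in the image. This can hurt vision-language tasks such as visual summarization and reasoning. To address the problem, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), which rectifies object hallucination post hoc. LURE is grounded in a rigorous statistical analysis of the key factors behind object hallucination: co-occurrence (certain objects frequently appearing alongside others in images), uncertainty (objects decoded with higher uncertainty), and object position (hallucination tends to appear in the later part of the generated text). LURE can be integrated with any LVLM. Evaluated on six open-source LVLMs, it improves on the previous best approach by 23%, and it consistently ranks at the top in both GPT and human evaluations. Our data and code are available at https://github.com/YiyangZhou/LURE.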
FELM: Benchmarking Factuality Evaluation of Large Language Models
results: The study finds that although retrieval helps factuality evaluation, current LLMs are still far from able to detect factual errors faithfully.
Abstract
Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
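Summary
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area that aims to alert users to potential errors and guide the development of more reliable LLMs. Factuality evaluators, however, need suitable evaluation themselves in order to measure and drive progress. This direction remains under-explored, which substantially impedes the progress of factuality evaluators. To address this, we introduce a benchmark for Factuality Evaluation of large Language Models (FELM). In it, we collect LLM-generated responses and annotate them with fine-grained factuality labels. Whereas earlier studies concentrate on the factuality of world knowledge (e.g., Wikipedia), FELM covers factuality across several domains, including world knowledge, math, and reasoning. Our annotation works at the level of text segments, which helps pinpoint specific errors. The annotations are supplemented with predefined error types and reference links that support or contradict each statement. In our experiments, we examine several LLM-based factuality evaluators on FELM, including vanilla LLMs and LLMs augmented with retrieval and chain-of-thought. We find that although retrieval helps factuality evaluation, current LLMs are still far from detecting errors reliably.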
Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi
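for: This study aims to improve sentiment analysis for low-resource languages, presenting an exhaustive study of data augmentation for the Indic language Marathi.
methods: The paper proposes four data augmentation techniques, covering paraphrasing, back-translation, BERT-based random token replacement and named entity replacement, and GPT-based text and label generation.
results: The data augmentation methods improve Marathi sentiment analysis models in cross-domain settings, and the techniques can be extended to other low-resource languages and to general text classification tasks.
Abstract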
Sentiment analysis plays a crucial role in understanding the sentiment expressed in text data. While sentiment analysis research has been extensively conducted in English and other Western languages, there exists a significant gap in research efforts for sentiment analysis in low-resource languages. Limited resources, including datasets and NLP research, hinder the progress in this area. In this work, we present an exhaustive study of data augmentation approaches for the low-resource Indic language Marathi. Although domain-specific datasets for sentiment analysis in Marathi exist, they often fall short when applied to generalized and variable-length inputs. To address this challenge, this research paper proposes four data augmentation techniques for sentiment analysis in Marathi. The paper focuses on augmenting existing datasets to compensate for the lack of sufficient resources. The primary objective is to enhance sentiment analysis model performance in both in-domain and cross-domain scenarios by leveraging data augmentation strategies. The data augmentation approaches proposed showed a significant performance improvement for cross-domain accuracies. The augmentation methods include paraphrasing, back-translation; BERT-based random token replacement, named entity replacement, and pseudo-label generation; GPT-based text and label generation. Furthermore, these techniques can be extended to other low-resource languages and for general text classification tasks.
Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech
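results: The results indicate that the method gives a broader assessment of text-to-speech quality than intelligibility alone. It also tracks human evaluation closely (strong correlation with MOS) and can reduce the cost of human evaluation.
Abstract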
Modern speech synthesis systems have improved significantly, with synthetic speech being indistinguishable from real speech. However, efficient and holistic evaluation of synthetic speech still remains a significant challenge. Human evaluation using Mean Opinion Score (MOS) is ideal, but inefficient due to high costs. Therefore, researchers have developed auxiliary automatic metrics like Word Error Rate (WER) to measure intelligibility. Prior works focus on evaluating synthetic speech based on pre-trained speech recognition models, however, this can be limiting since this approach primarily measures speech intelligibility. In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech. Our main assumption is that by training the ASR model on the synthetic speech, the WER on real speech reflects the similarity between distributions, a broader assessment of synthetic speech quality beyond intelligibility. Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet on three recent Text-to-Speech (TTS) systems: MQTTS, StyleTTS, and YourTTS.
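Summary
Modern speech synthesis systems have improved significantly, and the gap between synthetic and real speech has become extremely small. Evaluating synthetic speech efficiently and holistically nevertheless remains a major challenge. Human evaluation with Mean Opinion Score (MOS) is ideal but costly, so researchers have developed auxiliary automatic metrics such as Word Error Rate (WER) to measure intelligibility. Earlier work mostly evaluates synthetic speech with pre-trained speech recognition models, which primarily measures intelligibility. In this paper, we propose an evaluation technique that trains an ASR model on synthetic speech and assesses its performance on real speech. Our main assumption is that, when the ASR model is trained on synthetic speech, its WER on real speech reflects the similarity between the two distributions, giving a broader assessment of synthetic speech quality than intelligibility alone. Our proposed metric correlates strongly with both MOS naturalness and MOS intelligibility when compared with SpeechLMScore and MOSNet on three recent Text-to-Speech (TTS) systems: MQTTS, StyleTTS, and YourTTS.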
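Since the proposed metric rests on the WER of an ASR model, a short sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. This is the standard textbook definition, not code from the paper; the function name `word_error_rate` and whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))
```

Under the abstract's assumption, the transcripts fed to such a function would come from an ASR model trained on synthetic speech and tested on real speech, so a lower WER suggests the synthetic distribution is closer to the real one.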
Do the Benefits of Joint Models for Relation Extraction Extend to Document-level Tasks?
results: Experiments show that joint models outperform pipeline models on sentence-level tasks, but on document-level tasks their performance drops below that of pipeline models.
Abstract
Two distinct approaches have been proposed for relational triple extraction - pipeline and joint. Joint models, which capture interactions across triples, are the more recent development, and have been shown to outperform pipeline models for sentence-level extraction tasks. Document-level extraction is a more challenging setting where interactions across triples can be long-range, and individual triples can also span across sentences. Joint models have not been applied for document-level tasks so far. In this paper, we benchmark state-of-the-art pipeline and joint extraction models on sentence-level as well as document-level datasets. Our experiments show that while joint models outperform pipeline models significantly for sentence-level extraction, their performance drops sharply below that of pipeline models for the document-level dataset.
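Summary
Two distinct approaches have been proposed for relational triple extraction: pipeline and joint. Joint models, which capture interactions across triples, are the more recent development and have been shown to perform better on sentence-level extraction. Document-level extraction is a more challenging setting, where interactions can be long-range and individual triples can span sentences. Joint models have not yet been applied to document-level tasks. In this paper, we benchmark pipeline and joint extraction models on both sentence-level and document-level datasets. Our experiments show that although joint models clearly outperform pipeline models on sentence-level extraction, their performance drops sharply on the document-level dataset, falling below that of pipeline models.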
CebuaNER: A New Baseline Cebuano Named Entity Recognition Model
paper_authors: Ma. Beatrice Emanuela Pilar, Ellyza Mari Papas, Mary Loise Buenaventura, Dane Dedoroy, Myron Darrel Montefalcon, Jay Rhald Padilla, Lany Maceda, Mideth Abisado, Joseph Marvin Imperial
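for: The study aims to provide a baseline model for named entity recognition (NER) in Cebuano.
methods: The study trains Conditional Random Field and Bidirectional LSTM algorithms on more than 4,000 annotated local news articles in Cebuano.
results: The baseline model achieves over 70% on precision, recall, and F1 across all entity tags, and shows potential efficacy in a crosslingual setup with Tagalog.
Abstract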
Despite being one of the most linguistically diverse groups of countries, computational linguistics and language processing research in Southeast Asia has struggled to match the level of countries from the Global North. Thus, initiatives such as open-sourcing corpora and the development of baseline models for basic language processing tasks are important stepping stones to encourage the growth of research efforts in the field. To answer this call, we introduce CebuaNER, a new baseline model for named entity recognition (NER) in the Cebuano language. Cebuano is the second most-used native language in the Philippines, with over 20 million speakers. To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language, retrieved from online local Cebuano platforms to train algorithms such as Conditional Random Field and Bidirectional LSTM. Our findings show promising results as a new baseline model, achieving over 70% performance on precision, recall, and F1 across all entity tags, as well as potential efficacy in a crosslingual setup with Tagalog.
Summary
Despite being one of the most linguistically diverse regions in the world, computational linguistics and language processing research in Southeast Asia has struggled to keep up with the level of countries from the Global North. To address this challenge, initiatives such as open-sourcing corpora and developing baseline models for basic language processing tasks are crucial stepping stones to encourage the growth of research efforts in the field. In response to this call, we introduce CebuaNER, a new baseline model for named entity recognition (NER) in the Cebuano language. Cebuano is the second most widely spoken native language in the Philippines, with over 20 million speakers. To build the model, we collected and annotated over 4,000 news articles, the largest dataset of any work in the language, retrieved from online local Cebuano platforms and trained algorithms such as Conditional Random Field and Bidirectional LSTM. Our findings show promising results as a new baseline model, achieving over 70% performance on precision, recall, and F1 across all entity tags, as well as potential efficacy in a crosslingual setup with Tagalog.
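The reported precision, recall, and F1 per entity tag can be illustrated with a small sketch. It scores individual tokens, which is an assumption made here for brevity; the paper may well score whole entity spans instead. The function name and the tag labels in the example are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], predicted: Sequence[str], tag: str
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 for a single entity tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == tag and p == tag)
    false_pos = sum(1 for g, p in pairs if g != tag and p == tag)
    false_neg = sum(1 for g, p in pairs if g == tag and p != tag)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical tags: one correct PER, one spurious PER, one missed PER.
gold_tags = ["PER", "O", "LOC", "PER"]
pred_tags = ["PER", "PER", "LOC", "O"]
print(precision_recall_f1(gold_tags, pred_tags, "PER"))
```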
results: Experiments show that GeRA significantly improves alignment quality in the speech-text and image-text domains, especially when only a small amount of paired data is available.
Abstract
Pretrained unimodal encoders incorporate rich semantic information into embedding space structures. To be similarly informative, multi-modal encoders typically require massive amounts of paired data for alignment and training. We introduce a semi-supervised Geometrically Regularized Alignment (GeRA) method to align the embedding spaces of pretrained unimodal encoders in a label-efficient way. Our method leverages the manifold geometry of unpaired (unlabeled) data to improve alignment performance. To prevent distortions to local geometry during the alignment process, potentially disrupting semantic neighborhood structures and causing misalignment of unobserved pairs, we introduce a geometric loss term. This term is built upon a diffusion operator that captures the local manifold geometry of the unimodal pretrained encoders. GeRA is modality-agnostic and thus can be used to align pretrained encoders from any data modalities. We provide empirical evidence of the effectiveness of our method in the domains of speech-text and image-text alignment. Our experiments demonstrate significant improvement in alignment quality compared to a variety of leading baselines, especially with a small amount of paired data, using our proposed geometric regularization.
Summary
Pretrained unimodal encoders incorporate rich semantic information into the structure of their embedding spaces. To be similarly informative, multi-modal encoders typically need massive amounts of paired data for alignment and training. We introduce a semi-supervised Geometrically Regularized Alignment (GeRA) method that aligns the embedding spaces of pretrained unimodal encoders in a label-efficient way. The method uses the manifold geometry of unpaired data to improve alignment performance. To avoid distorting local geometry, which could disrupt semantic neighborhood structures and misalign unobserved pairs, we introduce a geometric loss term built on a diffusion operator that captures the local manifold geometry of the pretrained unimodal encoders. GeRA is modality-agnostic and can therefore align pretrained encoders from any data modalities. We provide empirical evidence of its effectiveness in speech-text and image-text alignment. Our experiments show that the proposed geometric regularization significantly improves alignment quality over a variety of leading baselines, especially with a small amount of paired data.
Fewer is More: Trojan Attacks on Parameter-Efficient Fine-Tuning
for: This paper explores the security implications of parameter-efficient fine-tuning (PEFT) for pre-trained language models (PLMs), and reveals a novel attack called PETA that can successfully inject a backdoor into a PLM using PEFT.
methods: The attack uses bilevel optimization to embed a backdoor into a PLM while retaining the PLM’s task-specific performance, and the defense omits PEFT in selected layers of the backdoored PLM and unfreezes a subset of these layers’ parameters to neutralize the attack.
results: The attack is effective in terms of both attack success rate and unaffected clean accuracy, even after the victim user performs PEFT over the backdoored PLM using untainted data. The defense is effective in neutralizing the attack.
Abstract
Parameter-efficient fine-tuning (PEFT) enables efficient adaptation of pre-trained language models (PLMs) to specific tasks. By tuning only a minimal set of (extra) parameters, PEFT achieves performance comparable to full fine-tuning. However, despite its prevalent use, the security implications of PEFT remain largely unexplored. In this paper, we conduct a pilot study revealing that PEFT exhibits unique vulnerability to trojan attacks. Specifically, we present PETA, a novel attack that accounts for downstream adaptation through bilevel optimization: the upper-level objective embeds the backdoor into a PLM while the lower-level objective simulates PEFT to retain the PLM's task-specific performance. With extensive evaluation across a variety of downstream tasks and trigger designs, we demonstrate PETA's effectiveness in terms of both attack success rate and unaffected clean accuracy, even after the victim user performs PEFT over the backdoored PLM using untainted data. Moreover, we empirically provide possible explanations for PETA's efficacy: the bilevel optimization inherently 'orthogonalizes' the backdoor and PEFT modules, thereby retaining the backdoor throughout PEFT. Based on this insight, we explore a simple defense that omits PEFT in selected layers of the backdoored PLM and unfreezes a subset of these layers' parameters, which is shown to effectively neutralize PETA.
Summary
Parameter-efficient fine-tuning (PEFT) enables efficient adaptation of pre-trained language models (PLMs) to specific tasks. By tuning only a small set of (extra) parameters, PEFT achieves performance comparable to full fine-tuning. Despite its widespread use, however, the security implications of PEFT remain largely unexplored. In this paper, we conduct a pilot study revealing that PEFT is uniquely vulnerable to trojan attacks. Specifically, we present PETA, a novel attack that accounts for downstream adaptation through bilevel optimization: the upper-level objective embeds the backdoor into a PLM while the lower-level objective simulates PEFT to retain the PLM's task-specific performance. With extensive evaluation across a variety of downstream tasks and trigger designs, we demonstrate PETA's effectiveness in terms of both attack success rate and unaffected clean accuracy, even after the victim user performs PEFT over the backdoored PLM using untainted data. Moreover, we empirically provide possible explanations for PETA's efficacy: the bilevel optimization inherently 'orthogonalizes' the backdoor and PEFT modules, thereby retaining the backdoor throughout PEFT. Based on this insight, we explore a simple defense that omits PEFT in selected layers of the backdoored PLM and unfreezes a subset of these layers' parameters, which is shown to effectively neutralize PETA.
Wavelet Scattering Transform for Improving Generalization in Low-Resourced Spoken Language Identification
results: Compared with MFCC, WST features reduce the equal error rate (EER) by up to 14.05% and 6.40% for same-corpora and blind VoxLingua107 evaluations, respectively.
Abstract
Commonly used features in spoken language identification (LID), such as mel-spectrogram or MFCC, lose high-frequency information due to windowing. The loss further increases for longer temporal contexts. To improve generalization of the low-resourced LID systems, we investigate an alternate feature representation, wavelet scattering transform (WST), that compensates for the shortcomings. To our knowledge, WST has not been explored earlier in LID tasks. We first optimize WST features for multiple South Asian LID corpora. We show that LID requires low octave resolution and frequency-scattering is not useful. Further, cross-corpora evaluations show that the optimal WST hyper-parameters depend on both train and test corpora. Hence, we develop fused ECAPA-TDNN based LID systems with different sets of WST hyper-parameters to improve generalization for unknown data. Compared to MFCC, EER is reduced by up to 14.05% and 6.40% for same-corpora and blind VoxLingua107 evaluations, respectively.
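Summary
Features commonly used in spoken language identification (LID), such as mel-spectrogram or MFCC, lose high-frequency information due to windowing, and the loss grows with longer temporal contexts. To improve the generalization of low-resourced LID systems, we investigate an alternate feature representation, the wavelet scattering transform (WST), which compensates for these shortcomings. To our knowledge, WST has not been explored in LID tasks before. We first optimize WST features for multiple South Asian LID corpora. We find that LID requires low octave resolution and that frequency-scattering is not useful. Cross-corpora evaluations further show that the optimal WST hyper-parameters depend on both the training and test corpora. We therefore develop fused ECAPA-TDNN based LID systems with different sets of WST hyper-parameters to improve generalization to unknown data. Compared with MFCC, EER is reduced by up to 14.05% and 6.40% for same-corpora and blind VoxLingua107 evaluations, respectively.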
A Task-oriented Dialog Model with Task-progressive and Policy-aware Pre-training
paper_authors: Lucen Zhong, Hengtong Lu, Caixia Yuan, Xiaojie Wang, Jiashen Sun, Ke Zeng, Guanglu Wan
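for: Better capturing the sequential nature of task-oriented dialog (TOD) tasks and learning dialog policy information.
methods: Pre-training with two policy-aware pre-training tasks: a global policy consistency task and an act-based contrastive learning task.
results: Better results on both the MultiWOZ and In-Car end-to-end dialog modeling benchmarks with only 18% of the parameters and 25% of the pre-training data of the previous state-of-the-art PCM, GALAXY.
Abstract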
Pre-trained conversation models (PCMs) have achieved promising progress in recent years. However, existing PCMs for Task-oriented dialog (TOD) are insufficient for capturing the sequential nature of the TOD-related tasks, as well as for learning dialog policy information. To alleviate these problems, this paper proposes a task-progressive PCM with two policy-aware pre-training tasks. The model is pre-trained through three stages where TOD-related tasks are progressively employed according to the task logic of the TOD system. A global policy consistency task is designed to capture the multi-turn dialog policy sequential relation, and an act-based contrastive learning task is designed to capture similarities among samples with the same dialog policy. Our model achieves better results on both MultiWOZ and In-Car end-to-end dialog modeling benchmarks with only 18% parameters and 25% pre-training data compared to the previous state-of-the-art PCM, GALAXY.
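Summary
Pre-trained conversation models (PCMs) have made promising progress in recent years. However, existing PCMs for task-oriented dialog (TOD) are insufficient for capturing the sequential nature of TOD-related tasks and for learning dialog policy information. To address these problems, this paper proposes a task-progressive PCM with two policy-aware pre-training tasks. The model is pre-trained in three stages, in which TOD-related tasks are introduced progressively according to the task logic of the TOD system. A global policy consistency task captures the sequential relation of multi-turn dialog policy, and an act-based contrastive learning task captures similarities among samples with the same dialog policy. Our model achieves better results than the previous state-of-the-art PCM, GALAXY, on both the MultiWOZ and In-Car end-to-end dialog modeling benchmarks, with only 18% of the parameters and 25% of the pre-training data.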
Nine-year-old children outperformed ChatGPT in emotion: Evidence from Chinese writing
results: Nine-year-old children outperformed ChatGPT in fluency and cohesion, while ChatGPT was superior in accuracy. Children showed higher complexity in science-themed writing, while ChatGPT prevailed in nature-themed writing. Most importantly, the study finds that nine-year-old children convey stronger emotions than ChatGPT in their Chinese compositions.
Abstract
ChatGPT has been demonstrated to possess significant capabilities in generating intricate, human-like text, and recent studies have established that its performance in theory of mind tasks is comparable to that of a nine-year-old child. However, it remains uncertain whether ChatGPT surpasses nine-year-old children in Chinese writing proficiency. To explore this, our study juxtaposed the Chinese writing performance of ChatGPT and nine-year-old children on both narrative and scientific topics, aiming to uncover the relative strengths and weaknesses of ChatGPT in writing. The collected data were analyzed across five linguistic dimensions: fluency, accuracy, complexity, cohesion, and emotion. Each dimension underwent assessment through precise indices. The findings revealed that nine-year-old children excelled beyond ChatGPT in terms of fluency and cohesion within their writing. In contrast, ChatGPT manifested a superior performance in accuracy compared to the children. Concerning complexity, children exhibited superior skills in science-themed writing, while ChatGPT prevailed in nature-themed writing. Significantly, this research is pioneering in revealing that nine-year-old children convey stronger emotions than ChatGPT in their Chinese compositions.
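Summary
ChatGPT has strong text generation capabilities, and studies indicate that its performance on theory of mind tasks is comparable to that of a nine-year-old child. Whether ChatGPT surpasses nine-year-old children in Chinese writing, however, remains uncertain. To answer this question, our study compares the Chinese writing of ChatGPT and nine-year-old children on narrative and scientific topics. We assess both groups along five linguistic dimensions: fluency, accuracy, complexity, cohesion, and emotion. We find that nine-year-old children write with greater fluency and cohesion than ChatGPT, while ChatGPT is more accurate. For complexity, children show greater skill on science-themed writing, while ChatGPT does better on nature-themed writing. Most importantly, the study finds that nine-year-old children express stronger emotions than ChatGPT in their Chinese compositions.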
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
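results: Experiments show that LLMs trained with the method converge faster and perform better than models trained with existing methods. The method also requires no additional engineering effort, making it a practical solution for LLM pretraining.
Abstract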
The evolving sophistication and intricacies of Large Language Models (LLMs) yield unprecedented advancements, yet they simultaneously demand considerable computational resources and incur significant costs. To alleviate these challenges, this paper introduces a novel, simple, and effective method named "GrowLength" to accelerate the pretraining process of LLMs. Our method progressively increases the training length throughout the pretraining phase, thereby mitigating computational costs and enhancing efficiency. For instance, it begins with a sequence length of 128 and progressively extends to 4096. This approach enables models to process a larger number of tokens within limited time frames, potentially boosting their performance. In other words, the efficiency gain is derived from training with shorter sequences optimizing the utilization of resources. Our extensive experiments with various state-of-the-art LLMs have revealed that models trained using our method not only converge more swiftly but also exhibit superior performance metrics compared to those trained with existing methods. Furthermore, our method for LLMs pretraining acceleration does not require any additional engineering efforts, making it a practical solution in the realm of LLMs.
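Summary
The growing sophistication and complexity of Large Language Models (LLMs) bring unprecedented advances, but they also demand considerable computational resources and incur significant costs. To alleviate these challenges, this paper introduces a novel, simple, and effective method named GrowLength to accelerate the pretraining of LLMs. The method progressively increases the training length throughout the pretraining phase, reducing computational cost and improving efficiency. For example, it starts with a sequence length of 128 and progressively extends to 4096. This lets models process more tokens within a limited time, potentially improving their performance. In other words, the efficiency gain comes from training with shorter sequences, which makes better use of resources. Our extensive experiments with various state-of-the-art LLMs show that models trained with our method not only converge faster but also achieve better performance metrics than models trained with existing methods. Moreover, the method requires no additional engineering effort, making it a practical solution for accelerating LLM pretraining.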
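The progressive schedule can be sketched as a function from training step to sequence length. Only the endpoints 128 and 4096 come from the abstract; the intermediate lengths (doubling at each stage), the equal-sized stages, and the function name `sequence_length_at` are assumptions made here for illustration, and the paper's actual schedule may differ.

```python
from typing import Sequence

# Assumed stages: doubling from 128 up to 4096 (only the endpoints are from the abstract).
DEFAULT_LENGTHS = (128, 256, 512, 1024, 2048, 4096)


def sequence_length_at(
    step: int, total_steps: int, lengths: Sequence[int] = DEFAULT_LENGTHS
) -> int:
    """Return the training sequence length for `step`, splitting training into equal stages."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    stage = step * len(lengths) // total_steps
    return lengths[stage]


# Print the length in effect at a few points of a hypothetical 600-step run.
for step in (0, 100, 300, 599):
    print(step, sequence_length_at(step, total_steps=600))
```

A training loop would call this once per step (or per stage boundary) and truncate or pack batches to the returned length, so early steps are cheap and later steps see the full context.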
Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial Persian Part of Speech Tagging
for: This paper is written for those interested in natural language processing and POS tagging in Persian, specifically for colloquial text in social network analysis.
methods: The paper introduces a novel corpus called “Colloquial Persian POS” (CPPOS), which includes formal and informal text collected from various social media platforms such as Telegram, Twitter, and Instagram. The corpus was manually annotated and verified by a team of linguistic experts, and a POS tagging guideline was defined for annotating the data.
results: The paper evaluates the quality of CPPOS by training various deep learning models, such as the RNN family, on the constructed corpus. The results show that the model trained on CPPOS outperforms other existing Persian POS corpora and tools, achieving a 14% improvement over the previous dataset.
Abstract
Introduction: Part-of-Speech (POS) Tagging, the process of classifying words into their respective parts of speech (e.g., verb or noun), is essential in various natural language processing applications. POS tagging is a crucial preprocessing task for applications like machine translation, question answering, sentiment analysis, etc. However, existing corpora for POS tagging in Persian mainly consist of formal texts, such as daily news and newspapers. As a result, smart POS tools, machine learning models, and deep learning models trained on these corpora may not perform optimally for processing colloquial text in social network analysis. Method: This paper introduces a novel corpus, "Colloquial Persian POS" (CPPOS), specifically designed to support colloquial Persian text. The corpus includes formal and informal text collected from various domains such as political, social, and commercial on Telegram, Twitter, and Instagram, totaling more than 520K labeled tokens. After collecting posts from these social platforms for one year, special preprocessing steps were conducted, including normalization, sentence tokenizing, and word tokenizing for social text. The tokens and sentences were then manually annotated and verified by a team of linguistic experts. This study also defines a POS tagging guideline for annotating the data and conducting the annotation process. Results: To evaluate the quality of CPPOS, various deep learning models, such as the RNN family, were trained using the constructed corpus. A comparison with another well-known Persian POS corpus named "Bijankhan" and the Persian Hazm POS tool trained on Bijankhan revealed that our model trained on CPPOS outperforms them. With the new corpus and the BiLSTM deep neural model, we achieved a 14% improvement over the previous dataset.
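Summary
Introduction: Part-of-Speech (POS) tagging, the process of classifying words into their parts of speech (e.g., verb or noun), is an important preprocessing task in natural language processing applications such as machine translation, question answering, and sentiment analysis. However, existing Persian POS corpora consist mainly of formal text, such as daily news and newspapers. As a result, smart POS tools, machine learning models, and deep learning models trained on them may not perform optimally on colloquial text in social network analysis. Method: This paper introduces a new corpus, "Colloquial Persian POS" (CPPOS), designed to support colloquial Persian text. The corpus contains formal and informal text from domains such as politics, society, and commerce, collected from Telegram, Twitter, and Instagram, with more than 520K labeled tokens. After collecting one year of social media posts, we applied special preprocessing steps, including normalization, sentence tokenizing, and word tokenizing. The tokens and sentences were then manually annotated and verified by a team of linguistic experts. The study also defines a POS tagging guideline for annotating the data and carrying out the annotation process. Results: To evaluate the quality of CPPOS, we trained various deep learning models, such as the RNN family, on the constructed corpus. Compared with the well-known Persian POS corpus "Bijankhan" and the Persian Hazm POS tool trained on Bijankhan, our model trained on CPPOS performs better, with a 14% improvement over the previous dataset.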