results: KA2G is evaluated on the speech-based single-turn SLURP dataset and on a multi-turn dataset extracted from a commercial ToD system, and shows strong and consistent gains over prior work, especially in few-shot and zero-shot setups.
Abstract
Manually annotating fine-grained slot-value labels for task-oriented dialogue (ToD) systems is an expensive and time-consuming endeavour. This motivates research into slot-filling methods that operate with limited amounts of labelled data. Moreover, the majority of current work on ToD is based solely on text as the input modality, neglecting the additional challenges of imperfect automatic speech recognition (ASR) when working with spoken language. In this work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input. KA2G achieves robust and data-efficient slot filling for speech-based ToD by 1) framing it as a text generation task, 2) grounding text generation additionally in the audio modality, and 3) conditioning on available external knowledge (e.g. a predefined list of possible slot values). We show that combining both modalities within the KA2G framework improves the robustness against ASR errors. Further, the knowledge-aware slot-value generator in KA2G, implemented via a pointer generator mechanism, particularly benefits few-shot and zero-shot learning. Experiments, conducted on the standard speech-based single-turn SLURP dataset and a multi-turn dataset extracted from a commercial ToD system, display strong and consistent gains over prior work, especially in few-shot and zero-shot setups.
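The abstract attributes KA2G's few-shot and zero-shot gains to a knowledge-aware slot-value generator built on a pointer-generator mechanism, which mixes ordinary vocabulary generation with copying from a predefined list of candidate slot values. The PyTorch snippet below is only a minimal sketch of such a mixture under assumed tensor shapes, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(dec_state, vocab_logits, cand_token_ids, cand_encodings, p_gen_layer):
    """One decoding step of a pointer-generator (illustrative sketch).

    dec_state:      (B, H)    decoder hidden state at this step
    vocab_logits:   (B, V)    logits over the output vocabulary
    cand_token_ids: (B, C)    vocabulary ids of candidate slot-value tokens (external knowledge)
    cand_encodings: (B, C, H) encodings of those candidate tokens
    p_gen_layer:    nn.Linear(H, 1) producing the generate-vs-copy gate
    """
    # Attention of the decoder state over the candidate (knowledge) tokens.
    copy_scores = torch.einsum("bh,bch->bc", dec_state, cand_encodings)
    copy_dist = F.softmax(copy_scores, dim=-1)                      # (B, C)

    # Soft switch between generating from the vocabulary and copying a candidate.
    p_gen = torch.sigmoid(p_gen_layer(dec_state))                   # (B, 1)

    vocab_dist = F.softmax(vocab_logits, dim=-1)                    # (B, V)
    final_dist = p_gen * vocab_dist
    # Scatter the copy probabilities back onto the vocabulary ids of the candidates.
    final_dist = final_dist.scatter_add(1, cand_token_ids, (1.0 - p_gen) * copy_dist)
    return final_dist

# Toy shapes: batch of 2, hidden 8, vocab 100, 5 candidate slot-value tokens.
B, H, V, C = 2, 8, 100, 5
dist = pointer_generator_step(torch.randn(B, H), torch.randn(B, V),
                              torch.randint(0, V, (B, C)), torch.randn(B, C, H),
                              torch.nn.Linear(H, 1))
assert torch.allclose(dist.sum(-1), torch.ones(B))
```

Copying gives the decoder a direct path to rare or unseen slot values supplied as external knowledge, which is what makes the few-shot and zero-shot settings tractable.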
Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework
results: Applied in the ASR domain, the framework improves emission time by up to 570 ms in latency optimization with only a minor reduction in WER, and separately achieves a relative WER improvement of 4.5% over the baseline models.
Abstract
Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.
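The framework keeps the standard CTC objective and adds a separate term that prioritizes alignments according to the desired property. The paper's exact loss is not reproduced here; the PyTorch sketch below only illustrates the plug-and-play pattern with a hypothetical pairwise hinge term between two explicit alignments of the same target, where `align_good` is assumed to have the more desirable property (for example, earlier emission).

```python
import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def alignment_log_prob(log_probs, alignment):
    """Log-probability of one explicit alignment (path) under the frame posteriors.
    log_probs: (T, B, V) log-softmax outputs; alignment: (B, T) label index per frame."""
    idx = alignment.transpose(0, 1).unsqueeze(-1)             # (T, B, 1)
    return log_probs.gather(-1, idx).squeeze(-1).sum(dim=0)   # (B,)

def align_with_purpose_loss(log_probs, targets, in_lens, tgt_lens,
                            align_good, align_bad, margin=1.0, lam=0.1):
    """CTC loss plus a hinge term preferring the alignment with the better property.

    align_good / align_bad: two frame-level alignments of the same target sequence,
    where align_good is the one with the more desirable property (assumption).
    """
    base = ctc_loss(log_probs, targets, in_lens, tgt_lens)
    lp_good = alignment_log_prob(log_probs, align_good)
    lp_bad = alignment_log_prob(log_probs, align_bad)
    rank = F.relu(margin - (lp_good - lp_bad)).mean()
    return base + lam * rank
```

Because the extra term only ranks alignments, it can differentiate between perfect and imperfect paths without any change to the CTC loss itself.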
Dipping PLMs Sauce: Bridging Structure and Text for Effective Knowledge Graph Completion via Conditional Soft Prompting
methods: This work proposes CSProm-KG (Conditional Soft Prompts for KGC), which tunes only the parameters of conditional soft prompts generated from the entity and relation representations.
results: On three popular static KGC benchmarks (WN18RR, FB15K-237 and Wikidata5M) and two temporal KGC benchmarks (ICEWS14 and ICEWS05-15), CSProm-KG performs strongly and outperforms competitive baseline models. Further analysis demonstrates the effectiveness of the proposed components, the efficiency of CSProm-KG, and its flexibility.
Abstract
Knowledge Graph Completion (KGC) often requires both KG structural and textual information to be effective. Pre-trained Language Models (PLMs) have been used to learn the textual information, usually under the fine-tune paradigm for the KGC task. However, the fine-tuned PLMs often overwhelmingly focus on the textual information and overlook structural knowledge. To tackle this issue, this paper proposes CSProm-KG (Conditional Soft Prompts for KGC) which maintains a balance between structural information and textual knowledge. CSProm-KG only tunes the parameters of Conditional Soft Prompts that are generated by the entities and relations representations. We verify the effectiveness of CSProm-KG on three popular static KGC benchmarks WN18RR, FB15K-237 and Wikidata5M, and two temporal KGC benchmarks ICEWS14 and ICEWS05-15. CSProm-KG outperforms competitive baseline models and sets new state-of-the-art on these benchmarks. We conduct further analysis to show (i) the effectiveness of our proposed components, (ii) the efficiency of CSProm-KG, and (iii) the flexibility of CSProm-KG.
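CSProm-KG keeps the PLM frozen and tunes only soft prompts that are generated from the entity and relation representations of each query. The module below is a simplified sketch of that conditioning step; the prompt length, the generator architecture, and the interaction with any structural encoder are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ConditionalSoftPrompt(nn.Module):
    """Generate a length-L soft prompt from (entity, relation) embeddings."""
    def __init__(self, num_entities, num_relations, kg_dim, plm_dim, prompt_len=10):
        super().__init__()
        self.prompt_len = prompt_len
        self.ent_emb = nn.Embedding(num_entities, kg_dim)
        self.rel_emb = nn.Embedding(num_relations, kg_dim)
        self.generator = nn.Sequential(
            nn.Linear(2 * kg_dim, plm_dim),
            nn.GELU(),
            nn.Linear(plm_dim, prompt_len * plm_dim),
        )

    def forward(self, head_ids, rel_ids, token_embeds):
        """token_embeds: (B, T, plm_dim) frozen input embeddings of the textual query.
        Returns (B, prompt_len + T, plm_dim), to be fed to the frozen PLM encoder."""
        cond = torch.cat([self.ent_emb(head_ids), self.rel_emb(rel_ids)], dim=-1)
        prompt = self.generator(cond).view(-1, self.prompt_len, token_embeds.size(-1))
        return torch.cat([prompt, token_embeds], dim=1)

# Only the prompt module's parameters are optimized; the PLM itself stays frozen, e.g.:
# optimizer = torch.optim.AdamW(prompt_module.parameters(), lr=1e-4)
```

Because the prompt is a function of the structural embeddings, structural knowledge enters the PLM without fine-tuning its weights, which is the balance the abstract describes.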
Racial Bias Trends in the Text of US Legal Opinions
results: The study finds strong evidence of racial bias in the language of US judicial opinions: traditionally Black names are more closely associated with pre-classified "unpleasant" terms, whereas traditionally White names are more closely associated with pre-classified "pleasant" terms. It finds no evidence of elevated implicit racial bias in legal opinions before 1950, nor did legal opinions from Northeastern states show greater change in racial bias over time compared to Southern states.
Abstract
Although there is widespread recognition of racial bias in US law, it is unclear how such bias appears in the language of law, namely judicial opinions, and whether it varies across time period or region. Building upon approaches for measuring implicit racial bias in large-scale corpora, we approximate GloVe word embeddings for over 6 million US federal and state court cases from 1860 to 2009. We find strong evidence of racial bias across nearly all regions and time periods, as traditionally Black names are more closely associated with pre-classified "unpleasant" terms whereas traditionally White names are more closely associated with pre-classified "pleasant" terms. We also test whether legal opinions before 1950 exhibit more implicit racial bias than those after 1950, as well as whether opinions from Southern states exhibit less change in racial bias than those from Northeastern states. We do not find evidence of elevated bias in legal opinions before 1950, or evidence that legal opinions from Northeastern states show greater change in racial bias over time compared to Southern states. These results motivate further research into institutionalized racial bias.
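The measurement approach follows embedding-association tests: a name is scored by how much closer it sits to pre-classified "pleasant" than "unpleasant" attribute words in GloVe embeddings trained on the court opinions. The snippet below sketches such an association score in the spirit of WEAT; the paper's exact test statistic and word lists may differ, and the toy vectors stand in for corpus-trained embeddings.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word, pleasant, unpleasant, emb):
    """Mean cosine similarity to the pleasant set minus the unpleasant set.
    emb: dict mapping word -> embedding vector (e.g., GloVe trained on the corpus)."""
    s_p = np.mean([cosine(emb[word], emb[a]) for a in pleasant])
    s_u = np.mean([cosine(emb[word], emb[a]) for a in unpleasant])
    return s_p - s_u

def bias_effect(black_names, white_names, pleasant, unpleasant, emb):
    """Positive values indicate White names lean 'pleasant' relative to Black names."""
    a_black = [association(w, pleasant, unpleasant, emb) for w in black_names]
    a_white = [association(w, pleasant, unpleasant, emb) for w in white_names]
    return np.mean(a_white) - np.mean(a_black)

# Toy usage with random vectors standing in for embeddings trained on the opinions:
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["emily", "lakisha", "joy", "agony"]}
print(bias_effect(["lakisha"], ["emily"], ["joy"], ["agony"], emb))
```

Computing this effect separately per region and per decade is what allows the temporal and regional comparisons reported above.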
Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation
methods: We fine-tune large pre-trained language models on different hate speech detection datasets and carry out a large-scale cross-dataset comparison to examine how suitable different datasets are as training data for hate speech detection models.
results: The experiments show that datasets differ in how well they generalise when used as training data, with some being applicable across a wider range of settings than others. Moreover, combining different hate speech detection datasets yields more robust detection models, and this robustness holds even when controlling for data size and comparing against the best individual datasets.
Abstract
The automatic detection of hate speech online is an active research area in NLP. Most of the studies to date are based on social media datasets that contribute to the creation of hate speech detection models trained on them. However, data creation processes contain their own biases, and models inherently learn from these dataset-specific biases. In this paper, we perform a large-scale cross-dataset comparison where we fine-tune language models on different hate speech detection datasets. This analysis shows how some datasets are more generalisable than others when used as training data. Crucially, our experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models. This robustness holds even when controlling by data size and compared with the best individual datasets.
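The core experimental protocol is a train-on-one, evaluate-on-another matrix across hate speech corpora, plus a combined-data condition controlled for size. The paper fine-tunes pre-trained language models; in the sketch below a TF-IDF plus logistic-regression classifier stands in purely to make the cross-dataset protocol concrete, and the tiny dataset dictionary is a placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Placeholder corpora: {name: (train_texts, train_labels, test_texts, test_labels)}
datasets = {
    "corpusA": (["you are awful", "have a nice day"], [1, 0],
                ["what a hateful rant", "good morning"], [1, 0]),
    "corpusB": (["lovely weather", "awful people like you"], [0, 1],
                ["thanks a lot", "you are the worst"], [0, 1]),
}

def train(texts, labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

# Cross-dataset matrix: rows = training corpus, columns = evaluation corpus.
for src, (xtr, ytr, _, _) in datasets.items():
    model = train(xtr, ytr)
    for tgt, (_, _, xte, yte) in datasets.items():
        pred = model.predict(xte)
        print(f"train={src} eval={tgt} macro-F1={f1_score(yte, pred, average='macro'):.2f}")

# Combined-data condition: pool every training split (optionally subsampled to a fixed
# size to control for data volume) and train a single model on the union.
all_x = sum((d[0] for d in datasets.values()), [])
all_y = sum((d[1] for d in datasets.values()), [])
combined = train(all_x, all_y)
```

The off-diagonal cells of the matrix are what reveal which training corpora generalise, and the pooled model is the "combination" condition whose robustness the abstract reports.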
Disentanglement in a GAN for Unconditional Speech Synthesis
results: On the small-vocabulary Google Speech Commands digits dataset, ASGAN achieves state-of-the-art results in unconditional speech synthesis and is substantially faster than existing top-performing diffusion models. We also show that ASGAN's latent space is disentangled: simple linear operations in that space can be used to perform several tasks unseen during training.
Abstract
Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/
results: On the Norwegian Parliamentary Speech Corpus (NPSC), the word error rate (WER) is reduced from 17.10% to 7.60%, with the best models reaching 5.81% for Bokmål and 11.54% for Nynorsk. The paper also discusses the challenges of, and potential solutions for, further improving ASR models for Norwegian.
Abstract
In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10\% to 7.60\%, with models achieving 5.81\% for Bokm{\aa}l and 11.54\% for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.
Unified Conversational Models with System-Initiated Transitions between Chit-Chat and Task-Oriented Dialogues
results: The study shows that the continuous prompt model can effectively trigger system-initiated transitions and can also be used to guide proactive transitions between particular domains in a multi-domain task-oriented setting.
Abstract
Spoken dialogue systems (SDSs) have been separately developed under two different categories, task-oriented and chit-chat. The former focuses on achieving functional goals and the latter aims at creating engaging social conversations without special goals. Creating a unified conversational model that can engage in both chit-chat and task-oriented dialogue is a promising research topic in recent years. However, the potential ``initiative'' that occurs when there is a change between dialogue modes in one dialogue has rarely been explored. In this work, we investigate two kinds of dialogue scenarios, one starts from chit-chat implicitly involving task-related topics and finally switching to task-oriented requests; the other starts from task-oriented interaction and eventually changes to casual chat after all requested information is provided. We contribute two efficient prompt models which can proactively generate a transition sentence to trigger system-initiated transitions in a unified dialogue model. One is a discrete prompt model trained with two discrete tokens, the other one is a continuous prompt model using continuous prompt embeddings automatically generated by a classifier. We furthermore show that the continuous prompt model can also be used to guide the proactive transitions between particular domains in a multi-domain task-oriented setting.
Chain of Thought Prompting Elicits Knowledge Augmentation
for: This paper proposes a Chain-of-Thought (CoT) based approach to knowledge-augmented deep learning (KADL).
methods: The method leverages an extensively pre-trained large language model as a comprehensive source of external knowledge and integrates that knowledge into the deep learning model via chain-of-thought prompting.
results: Across eleven publicly available benchmarks covering a variety of reasoning tasks, CoT-KA outperforms both pure CoT-based methods and the non-augmented method on the majority of them.
Abstract
The knowledge-augmented deep learning paradigm refers to a paradigm in which domain knowledge is identified and integrated into deep models. Conventional methods typically employ task-specific approaches to gather external knowledge from various sources. In contrast, large language models are extensively pre-trained and can serve as a comprehensive source of external knowledge. In this paper, we propose CoT-KA, a Chain-of-Thought-based method that augments knowledge for deep learning. CoT-KA avoids the need for additional knowledge retrieval or knowledge reasoning models, as required in conventional augmentation methods. Our results demonstrate that CoT-KA outperforms both pure CoT-based methods and the non-augmented method across the majority of eleven publicly available benchmarks for various reasoning tasks.
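CoT-KA treats chains of thought sampled from a large language model as the external knowledge itself: the generated rationales are appended to the original input before it is passed to the downstream model. The sketch below shows only that augmentation step; `generate_rationales` is a hypothetical stand-in for the LLM call, and the prompt and concatenation format are assumptions.

```python
from typing import Callable, List

def augment_with_cot(question: str,
                     generate_rationales: Callable[[str, int], List[str]],
                     n_rationales: int = 3) -> str:
    """Build the knowledge-augmented input for a downstream model.

    generate_rationales(prompt, n) is assumed to return n chain-of-thought strings
    sampled from a large language model (e.g., via few-shot CoT prompting).
    """
    cots = generate_rationales(f"Q: {question}\nA: Let's think step by step.", n_rationales)
    knowledge = " ".join(f"[CoT{i + 1}] {c}" for i, c in enumerate(cots))
    # The downstream model (e.g., a fine-tuned encoder) sees the question plus rationales.
    return f"{question} {knowledge}"

# Toy usage with a dummy generator standing in for the LLM:
dummy = lambda prompt, n: [f"step-by-step reasoning {i}" for i in range(n)]
print(augment_with_cot("Is 17 a prime number?", dummy))
```

Because the knowledge arrives as generated text, no separate retrieval or knowledge-reasoning module is required, which is the point the abstract emphasizes.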
A Language Model for Grammatical Error Correction in L2 Russian
paper_authors: Nikita Remnev, Sergei Obiedkov, Ekaterina Rakhilina, Ivan Smirnov, Anastasia Vyrenkova
for: correction of non-native (L2) writing errors in Russian
methods: use of a language model trained on untagged texts of the Newspaper subcorpus of the Russian National Corpus
results: validation of the model's quality against the RULEC-GEC corpus
Abstract
Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In this paper, we propose a pipeline involving a language model intended for correcting errors in L2 Russian writing. The language model proposed is trained on untagged texts of the Newspaper subcorpus of the Russian National Corpus, and the quality of the model is validated against the RULEC-GEC corpus.
Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation
results: Experiments on two datasets show that the approach effectively mitigates repetition while maintaining fluency. In addition, the analysis finds that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.
Abstract
Despite the huge progress in myriad generation tasks, pretrained language models (LMs) such as GPT2 still tend to generate repetitive texts with maximization-based decoding algorithms for open-ended generation. We attribute their overestimation of token-level repetition probabilities to the learning bias: LMs capture simple repetitive patterns faster with the MLE loss. We propose self-contrastive training to penalize the output of a premature checkpoint of the same model when it incorrectly predicts repetition, which is shown to mitigate repetition effectively while maintaining fluency on two datasets. Furthermore, we find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.
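Self-contrastive training keeps the usual MLE objective and adds a penalty tied to a frozen, premature checkpoint of the same model. The exact formulation is not reproduced here; the sketch below is one hedged reading of the idea, in which the current model is pushed away from tokens that the premature checkpoint wrongly prefers (typically repetitions of the context).

```python
import torch
import torch.nn.functional as F

def self_contrastive_loss(logits, premature_logits, targets, alpha=0.5):
    """MLE loss plus a contrastive term against a frozen premature checkpoint (sketch).

    logits, premature_logits: (B, T, V); targets: (B, T) gold token ids.
    Where the premature checkpoint's top-1 prediction is wrong, the current model is
    additionally discouraged from that wrongly-preferred token. This is one possible
    reading of the method, not the paper's exact loss.
    """
    V = logits.size(-1)
    mle = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))

    with torch.no_grad():
        premature_pred = premature_logits.argmax(dim=-1)        # (B, T)
        wrong = (premature_pred != targets).float()             # penalize only mistakes

    log_probs = F.log_softmax(logits, dim=-1)
    # Probability the current model assigns to the premature model's wrong choice.
    p_wrong = log_probs.gather(-1, premature_pred.unsqueeze(-1)).squeeze(-1)
    penalty = (wrong * p_wrong).sum() / wrong.sum().clamp(min=1.0)
    return mle + alpha * penalty
```

Intuitively, the premature checkpoint has already absorbed the "easy" repetitive patterns, so contrasting against it targets exactly the overestimated repetition probabilities.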
On Evaluating and Mitigating Gender Biases in Multilingual Settings
results: The work highlights the challenges that arise when studying social biases in multilingual settings and provides resources as well as mitigation techniques as a step toward scaling to more languages.
Abstract
While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research work has primarily been limited to English. In this work, we investigate some of the challenges with evaluating and mitigating biases in multilingual settings which stem from a lack of existing benchmarks and resources for bias evaluation beyond English especially for non-western context. In this paper, we first create a benchmark for evaluating gender biases in pre-trained masked language models by extending DisCo to different Indian languages using human annotations. We extend various debiasing methods to work beyond English and evaluate their effectiveness for SOTA massively multilingual models on our proposed metric. Overall, our work highlights the challenges that arise while studying social biases in multilingual settings and provides resources as well as mitigation techniques to take a step toward scaling to more languages.
SCAT: Robust Self-supervised Contrastive Learning via Adversarial Training for Text Classification
results: SCAT can train robust language models from scratch without labelled data and can also significantly improve the robustness of existing pre-trained language models.
Abstract
Despite their promising performance across various natural language processing (NLP) tasks, current NLP systems are vulnerable to textual adversarial attacks. To defend against these attacks, most existing methods apply adversarial training by incorporating adversarial examples. However, these methods have to rely on ground-truth labels to generate adversarial examples, rendering it impractical for large-scale model pre-training which is commonly used nowadays for NLP and many other tasks. In this paper, we propose a novel learning framework called SCAT (Self-supervised Contrastive Learning via Adversarial Training), which can learn robust representations without requiring labeled data. Specifically, SCAT modifies random augmentations of the data in a fully labelfree manner to generate adversarial examples. Adversarial training is achieved by minimizing the contrastive loss between the augmentations and their adversarial counterparts. We evaluate SCAT on two text classification datasets using two state-of-the-art attack schemes proposed recently. Our results show that SCAT can not only train robust language models from scratch, but it can also significantly improve the robustness of existing pre-trained language models. Moreover, to demonstrate its flexibility, we show that SCAT can also be combined with supervised adversarial training to further enhance model robustness.
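SCAT pairs each label-free augmentation with an adversarial counterpart and minimizes a contrastive loss between the two. The function below shows a standard NT-Xent-style loss over such pairs; how the adversarial counterparts are actually constructed is omitted, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(z_aug, z_adv, temperature=0.1):
    """NT-Xent-style loss pairing each augmented view with its adversarial view.

    z_aug, z_adv: (B, D) sentence representations of the clean augmentation and of its
    adversarial counterpart. Row i of z_aug should match row i of z_adv; all other rows
    in the batch act as negatives.
    """
    z_aug = F.normalize(z_aug, dim=-1)
    z_adv = F.normalize(z_adv, dim=-1)
    logits = z_aug @ z_adv.t() / temperature          # (B, B) scaled cosine similarities
    labels = torch.arange(z_aug.size(0), device=z_aug.device)
    # Symmetric cross-entropy: aug -> adv and adv -> aug.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy usage with random features standing in for encoder outputs:
loss = paired_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Because both views are produced without labels, the objective can be used during large-scale pre-training, which is what distinguishes SCAT from label-dependent adversarial training.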
CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care
paper_authors: Tong Xiang, Liangzhi Li, Wangyue Li, Mingbai Bai, Lu Wei, Bowen Wang, Noa Garcia
for: This paper evaluates the misinformation generated by large language models (LLMs) on the sensitive topic of maternity and infant care, and provides a benchmark for assessing the quality of long-form generation in Chinese.
methods: The paper introduces a new benchmark, named CARE-MI, to evaluate LLM misinformation in the maternity and infant care domain, and compares potential solutions for long-form generation evaluation.
results: The paper finds that current Chinese LLMs are still far from perfect in this domain, and proposes a judgment model for automatically assessing the long-form output of LLMs using the benchmark questions.
Abstract
The recent advances in NLP, have led to a new trend of applying LLMs to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem by unintentionally generating factually false statements. This can lead to harmful consequences, especially when produced within sensitive contexts, such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has been mostly conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building long-form generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resourced languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied with human-selected references. Using our benchmark, we conduct extensive experiments and found that current Chinese LLMs are far from perfect in the topic of maternity and infant care. In an effort to minimize the reliance on human resources for performance evaluation, we offer a judgment model for automatically assessing the long-form output of LLMs using the benchmark questions. Moreover, we compare potential solutions for long-form generation evaluation and provide insights for building more robust and efficient automated metric.
Diverse Retrieval-Augmented In-Context Learning for Dialogue State Tracking
results: Evaluated on MultiWOZ, the approach achieves state-of-the-art multi-domain joint-goal accuracy in zero-shot and few-shot settings.
Abstract
There has been significant interest in zero and few-shot learning for dialogue state tracking (DST) due to the high cost of collecting and annotating task-oriented dialogues. Recent work has demonstrated that in-context learning requires very little data and zero parameter updates, and even outperforms trained methods in the few-shot setting (Hu et al. 2022). We propose RefPyDST, which advances the state of the art with three advancements to in-context learning for DST. First, we formulate DST as a Python programming task, explicitly modeling language coreference as variable reference in Python. Second, since in-context learning depends highly on the context examples, we propose a method to retrieve a diverse set of relevant examples to improve performance. Finally, we introduce a novel re-weighting method during decoding that takes into account probabilities of competing surface forms, and produces a more accurate dialogue state prediction. We evaluate our approach using MultiWOZ and achieve state-of-the-art multi-domain joint-goal accuracy in zero and few-shot settings.
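RefPyDST frames dialogue state tracking as writing Python, so coreference becomes ordinary variable reference and retrieved dialogues become in-context examples. The snippet below is only an illustration of that prompt layout; the actual prompt format, retrieval scoring, and decoding re-weighting in the paper are more involved, and the class names in the example state are hypothetical.

```python
def build_prompt(retrieved_examples, dialogue_context):
    """Assemble an in-context prompt in which each dialogue state is Python code.

    retrieved_examples: list of (context, state_code) pairs chosen to be diverse and
    relevant to the current turn; dialogue_context: the current turn's text.
    """
    parts = []
    for ctx, state_code in retrieved_examples:
        parts.append(f"# Dialogue:\n# {ctx}\n{state_code}\n")
    parts.append(f"# Dialogue:\n# {dialogue_context}\n")
    return "\n".join(parts)

example_state = (
    "hotel = Hotel(area='centre', stars=4)\n"
    "taxi = Taxi(destination=hotel.name)  # coreference expressed as variable reference"
)
prompt = build_prompt(
    [("I need a 4-star hotel in the centre, and a taxi to it.", example_state)],
    "Book me a cheap restaurant near the hotel.",
)
print(prompt)
```

The LLM is then asked to continue the prompt with the Python state for the new turn, and competing surface forms of the same value are merged during decoding.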
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision
results: Experiments show that ReactIE achieves substantial improvements and outperforms all existing baselines.
Abstract
Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. Consequently, the scarcity of sufficient training data poses an obstacle to the progress of related models in this domain. In this paper, we propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions. Additionally, we adopt synthetic data from patent records as distant supervision to incorporate domain knowledge into the model. Experiments demonstrate that ReactIE achieves substantial improvements and outperforms all existing baselines.
On Conditional and Compositional Language Model Differentiable Prompting
paper_authors: Jonathan Pilault, Can Liu, Mohit Bansal, Markus Dreyer
for: This work aims to improve the downstream-task performance of pre-trained language models (PLMs) by adapting them to different tasks through prompting.
methods: The work studies conditional and compositional differentiable prompting and proposes a new model, the Prompt Production System (PRopS), which transforms task instructions or input metadata into continuous prompts that elicit task-specific outputs from the PLM. PRopS uses a modular network structure based on a neural formulation of Production Systems, allowing it to learn discrete rules -- neural functions that specialize in transforming particular prompt input patterns -- which makes it suitable for compositional transfer learning and few-shot learning.
results: Compared with other PLM adaptation techniques, PRopS performs strongly on compositional generalization tasks, controllable summarization and multilingual translation, often improving on fully fine-tuned models while requiring fewer trainable parameters.
Abstract
Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks. Prompts can be represented by a human-engineered word sequence or by a learned continuous embedding. In this work, we investigate conditional and compositional differentiable prompting. We propose a new model, Prompt Production System (PRopS), which learns to transform task instructions or input metadata, into continuous prompts that elicit task-specific outputs from the PLM. Our model uses a modular network structure based on our neural formulation of Production Systems, which allows the model to learn discrete rules -- neural functions that learn to specialize in transforming particular prompt input patterns, making it suitable for compositional transfer learning and few-shot learning. We present extensive empirical and theoretical analysis and show that PRopS consistently surpasses other PLM adaptation techniques, and often improves upon fully fine-tuned models, on compositional generalization tasks, controllable summarization and multilingual translation, while needing fewer trainable parameters.
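PRopS maps a task instruction or input metadata to a continuous prompt through a bank of learned production rules: the instruction attends over rule keys, and the selected rules produce the prompt vectors fed to the frozen PLM. The module below is a simplified sketch of that pattern under assumed dimensions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PromptProduction(nn.Module):
    """Condition a continuous prompt on an instruction via attention over learned rules."""
    def __init__(self, instr_dim, plm_dim, num_rules=8, prompt_len=6):
        super().__init__()
        self.rule_keys = nn.Parameter(torch.randn(num_rules, instr_dim))
        # Each rule, when selected, contributes a full prompt of shape (prompt_len, plm_dim).
        self.rule_values = nn.Parameter(torch.randn(num_rules, prompt_len, plm_dim))
        self.query_proj = nn.Linear(instr_dim, instr_dim)

    def forward(self, instr_encoding):
        """instr_encoding: (B, instr_dim) pooled encoding of the task instruction/metadata.
        Returns (B, prompt_len, plm_dim) continuous prompts for the frozen PLM."""
        q = self.query_proj(instr_encoding)                       # (B, instr_dim)
        attn = torch.softmax(q @ self.rule_keys.t(), dim=-1)      # (B, num_rules)
        # Soft selection of rules; a sparse or discrete selection is also conceivable.
        return torch.einsum("br,rlp->blp", attn, self.rule_values)
```

Keeping the rules modular is what allows different instructions to reuse and recombine them, which is the compositional behaviour the abstract describes.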
Modeling Tag Prediction based on Question Tagging Behavior Analysis of CommunityQA Platform Users
results: Extensive experiments and the resulting performance demonstrate the effectiveness of the model.
Abstract
In community question-answering platforms, tags play essential roles in effective information organization and retrieval, better question routing, faster response to questions, and assessment of topic popularity. Hence, automatic assistance for predicting and suggesting tags for posts is of high utility to users of such platforms. To develop better tag prediction across diverse communities and domains, we performed a thorough analysis of users' tagging behavior in 17 StackExchange communities. We found various common inherent properties of this behavior in those diverse domains. We used the findings to develop a flexible neural tag prediction architecture, which predicts both popular tags and more granular tags for each question. Our extensive experiments and obtained performance show the effectiveness of our model
Multi-Task Learning Improves Performance In Deep Argument Mining Models
results: The study shows that different argument mining tasks share common semantic and logical structure, and that sharing representations and parameters across tasks further boosts performance.
Abstract
The successful analysis of argumentative techniques from user-generated text is central to many downstream tasks such as political and market analysis. Recent argument mining tools use state-of-the-art deep learning methods to extract and annotate argumentative techniques from various online text corpora, however each task is treated as separate and different bespoke models are fine-tuned for each dataset. We show that different argument mining tasks share common semantic and logical structure by implementing a multi-task approach to argument mining that achieves better performance than state-of-the-art methods for the same problems. Our model builds a shared representation of the input text that is common to all tasks and exploits similarities between tasks in order to further boost performance via parameter-sharing. Our results are important for argument mining as they show that different tasks share substantial similarities and suggest a holistic approach to the extraction of argumentative techniques from text.
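The multi-task setup shares a single text encoder across all argument mining tasks and attaches a lightweight head per task, so related tasks regularize each other through the shared parameters. The sketch below shows that generic pattern; the encoder, head sizes, and task names are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskArgumentMiner(nn.Module):
    """One shared encoder, one classification head per argument mining task."""
    def __init__(self, encoder, hidden_dim, task_num_labels):
        super().__init__()
        self.encoder = encoder                     # e.g., a transformer pooled to (B, H)
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, n) for task, n in task_num_labels.items()
        })

    def forward(self, inputs, task):
        shared = self.encoder(inputs)              # representation shared across tasks
        return self.heads[task](shared)

# Toy usage: a small feed-forward "encoder" stands in for a pre-trained transformer.
encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
model = MultiTaskArgumentMiner(encoder, 128,
                               {"component_type": 3, "persuasion_technique": 12})
logits = model(torch.randn(4, 300), task="component_type")   # shape (4, 3)
# Training typically alternates or mixes batches from the different tasks and sums the losses.
```

Only the heads are task-specific, so the bulk of the capacity is forced to learn structure common to all the argument mining tasks.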
ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis
paper_authors: Javier de la Rosa, Álvaro Pérez Pozo, Salvador Ros, Elena González-Blanco
for: This paper is written for the analysis of poetry in a multilingual setting, specifically to address the lack of tools for automatically analyzing and scanning poems.
methods: The paper presents a new approach called \textsc{Alberti}, which is a multilingual pre-trained large language model for poetry. The model is trained using domain-specific pre-training (DSP) on a corpus of over 12 million verses from 12 languages.
results: The paper reports that \textsc{Alberti} outperforms multilingual BERT and other transformers-based models of similar sizes on two structural poetry tasks: Spanish stanza type classification and metrical pattern prediction for Spanish, English, and German. Additionally, \textsc{Alberti} achieves state-of-the-art results for German when compared to rule-based systems.
Abstract
The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In a multilingual settings, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time consuming. In this work, we present \textsc{Alberti}, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification, and metrical pattern prediction for Spanish, English and German. In both cases, \textsc{Alberti} outperforms multilingual BERT and other transformers-based models of similar sizes, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.
Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation
results: Experiments on the MuST-C dataset show that the proposed left-context method provides a substantial speedup on the encoder forward pass while achieving nearly identical translation quality to the approach that uses both left context and memory banks.
Abstract
Simultaneous speech translation is an essential communication task difficult for humans whereby a translation is generated concurrently with oncoming speech inputs. For such a streaming task, transformers using block processing to break an input sequence into segments have achieved state-of-the-art performance at a reduced cost. Current methods to allow information to propagate across segments, including left context and memory banks, have faltered as they are both insufficient representations and unnecessarily expensive to compute. In this paper, we propose an Implicit Memory Transformer that implicitly retains memory through a new left context method, removing the need to explicitly represent memory with memory banks. We generate the left context from the attention output of the previous segment and include it in the keys and values of the current segment's attention calculation. Experiments on the MuST-C dataset show that the Implicit Memory Transformer provides a substantial speedup on the encoder forward pass with nearly identical translation quality when compared with the state-of-the-art approach that employs both left context and memory banks.
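The Implicit Memory Transformer drops explicit memory banks and instead reuses the previous segment's attention output as left context, concatenating it into the keys and values of the current segment while queries come only from the current segment. Below is a simplified single-head sketch of that attention step; head splitting, masking, and exactly which activations are cached are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeftContextAttention(nn.Module):
    """Single-head attention whose keys/values include cached left context."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, segment, left_context):
        """segment: (B, S, D) current segment; left_context: (B, L, D) attention output
        cached from the previous segment (treated as a constant, hence detached)."""
        extended = torch.cat([left_context.detach(), segment], dim=1)    # (B, L+S, D)
        q = self.q(segment)                        # queries come from the current segment only
        k, v = self.k(extended), self.v(extended)
        scores = q @ k.transpose(1, 2) / (segment.size(-1) ** 0.5)       # (B, S, L+S)
        out = F.softmax(scores, dim=-1) @ v                               # (B, S, D)
        return out   # cache `out` (or a slice of it) as the next segment's left context
```

Because the cached context is an output the model has already computed, no separate memory bank has to be built or attended over, which is where the encoder speedup comes from.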
Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation
results: On the English-German, English-French and English-Spanish pairs from MuST-C, applying Shiftable Context to the Augmented Memory Transformer improves the BLEU score by an average of 2.09, 1.83 and 1.95 points across each wait-k value for the three language pairs, respectively, with minimal impact on computation-aware Average Lagging.
Abstract
Transformer models using segment-based processing have been an effective architecture for simultaneous speech translation. However, such models create a context mismatch between training and inference environments, hindering potential translation accuracy. We solve this issue by proposing Shiftable Context, a simple yet effective scheme to ensure that consistent segment and context sizes are maintained throughout training and inference, even with the presence of partially filled segments due to the streaming nature of simultaneous translation. Shiftable Context is also broadly applicable to segment-based transformers for streaming tasks. Our experiments on the English-German, English-French, and English-Spanish language pairs from the MUST-C dataset demonstrate that when applied to the Augmented Memory Transformer, a state-of-the-art model for simultaneous speech translation, the proposed scheme achieves an average increase of 2.09, 1.83, and 1.95 BLEU scores across each wait-k value for the three language pairs, respectively, with a minimal impact on computation-aware Average Lagging.
Multilingual Language Models are not Multicultural: A Case Study in Emotion
for: investigate whether the widely-used multilingual LMs in 2023 reflect differences in emotional expressions across cultures and languages
methods: use Large Language Models (LMs) for multilingual tasks that require emotional sensitivity, and investigate the Anglocentricity of embeddings obtained from LMs and the Western norms reflected in generative LMs
results: multilingual LMs do not successfully learn the culturally appropriate nuances of emotion, and possible research directions towards correcting this are highlighted
Abstract
Emotions are experienced and expressed differently across the world. In order to use Large Language Models (LMs) for multilingual tasks that require emotional sensitivity, LMs must reflect this cultural variation in emotion. In this study, we investigate whether the widely-used multilingual LMs in 2023 reflect differences in emotional expressions across cultures and languages. We find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs do not successfully learn the culturally appropriate nuances of emotion and we highlight possible research directions towards correcting this.
Semantic enrichment towards efficient speech representations
results: The study finds that in-domain semantic enrichment improves semantic extraction on the spoken language understanding task, and that it also improves portability to low-resource languages.
Abstract
Over the past few years, self-supervised learned speech representations have emerged as fruitful replacements for conventional surface representations when solving Spoken Language Understanding (SLU) tasks. Simultaneously, multilingual models trained on massive textual data were introduced to encode language agnostic semantics. Recently, the SAMU-XLSR approach introduced a way to make profit from such textual models to enrich multilingual speech representations with language agnostic semantics. By aiming for better semantic extraction on a challenging Spoken Language Understanding task and in consideration with computation costs, this study investigates a specific in-domain semantic enrichment of the SAMU-XLSR model by specializing it on a small amount of transcribed data from the downstream task. In addition, we show the benefits of the use of same-domain French and Italian benchmarks for low-resource language portability and explore cross-domain capacities of the enriched SAMU-XLSR.
Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective
results: The results show that End-to-End (E2E) spoken NER outperforms pipeline-based systems, with transfer learning from German to Dutch performing particularly well, surpassing the Dutch E2E system by 7% and the Dutch pipeline system by 4%. The study not only demonstrates the feasibility of cross-lingual transfer learning in spoken NER but also points to the need for more comprehensive data collection to further improve the results.
Abstract
Recent advancements in Named Entity Recognition (NER) have significantly improved the identification of entities in textual data. However, spoken NER, a specialized field of spoken document retrieval, lags behind due to its limited research and scarce datasets. Moreover, cross-lingual transfer learning in spoken NER has remained unexplored. This paper utilizes transfer learning across Dutch, English, and German using pipeline and End-to-End (E2E) schemes. We employ Wav2Vec2-XLS-R models on custom pseudo-annotated datasets and investigate several architectures for the adaptability of cross-lingual systems. Our results demonstrate that End-to-End spoken NER outperforms pipeline-based alternatives over our limited annotations. Notably, transfer learning from German to Dutch surpasses the Dutch E2E system by 7% and the Dutch pipeline system by 4%. This study not only underscores the feasibility of transfer learning in spoken NER but also sets promising outcomes for future evaluations, hinting at the need for comprehensive data collection to augment the results.
The Evolution of Substance Use Coverage in the Philadelphia Inquirer
results: The study finds that cannabis and narcotics are the most frequently covered classes of drugs, that hallucinogenic drugs are portrayed more positively than other categories, and that narcotics are portrayed the most negatively. The research aims to highlight the need for accurate and inclusive media portrayals of substance use and addiction, so as to reduce the stereotyping and stigmatization of people struggling with addiction.
Abstract
The media's representation of illicit substance use can lead to harmful stereotypes and stigmatization for individuals struggling with addiction, ultimately influencing public perception, policy, and public health outcomes. To explore how the discourse and coverage of illicit drug use changed over time, this study analyzes 157,476 articles published in the Philadelphia Inquirer over a decade. Specifically, the study focuses on articles that mentioned at least one commonly abused substance, resulting in a sample of 3,903 articles. Our analysis shows that cannabis and narcotics are the most frequently discussed classes of drugs. Hallucinogenic drugs are portrayed more positively than other categories, whereas narcotics are portrayed the most negatively. Our research aims to highlight the need for accurate and inclusive portrayals of substance use and addiction in the media.
for: This paper examines how large pre-trained language models perform in-context learning (ICL) by implicitly simulating and fine-tuning an internal model (e.g., a linear model or a 2-layer MLP) during inference.
methods: The paper proposes an efficient construction, Transformer in Transformer (TinT), that allows a transformer to simulate and fine-tune complex models (e.g., pre-trained language models) internally during inference. It introduces innovative approximation techniques that enable a TinT model with fewer than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass.
results: End-to-end experiments validate TinT's internal fine-tuning procedure on various language modeling and downstream tasks; for example, even with a limited one-step budget, TinT improves the performance of an OPT-125M model by 4-16% absolute on average. These findings suggest that large pre-trained language models are capable of performing intricate subroutines.
Abstract
Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TinT model with less than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe TinT for a OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TinT is included.
Improving Language Plasticity via Pretraining with Active Forgetting
results: Experiments show that PLMs pretrained with the forgetting mechanism not only converge faster during language adaptation but also outperform standard models in a low-data regime, particularly for languages that are distant from English.
Abstract
Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data and compute inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability of learning new embeddings within a limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation but also outperform standard ones in a low-data regime, particularly for languages that are distant from English.
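Operationally, the forgetting mechanism is simple: every K optimizer updates during pretraining, the token-embedding layer is re-initialized while the rest of the transformer keeps training, which forces the body of the model to become good at accommodating fresh embeddings. The loop below is a minimal sketch of that schedule; the reset interval, the initializer, the handling of tied output embeddings and optimizer state, and the `model.embeddings` attribute are all assumptions.

```python
import torch
import torch.nn as nn

def pretrain_with_active_forgetting(model, data_loader, loss_fn, num_steps, reset_every=10000):
    """Standard pretraining loop, except the embedding layer is periodically re-initialized.

    Assumes `model.embeddings` is the nn.Embedding holding the (possibly tied) token embeddings.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    step = 0
    for batch, labels in data_loader:
        if step >= num_steps:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(batch), labels)
        loss.backward()
        optimizer.step()
        step += 1
        # Active forgetting: wipe the embeddings every `reset_every` updates so the
        # transformer body learns to adapt quickly to brand-new embeddings.
        # (Resetting the optimizer state for these parameters may also be desirable.)
        if step % reset_every == 0:
            nn.init.normal_(model.embeddings.weight, mean=0.0, std=0.02)
```

At adaptation time, only a new embedding layer is learned for the target language, and the repeated resets during pretraining are what make that adaptation fast and data-efficient.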
results: Benchmark tests comparing Google Translate and ChatGPT show that ChatGPT performs better; applied to an excerpt of a 1739 letter from Johann Bernoulli to Euler, it produces an excellent translation, highlighting ChatGPT as a valuable translation tool not only for general Latin practitioners but also for specialized Latin translators.
Abstract
The major hindrance in the study of earlier scientific literature is the availability of Latin translations into modern languages. This is particularly true for the works of Euler, who authored about 850 manuscripts, wrote a thousand letters, and received back almost two thousand more. Translations of many of these manuscripts, books and letters have been published in various sources over the last two centuries, but many more have not yet appeared. Fortunately, artificial intelligence (AI) translation can nowadays be used to circumvent the challenges of translating such a substantial number of texts. To validate this tool, benchmark tests were performed to compare the performance of two popular AI translation algorithms, namely Google Translate and ChatGPT. Since ChatGPT was found to perform better on these tests, this translation support was then used on an excerpt of a 1739 letter from Johann Bernoulli to Euler, in which he notifies Euler that he is sending him the first part of his manuscript Hydraulica. The findings highlight ChatGPT as a valuable translation tool, catering not only to general Latin practitioners but also proving beneficial for specialized Latin translators.