paper_authors: Shanqing Cai, Subhashini Venugopalan, Katie Seaver, Xiang Xiao, Katrin Tomanek, Sri Jalasutram, Meredith Ringel Morris, Shaun Kane, Ajit Narayanan, Robert L. MacDonald, Emily Kornman, Daniel Vance, Blair Casey, Steve M. Gleason, Philip Q. Nelson, Michael P. Brenner
results: Validation with 19 non-AAC participants typing by hand on a mobile device yielded motor savings in line with the offline simulation, with relatively small effects on overall typing speed. In lab and field tests, two users with motor impairments achieved text-entry rates 29-60% faster than traditional baselines through phrase and word predictions from context-aware large language models (LLMs). These findings point to substantially accelerated text communication for motor-impaired users and demonstrate a direction for applying LLMs to text-based user interfaces.
Abstract
Finding ways to accelerate text input for individuals with profound motor impairments has been a long-standing area of research. Closing the speed gap for augmentative and alternative communication (AAC) devices such as eye-tracking keyboards is important for improving the quality of life for such individuals. Recent advances in neural networks for natural language pose new opportunities for re-thinking strategies and user interfaces for enhanced text-entry for AAC users. In this paper, we present SpeakFaster, consisting of large language models (LLMs) and a co-designed user interface for text entry in a highly-abbreviated form, saving 57% more motor actions than traditional predictive keyboards in offline simulation. A pilot study with 19 non-AAC participants typing on a mobile device by hand demonstrated gains in motor savings in line with the offline simulation, while introducing relatively small effects on overall typing speed. Lab and field testing on two eye-gaze typing users with amyotrophic lateral sclerosis (ALS) demonstrated text-entry rates 29-60% faster than traditional baselines, due to significant saving of expensive keystrokes achieved through phrase and word predictions from context-aware LLMs. These findings provide a strong foundation for further exploration of substantially-accelerated text communication for motor-impaired users and demonstrate a direction for applying LLMs to text-based user interfaces.
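The interaction at the heart of SpeakFaster — expanding a few typed initials into a full phrase with a context-aware LLM — can be sketched as follows. This is a minimal illustration, assuming an abbreviation scheme of word-initial letters and a generic `llm_complete` sampling function; it is not the paper's actual implementation.

```python
# Hypothetical sketch of word-initial abbreviation expansion with a
# context-aware LLM. The prompt format and the `llm_complete` interface
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Candidate:
    phrase: str
    score: float

def expand_abbreviation(abbrev: str, context: str, llm_complete) -> list[Candidate]:
    """Return full phrases whose words start with the typed initials.

    `llm_complete(prompt, n)` is assumed to yield n (text, logprob) pairs.
    """
    prompt = (
        f"Conversation so far: {context}\n"
        f"The user typed the initials '{abbrev}'. "
        f"Suggest a full phrase, one word per initial:"
    )
    candidates = []
    for text, logprob in llm_complete(prompt, n=5):
        words = text.strip().split()
        # Keep only expansions consistent with the typed initials.
        if len(words) == len(abbrev) and all(
            w and w[0].lower() == c.lower() for w, c in zip(words, abbrev)
        ):
            candidates.append(Candidate(" ".join(words), logprob))
    return sorted(candidates, key=lambda c: -c.score)
```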
T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training
results: Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods on tasks such as organ and tumor segmentation and disease classification, underscoring its potential for representation learning in 3D medical image analysis.
Abstract
Expert annotation of 3D medical image for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects the incorporation of domain knowledge from medicine. To incorporate medical knowledge into visual representation learning, vision-language pre-training (VLP) has shown promising results on 2D images. However, existing VLP approaches become generally impractical when applied to high-resolution 3D medical images due to GPU hardware constraints and the potential loss of critical details caused by downsampling, which is the intuitive solution to hardware constraints. To address the above limitations, we introduce T3D, the first VLP framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration. These tasks focus on learning 3D visual representations from high-resolution 3D medical images and integrating clinical knowledge from radiology reports, without distorting information through forced alignment of downsampled volumes with detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification. This underlines T3D's potential in representation learning for 3D medical image analysis. All data and code will be available upon acceptance.
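The first of the two pretext tasks, text-informed contrastive learning, is a variant of the standard symmetric InfoNCE objective between paired image and text embeddings. A minimal sketch, assuming pooled per-volume and per-report feature vectors (encoder architectures and the temperature are placeholders; the image-restoration task is omitted):

```python
# A minimal sketch of text-informed contrastive learning between 3D volume
# features and radiology-report embeddings (symmetric InfoNCE). Encoder
# architectures and the temperature value are assumptions.
import torch
import torch.nn.functional as F

def text_informed_contrastive_loss(volume_feats: torch.Tensor,
                                   text_feats: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """volume_feats, text_feats: (batch, dim) pooled embeddings."""
    v = F.normalize(volume_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: match each volume to its report and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```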
SymNoise: Advancing Language Model Fine-tuning with Symmetric Noise
results: SymNoise consistently outperforms the existing method NEFTune across different models and baseline datasets, reaching an AlpacaEval score of 69.04%, a 6.7% improvement over NEFTune.
Abstract
In this paper, we introduce a novel fine-tuning technique for language models, which involves incorporating symmetric noise into the embedding process. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, and OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
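The abstract does not spell out the exact form of the symmetric noise, so the sketch below assumes a zero-mean Rademacher (±1) perturbation scaled like NEFTune's α/√(L·d), applied to token embeddings at training time only:

```python
# Sketch of noisy-embedding fine-tuning in the style of NEFTune/SymNoise.
# The symmetric (Rademacher, uniform over {-1, +1}) noise and the
# alpha/sqrt(L*d) scaling are assumptions extrapolated from NEFTune; the
# abstract does not give SymNoise's exact formulation.
import torch

def add_symmetric_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """embeddings: (batch, seq_len, dim) token embeddings; training time only."""
    _, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    # Zero-mean symmetric noise: each component is +1 or -1 with equal probability.
    signs = torch.randint(0, 2, embeddings.shape, device=embeddings.device) * 2 - 1
    return embeddings + scale * signs.to(embeddings.dtype)
```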
Bigger is not Always Better: The Effect of Context Size on Speech Pre-Training
for: investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning
methods: principally investigate contrastive predictive coding (CPC), which we adapt to be able to precisely control the amount of context visible to the model during training and inference
results: find that phone discriminability in the resulting model representations peaks at around 40 ms of preceding context, and that having too much context (beyond around 320 ms) substantially degrades the quality of the representations.
Abstract
It has been generally assumed in the automatic speech recognition (ASR) literature that it is better for models to have access to wider context windows. Yet, many of the potential reasons this might be true in the supervised setting do not necessarily transfer over to the case of unsupervised learning. We investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning. We principally investigate contrastive predictive coding (CPC), which we adapt to be able to precisely control the amount of context visible to the model during training and inference. We find that phone discriminability in the resulting model representations peaks at around 40 ms of preceding context, and that having too much context (beyond around 320 ms) substantially degrades the quality of the representations. Surprisingly, we find that this pattern also transfers to supervised ASR when the pre-trained representations are used as frozen input features. Our results point to potential changes in the design of current upstream architectures to better facilitate a variety of downstream tasks.
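One way to realize the controlled-context setup the authors describe is to recompute the CPC context vector from only the last `window` frames at each step. The sketch below is illustrative (and deliberately simple rather than efficient); the paper's actual adaptation of CPC may differ:

```python
# Illustrative sketch of limiting the context visible to a CPC-style
# autoregressive model: the context network only sees the last `window`
# frames. The truncation mechanism here is an assumption about one way to
# realize the controlled-context setup, not the paper's exact method.
import torch
import torch.nn as nn

class WindowedContextEncoder(nn.Module):
    def __init__(self, dim: int = 256, window_ms: int = 40, frame_ms: int = 10):
        super().__init__()
        self.window = max(1, window_ms // frame_ms)  # frames of visible context
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        """z: (batch, T, dim) frame encodings; returns context vectors c_t,
        each computed from at most `self.window` preceding frames."""
        outputs = []
        for t in range(z.size(1)):
            start = max(0, t + 1 - self.window)
            c, _ = self.rnn(z[:, start:t + 1])   # re-run on the truncated span
            outputs.append(c[:, -1])
        return torch.stack(outputs, dim=1)
```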
Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?
results: The paper conducts a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments to assess model performance.
Abstract
Fluency is a crucial goal of all Natural Language Generation (NLG) systems. Widely used automatic evaluation metrics fall short in capturing the fluency of machine-generated text. Assessing the fluency of NLG systems poses a challenge since these models are not limited to simply reusing words from the input but may also generate abstractions. Existing reference-based fluency evaluations, such as word overlap measures, often exhibit weak correlations with human judgments. This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. We also experiment with other available multilingual Language Models (LMs). To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments. Our code and human-annotated benchmark test-set for fluency is available at https://github.com/AnanyaCoder/TextFluencyForIndicLanaguges.
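A common realization of reference-free, LM-based fluency scoring is to map a sentence's per-token log-likelihood (i.e., its perplexity) to a bounded score. The sketch below assumes a generic `lm_logprob` function standing in for the trained RNN or multilingual LM; the paper's exact scoring function is not given in the abstract:

```python
# A minimal reference-free fluency scorer: mean per-token log-probability
# under a language model, mapped to (0, 1]. This perplexity-based scoring
# is one common realization; the paper's exact formulation may differ.
import math

def fluency_score(sentence: str, lm_logprob) -> float:
    """`lm_logprob(tokens)` is assumed to return the summed log-probability
    of the token sequence under a trained (RNN or multilingual) LM."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    avg_logprob = lm_logprob(tokens) / len(tokens)
    perplexity = math.exp(-avg_logprob)
    return 1.0 / perplexity      # higher = more fluent
```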
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
paper_authors: Andrés Villa, Juan Carlos León Alcázar, Alvaro Soto, Bernard Ghanem
for: To evaluate the effectiveness of instruction-tuned large vision and language models (IT-LVLMs) on fundamental computer vision tasks.
methods: The study introduces a multi-modal evaluation benchmark named MERLIM to assess IT-LVLM performance on computer vision tasks. MERLIM contains 279,000 image-question pairs and emphasizes detecting cross-modal "hallucination" events, where the language output refers to visual concepts that lack effective grounding in the image.
results: State-of-the-art IT-LVLMs remain limited at identifying fine-grained visual concepts; across tasks the models lean on strong global visual patterns and textual biases, leaving considerable room for improvement in recognizing fine-grained visual detail.
Abstract
Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot vision tasks. These large pre-trained architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and arbitrary visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the performance of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 279K image-question pairs, and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs, where the language output refers to visual concepts that lack any effective grounding in the image. Our results show that state-of-the-art IT-LVLMs are still limited at identifying fine-grained visual concepts, object hallucinations are common across tasks, and their results are strongly biased by small variations in the input query, even if the queries have the very same semantics. Our findings also suggest that these models have weak visual groundings but they can still make adequate guesses by global visual patterns or textual biases contained in the LLM component.
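A hallucination probe of the kind MERLIM emphasizes can be illustrated by querying the model about an object known to be absent from the image and counting affirmative answers. The prompt template and answer parsing below are assumptions for illustration, not MERLIM's actual protocol:

```python
# Illustrative check for object hallucination in an IT-LVLM: ask about an
# object absent from the image's ground-truth labels and flag "yes" answers.
# The query template and yes/no parsing are illustrative assumptions.
def hallucination_rate(samples, model_answer) -> float:
    """samples: iterable of (image, ground_truth_objects, absent_object).
    `model_answer(image, question)` is assumed to return the model's text."""
    hallucinated = total = 0
    for image, gt_objects, absent in samples:
        assert absent not in gt_objects
        reply = model_answer(image, f"Is there a {absent} in this image?")
        total += 1
        if reply.strip().lower().startswith("yes"):
            hallucinated += 1   # the answer has no grounding in the image
    return hallucinated / max(total, 1)
```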
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
paper_authors: Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski
for: This work examines interpretability analyses of trained models (e.g., Transformers), showing that methods which inspect individual parts of the model in isolation can be misleading.
Abstract
Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts of the model, rather than consider the network as a whole. We consider a simple synthetic setup of learning a (bounded) Dyck language. Theoretically, we show that the set of models that (exactly or approximately) solve this task satisfy a structural characterization derived from ideas in formal languages (the pumping lemma). We use this characterization to show that the set of optima is qualitatively rich; in particular, the attention pattern of a single layer can be "nearly randomized", while preserving the functionality of the network. We also show via extensive experiments that these constructions are not merely a theoretical artifact: even after severely constraining the architecture of the model, vastly different solutions can be reached via standard training. Thus, interpretability claims based on inspecting individual heads or weight matrices in the Transformer can be misleading.
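The synthetic setup — a bounded Dyck language, i.e., balanced brackets with nesting depth capped at a fixed bound — can be made concrete with a small sampler. The open/close probabilities below are arbitrary illustration choices:

```python
# A small sketch of the synthetic task: sampling strings from a bounded
# Dyck language (balanced brackets, nesting depth <= max_depth). The
# open/close probabilities are arbitrary illustration choices.
import random

def sample_bounded_dyck(length: int, max_depth: int, pairs=("()", "[]")) -> str:
    assert length % 2 == 0, "balanced strings have even length"
    out, stack = [], []
    while len(out) < length:
        remaining = length - len(out)
        # Must close if at max depth or with just enough room to close everything.
        must_close = len(stack) >= max_depth or remaining <= len(stack)
        if stack and (must_close or random.random() < 0.5):
            out.append(stack.pop())              # close the most recent bracket
        else:
            open_b, close_b = random.choice(pairs)
            out.append(open_b)
            stack.append(close_b)
    return "".join(out)

print(sample_bounded_dyck(20, max_depth=3))      # e.g. ([])()[[()]]()([])()
```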
CEScore: Simple and Efficient Confidence Estimation Model for Evaluating Split and Rephrase
paper_authors: AlMotasem Bellah Al Ajlouni, Jinlong Li
for: To evaluate the quality of the split and rephrase (SR) task in natural language processing (NLP).
methods: The paper introduces a novel statistical model to automatically evaluate SR quality. By mimicking how humans evaluate SR, the model provides four metrics (Sscore, Gscore, Mscore, and CEscore) assessing simplicity, grammaticality, meaning preservation, and overall quality, respectively.
results: In experiments with 26 models, CEScore correlates strongly with human evaluations, reaching a Spearman correlation of 0.98 at the model level, indicating that CEScore can serve as a simple and effective quality metric for SR.
Abstract
The split and rephrase (SR) task aims to divide a long, complex sentence into a set of shorter, simpler sentences that convey the same meaning. This challenging NLP problem has gained increased attention recently because of its benefits as a pre-processing step in other NLP tasks. Evaluating the quality of SR is challenging, as there is no automatic metric fit to evaluate this task. In this work, we introduce CEScore, a novel statistical model to automatically evaluate the SR task. By mimicking the way humans evaluate SR, CEScore provides 4 metrics (Sscore, Gscore, Mscore, and CEscore) to assess simplicity, grammaticality, meaning preservation, and overall quality, respectively. In experiments with 26 models, CEScore correlates strongly with human evaluations, achieving 0.98 in Spearman correlation at the model level. This underscores the potential of CEScore as a simple and effective metric for assessing the overall quality of SR models.
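The abstract does not give the formula that combines the subscores into the overall CEscore, so the snippet below shows only one plausible aggregation (a geometric mean of scores assumed to lie in (0, 1]) as a hypothetical illustration:

```python
# Hypothetical aggregation of the subscores into an overall CEscore. The
# paper's actual formula is not given in the abstract; a geometric mean of
# subscores assumed to lie in (0, 1] is just one plausible choice.
def ce_score(s_score: float, g_score: float, m_score: float) -> float:
    """Combine simplicity, grammaticality, and meaning-preservation scores."""
    return (s_score * g_score * m_score) ** (1.0 / 3.0)

print(round(ce_score(0.90, 0.80, 0.95), 3))  # 0.881
```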
NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian
results: The study finds that the strengths and flexibility of GLMs vary markedly, and at times unexpectedly, across downstream tasks, and that performance is further shaped by the language and cultural context the models are trained in.
Abstract
Recent advancements in Generative Language Models (GLMs) have transformed Natural Language Processing (NLP) by showcasing the effectiveness of the "pre-train, prompt, and predict" paradigm in utilizing pre-trained GLM knowledge for diverse applications. Despite their potential, these capabilities lack adequate quantitative characterization due to the absence of comprehensive benchmarks, particularly for low-resource languages. Existing low-resource benchmarks focus on discriminative language models like BERT, neglecting the evaluation of generative language models. Moreover, current benchmarks often overlook measuring generalization performance across multiple tasks, a crucial metric for GLMs. To bridge these gaps, we introduce NLEBench, a comprehensive benchmark tailored for evaluating natural language generation capabilities in Norwegian, a low-resource language. We use Norwegian as a case study to explore whether current GLMs and benchmarks in mainstream languages like English can reveal the unique characteristics of underrepresented languages. NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought investigation. It features two high-quality, human-annotated datasets: an instruction dataset covering traditional Norwegian cultures, idioms, slang, and special expressions, and a document-grounded multi-label dataset for topic classification, question answering, and summarization. This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures. Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.
Bridging Background Knowledge Gaps in Translation with Automatic Explicitation
results: Experiments show that the automatically generated explicitations help answer questions more accurately in a multilingual question answering framework.
Abstract
Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. Despite its potential to help users, NLP research on explicitation is limited because of the dearth of adequate evaluation methods. This work introduces techniques for automatically generating explicitations, motivated by WikiExpl: a dataset that we collect from Wikipedia and annotate with human translators. The resulting explicitations are useful as they help answer questions more accurately in a multilingual question answering framework.
On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi
for: The paper focuses on improving the performance of shallow models for Named Entity Recognition (NER) in low-resource languages, specifically Marathi.
methods: The authors propose a hybrid approach that combines traditional deep learning models (CNN and LSTM) with a BERT-based subword tokenizer to improve the accuracy of NER models.
results: Replacing a traditional word-based tokenizer with a BERT tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT, demonstrating the importance of subword tokenization for NER in low-resource languages.
Abstract
Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low-resource languages. In this work, we focus on NER for low-resource languages and present our case study in the context of the Indian language Marathi. The advancement of NLP research revolves around the utilization of pre-trained transformer models such as BERT for the development of NER models. However, we focus on improving the performance of shallow models based on CNN and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models are still relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple approach of replacing a traditional word-based tokenizer with a BERT tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We show the importance of using subword tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on the L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.
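The hybrid recipe is simple to express: tokenize with a pretrained BERT subword tokenizer, then feed the subword IDs to a plain shallow tagger. In the sketch below, the checkpoint id, hidden size, and tag count are assumptions for illustration:

```python
# Sketch of the hybrid idea: a BERT subword tokenizer feeding a shallow
# bidirectional LSTM tagger. The checkpoint id, hidden size, and tag count
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
from transformers import AutoTokenizer

class SubwordLSTMTagger(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, num_tags)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(input_ids))
        return self.head(h)                      # (batch, seq_len, num_tags)

# Assumed MahaBERT checkpoint id; any BERT-family tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/marathi-bert")
model = SubwordLSTMTagger(tokenizer.vocab_size, num_tags=7)
batch = tokenizer(["मुंबई ही महाराष्ट्राची राजधानी आहे"], return_tensors="pt")
logits = model(batch["input_ids"])               # per-subword NER tag scores
```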