results: In the automatic evaluation, they obtained competitive results in both the English -> Hebrew and Hebrew -> English directions.
Abstract
This paper describes the UvA-MT's submission to the WMT 2023 shared task on general machine translation. We participate in the constrained track in two directions: English <-> Hebrew. In this competition, we show that by using one model to handle bidirectional tasks, as a minimal setting of Multilingual Machine Translation (MMT), it is possible to achieve comparable results with that of traditional bilingual translation for both directions. By including effective strategies, like back-translation, re-parameterized embedding table, and task-oriented fine-tuning, we obtained competitive final results in the automatic evaluation for both English -> Hebrew and Hebrew -> English directions.
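The minimal MMT setup here means one model serves both directions. A common way to realize this, sketched below, is to duplicate each parallel pair into two tagged training examples; the "<2xx>" target-language tag convention is an illustrative assumption, not necessarily the one used in the UvA-MT submission.

```python
# Build bidirectional training examples for one model from parallel data.
# The "<2xx>" target-language tags are an illustrative convention, not
# necessarily the exact format used in the UvA-MT system.
pairs = [("Hello world", "שלום עולם")]  # (English, Hebrew) toy pair

examples = []
for en, he in pairs:
    examples.append((f"<2he> {en}", he))  # English -> Hebrew direction
    examples.append((f"<2en> {he}", en))  # Hebrew -> English direction
```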
FiLM: Fill-in Language Models for Any-Order Generation
for: The Fill-in Language Model (FiLM) aims to enable flexible generation at any position, so that bidirectional context can be exploited when filling in text.
methods: FiLM extends the masked language modeling objective by adopting varying mask probabilities sampled from the Beta distribution to enhance its generative capabilities. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the output is fluent and coherent with the surrounding context.
results: In both automatic and human evaluations, FiLM outperforms infilling methods that rely on left-to-right language models. FiLM is easy to implement, can be trained from scratch or fine-tuned from a left-to-right language model, and scales across model sizes.
Abstract
Language models have become the backbone of today's AI systems. However, their predominant left-to-right generation limits the use of bidirectional context, which is essential for tasks that involve filling text in the middle. We propose the Fill-in Language Model (FiLM), a new language modeling approach that allows for flexible generation at any position without adhering to a specific generation order. Its training extends the masked language modeling objective by adopting varying mask probabilities sampled from the Beta distribution to enhance the generative capabilities of FiLM. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the outputs are fluent and are coherent with the surrounding context. In both automatic and human evaluations, FiLM outperforms existing infilling methods that rely on left-to-right language models trained on rearranged text segments. FiLM is easy to implement and can be either trained from scratch or fine-tuned from a left-to-right language model. Notably, as the model size grows, FiLM's perplexity approaches that of strong left-to-right language models of similar sizes, indicating FiLM's scalability and potential as a large language model.
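The distinctive training step is the varying mask rate. Below is a minimal sketch, assuming a standard PyTorch masked-LM setup: one mask probability per sequence is drawn from a Beta distribution. The function name, the Beta parameters, and the -100 ignore-index convention are illustrative choices, not the paper's released code.

```python
import torch

def film_mask_batch(input_ids, mask_token_id, alpha=2.0, beta=5.0):
    """Corrupt a batch for FiLM-style training: draw one mask probability
    per sequence from Beta(alpha, beta), then mask tokens at that rate.
    The alpha/beta values are illustrative, not the paper's."""
    batch, seq_len = input_ids.shape
    # One mask rate per sequence, sampled from the Beta distribution.
    rates = torch.distributions.Beta(alpha, beta).sample((batch, 1))
    mask = torch.rand(batch, seq_len) < rates
    # Predict only the masked positions (-100 is the usual ignore index).
    labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    return corrupted, labels
```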
Prompting Scientific Names for Zero-Shot Species Recognition
results: The study finds that using common names (e.g., mountain hare) instead of scientific names (e.g., Lepus Timidus) in prompts improves CLIP's recognition accuracy, achieving a 2~5x improvement.
Abstract
Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin), because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.
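A minimal sketch of the prompting comparison using the Hugging Face CLIP API; the checkpoint and image path are placeholders, and the paper's exact evaluation protocol may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("hare.jpg")  # hypothetical input image
# Scientific-name prompt vs. common-name prompt for the same species.
prompts = ["a photo of Lepus Timidus", "a photo of a mountain hare"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, 2)
probs = logits.softmax(dim=-1)  # compare the two prompt styles
```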
Empirical study of pretrained multilingual language models for zero-shot cross-lingual generation
methods: The paper tests alternative mPLMs, including mBART and NLLB, considering both full fine-tuning and parameter-efficient fine-tuning with adapters.
results: The study finds that mBART with adapters performs similarly to mT5 of the same size, and NLLB can be competitive in some cases. In addition, tuning the learning rate used for fine-tuning helps alleviate the problem of generation in the wrong language.
Abstract
Zero-shot cross-lingual generation assumes finetuning the multilingual pretrained language model (mPLM) on a generation task in one language and then using it to make predictions for this task in other languages. Previous works notice a frequent problem of generation in a wrong language and propose approaches to address it, usually using mT5 as a backbone model. In this work, we test alternative mPLMs, such as mBART and NLLB, considering full finetuning and parameter-efficient finetuning with adapters. We find that mBART with adapters performs similarly to mT5 of the same size, and NLLB can be competitive in some cases. We also underline the importance of tuning learning rate used for finetuning, which helps to alleviate the problem of generation in the wrong language.
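As a rough illustration of the parameter-efficient setting, the sketch below attaches LoRA modules to mBART via the peft library. Note the paper studies adapters, so LoRA here merely stands in for adapter-style tuning, and all hyperparameters (including the learning rate the paper stresses) are illustrative.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # mBART attention projections
    task_type="SEQ2SEQ_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter-style weights train
# Per the paper's finding, the fine-tuning learning rate is worth sweeping
# rather than left at a default, as it affects wrong-language generation.
```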
Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis
for: This paper assesses the performance of OpenAI’s GPT-4V model in multimodal medical diagnosis, evaluating its ability to distinguish between medical image modalities and anatomy, as well as its ability to generate comprehensive reports.
methods: The evaluation uses 17 human body systems and 8 modalities of medical images, with or without patient history provided, to probe GPT-4V's ability on multiple clinical tasks such as imaging modality and anatomy recognition, disease diagnosis, and report generation.
results: The study finds that while GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy, it faces significant challenges in disease diagnosis and generating comprehensive reports, highlighting the limitations of large multimodal models in supporting real-world medical applications and clinical decision-making.
Abstract
Driven by the large foundation models, the development of artificial intelligence has witnessed tremendous progress lately, leading to a surge of general interest from the public. In this study, we aim to assess the performance of OpenAI's newest model, GPT-4V(ision), specifically in the realm of multimodal medical diagnosis. Our evaluation encompasses 17 human body systems, including Central Nervous System, Head and Neck, Cardiac, Chest, Hematology, Hepatobiliary, Gastrointestinal, Urogenital, Gynecology, Obstetrics, Breast, Musculoskeletal, Spine, Vascular, Oncology, Trauma, Pediatrics, with images taken from 8 modalities used in daily clinic routine, e.g., X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Digital Subtraction Angiography (DSA), Mammography, Ultrasound, and Pathology. We probe GPT-4V's ability on multiple clinical tasks with or without patient history provided, including imaging modality and anatomy recognition, disease diagnosis, report generation, disease localisation. Our observation shows that, while GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy, it faces significant challenges in disease diagnosis and generating comprehensive reports. These findings underscore that while large multimodal models have made significant advancements in computer vision and natural language processing, they remain far from being able to effectively support real-world medical applications and clinical decision-making. All images used in this report can be found in https://github.com/chaoyi-wu/GPT-4V_Medical_Evaluation.
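For context, probing GPT-4V through the public chat API looks roughly like the sketch below; this is not the authors' evaluation harness, and the image URL and patient history are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Identify the imaging modality and anatomy shown. "
                     "Patient history: 62-year-old with chronic cough."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chest_ct.png"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```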
Reformulating NLP tasks to Capture Longitudinal Manifestation of Language Disorders in People with Dementia
results: The study finds that the proposed communication marker robustly and reliably characterizes the language of people with dementia, outperforming existing linguistic approaches, and correlates significantly with clinical markers of behaviour. Moreover, the proposed linguistic disorder markers provide useful insights into gradual language impairment associated with disease progression.
Abstract
Dementia is associated with language disorders which impede communication. Here, we automatically learn linguistic disorder patterns by making use of a moderately-sized pre-trained language model and forcing it to focus on reformulated natural language processing (NLP) tasks and associated linguistic patterns. Our experiments show that NLP tasks that encapsulate contextual information and enhance the gradient signal with linguistic patterns benefit performance. We then use the probability estimates from the best model to construct digital linguistic markers measuring the overall quality in communication and the intensity of a variety of language disorders. We investigate how the digital markers characterize dementia speech from a longitudinal perspective. We find that our proposed communication marker is able to robustly and reliably characterize the language of people with dementia, outperforming existing linguistic approaches; and shows external validity via significant correlation with clinical markers of behaviour. Finally, our proposed linguistic disorder markers provide useful insights into gradual language impairment associated with disease progression.
Bounding and Filling: A Fast and Flexible Framework for Image Captioning
results: The model achieves state-of-the-art performance on the MS-COCO benchmark (CIDEr 125.6) while being 9.22x faster than the autoregressive baseline; in the semi-autoregressive setting, it reaches a CIDEr of 128.4 with a 3.69x speedup.
Abstract
Most image captioning models following an autoregressive manner suffer from significant inference latency. Several models adopted a non-autoregressive manner to speed up the process. However, the vanilla non-autoregressive manner results in subpar performance, since it generates all words simultaneously, which fails to capture the relationships between words in a description. The semi-autoregressive manner employs a partially parallel method to preserve performance, but it sacrifices inference speed. In this paper, we introduce a fast and flexible framework for image captioning called BoFiCap based on bounding and filling techniques. The BoFiCap model leverages the inherent characteristics of image captioning tasks to pre-define bounding boxes for image regions and their relationships. Subsequently, the BoFiCap model fills corresponding words in each box using two generation manners. Leveraging the box hints, our filling process allows each word to better perceive other words. Additionally, our model offers flexible image description generation: 1) by employing different generation manners based on speed or performance requirements, 2) producing varied sentences based on user-specified boxes. Experimental evaluations on the MS-COCO benchmark dataset demonstrate that our framework in a non-autoregressive manner achieves the state-of-the-art on the task-specific metric CIDEr (125.6) with a 9.22x speedup over the baseline model with an autoregressive manner; in a semi-autoregressive manner, our method reaches 128.4 on CIDEr with a 3.69x speedup. Our code and data are available at https://github.com/ChangxinWang/BoFiCap.
Enhancing Stance Classification with Quantified Moral Foundations
paper_authors: Hong Zhang, Prasanta Bhattacharya, Wei Gao, Liang Ze Wong, Brandon Siyuan Loh, Joseph J. P. Simons, Jisun An
for: This paper aims to enhance stance detection on social media by incorporating deeper psychological attributes, specifically individuals' moral foundations.
methods: The paper extracts moral foundation features from text and combines them with message semantic features to classify stances at both the message and user levels.
results: Preliminary results suggest that encoding moral foundations can enhance the performance of stance detection tasks and help illuminate the associations between specific moral foundations and online stances. The results highlight the importance of considering deeper psychological attributes in stance analysis and underscore the role of moral foundations in guiding online social behavior.
Abstract
This study enhances stance detection on social media by incorporating deeper psychological attributes, specifically individuals' moral foundations. These theoretically-derived dimensions aim to provide a comprehensive profile of an individual's moral concerns which, in recent work, has been linked to behaviour in a range of domains, including society, politics, health, and the environment. In this paper, we investigate how moral foundation dimensions can contribute to predicting an individual's stance on a given target. Specifically we incorporate moral foundation features extracted from text, along with message semantic features, to classify stances at both message- and user-levels across a range of targets and models. Our preliminary results suggest that encoding moral foundations can enhance the performance of stance detection tasks and help illuminate the associations between specific moral foundations and online stances on target topics. The results highlight the importance of considering deeper psychological attributes in stance analysis and underscores the role of moral foundations in guiding online social behavior.
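A toy sketch of the general recipe, combining semantic features with lexicon-derived moral foundation scores for stance classification; the mini-lexicon, features, and data below are illustrative, not the paper's pipeline.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy moral-foundations lexicon; real work would use a full MFD lexicon.
MF_LEXICON = {
    "care":      {"harm", "protect", "suffer", "safe"},
    "fairness":  {"fair", "equal", "justice", "rights"},
    "loyalty":   {"loyal", "betray", "nation", "family"},
    "authority": {"law", "order", "obey", "tradition"},
    "sanctity":  {"pure", "sacred", "disgust", "sin"},
}

def mf_features(texts):
    """Count lexicon hits per foundation, normalized by message length."""
    feats = np.zeros((len(texts), len(MF_LEXICON)))
    for i, t in enumerate(texts):
        toks = t.lower().split()
        for j, words in enumerate(MF_LEXICON.values()):
            feats[i, j] = sum(tok in words for tok in toks) / max(len(toks), 1)
    return feats

texts = ["We must protect families and keep them safe",
         "The law is the law: obey it"]
labels = [1, 0]  # toy stance labels

tfidf = TfidfVectorizer().fit_transform(texts)   # semantic features
X = hstack([tfidf, mf_features(texts)])          # concatenate both views
clf = LogisticRegression().fit(X, labels)
```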
Merging Experts into One: Improving Computational Efficiency of Mixture of Experts
paper_authors: Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, Dacheng Tao
for: Scaling up language models usually advances NLP tasks but increases computational cost. A sparse Mixture of Experts (MoE) can reduce the cost, yet the computation escalates quickly as the number of activated experts grows, limiting practical utility. This paper proposes a computation-efficient approach called Merging Experts into One (MEO) that retains the advantage of adding more experts without substantially increasing computational cost.
methods: We first demonstrate the superiority of selecting multiple experts, then propose Merging Experts into One (MEO), a computation-efficient approach that reduces the computation cost to that of a single expert. We further propose a token-level attention block that enhances the efficiency and performance of token-level MEO.
results: Extensive experiments show that MEO significantly improves computational efficiency, e.g., FLOPs drop from 72.0G (vanilla MoE) to 28.6G (MEO). With the token-level attention block, MEO also improves performance, e.g., an average GLUE score of 83.3% (MEO) vs. 82.6% (vanilla MoE).
Abstract
Scaling the size of language models usually leads to remarkable advancements in NLP tasks. But it often comes with a price of growing computational cost. Although a sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters (e.g., one expert) for each input, its computation escalates significantly if increasing the number of activated experts, limiting its practical utility. Can we retain the advantages of adding more experts without substantially increasing the computational costs? In this paper, we first demonstrate the superiority of selecting multiple experts and then propose a computation-efficient approach called \textbf{\texttt{Merging Experts into One} (MEO), which reduces the computation cost to that of a single expert. Extensive experiments show that MEO significantly improves computational efficiency, e.g., FLOPS drops from 72.0G of vanilla MoE to 28.6G (MEO). Moreover, we propose a token-level attention block that further enhances the efficiency and performance of token-level MEO, e.g., 83.3\% (MEO) vs. 82.6\% (vanilla MoE) average score on the GLUE benchmark. Our code will be released upon acceptance. Code will be released at: \url{https://github.com/Shwai-He/MEO}.
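A minimal sketch of the core idea for a bank of linear experts: mix the experts' parameters with the gate weights once, then run a single expert-sized forward pass; the gating granularity and initialization are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MergedExpertsFFN(nn.Module):
    """Sketch of the MEO idea: merge expert weights under the gate, then
    apply one expert-sized matmul to all tokens instead of E expert passes."""
    def __init__(self, num_experts, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_experts, d_out, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_experts, d_out))
        self.gate = nn.Linear(d_in, num_experts)

    def forward(self, x):                                # x: (batch, seq, d_in)
        g = self.gate(x.mean(dim=1)).softmax(dim=-1)     # one gate per sequence
        W = torch.einsum("be,eoi->boi", g, self.weight)  # merge experts into one
        b = g @ self.bias                                # (batch, d_out)
        # Single forward pass with the merged expert, amortized over tokens.
        return torch.einsum("boi,bsi->bso", W, x) + b.unsqueeze(1)

out = MergedExpertsFFN(8, 64, 64)(torch.randn(2, 16, 64))  # (2, 16, 64)
```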
Assessing the Reliability of Large Language Model Knowledge
paper_authors: Weixuan Wang, Barry Haddow, Alexandra Birch, Wei Peng
for: Assessing the reliability of the knowledge encoded in large language models (LLMs).
methods: The paper proposes a novel metric, MOdel kNowledge relIabiliTy scORe (MONITOR), designed to directly measure LLMs' factual reliability.
results: Experiments on a range of 12 LLMs demonstrate MONITOR's effectiveness while maintaining a low computational overhead. In addition, the authors release the Factual Knowledge Test Corpus (FKTC) test set to foster research along this line.
Abstract
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. LLMs are typically evaluated using accuracy, yet this metric does not capture the vulnerability of LLMs to hallucination-inducing factors like prompt and context variability. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? In this paper, we propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability. MONITOR computes the distance between the probability distributions of a valid output and its counterparts produced by the same LLM probing the same fact using different styles of prompts and contexts. Experiments on a comprehensive range of 12 LLMs demonstrate the effectiveness of MONITOR in evaluating the factual reliability of LLMs while maintaining a low computational overhead. In addition, we release the FKTC (Factual Knowledge Test Corpus) test set, containing 210,158 prompts in total to foster research along this line (https://github.com/Vicky-Wil/MONITOR).
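A rough sketch of the underlying computation, under simplifying assumptions: probe the same fact with differently styled prompts and measure the distance between the resulting answer distributions. The model, the prompts, and the total-variation distance below are illustrative; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Two prompt styles probing the same fact.
prompts = ["The capital of France is",
           "Q: What is the capital of France? A:"]

dists = []
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # next-token logits
    dists.append(F.softmax(logits, dim=-1))

# Distance between the two answer distributions (total variation here).
tv_distance = 0.5 * (dists[0] - dists[1]).abs().sum().item()
```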
RSVP: Customer Intent Detection via Agent Response Contrastive and Generative Pre-Training
results: Compared with state-of-the-art baselines, RSVP achieves average improvements of 4.95% in accuracy, 3.4% in MRR@3, and 2.75% in MRR@5.
Abstract
The dialogue systems in customer services have been developed with neural models to provide users with precise answers and round-the-clock support in task-oriented conversations by detecting customer intents based on their utterances. Existing intent detection approaches have highly relied on adaptively pre-training language models with large-scale datasets, yet the predominant cost of data collection may hinder their superiority. In addition, they neglect the information within the conversational responses of the agents, which have a lower collection cost, but are significant to customer intent as agents must tailor their replies based on the customers' intent. In this paper, we propose RSVP, a self-supervised framework dedicated to task-oriented dialogues, which utilizes agent responses for pre-training in a two-stage manner. Specifically, we introduce two pre-training tasks to incorporate the relations of utterance-response pairs: 1) Response Retrieval by selecting a correct response from a batch of candidates, and 2) Response Generation by mimicking agents to generate the response to a given utterance. Our benchmark results for two real-world customer service datasets show that RSVP significantly outperforms the state-of-the-art baselines by 4.95% for accuracy, 3.4% for MRR@3, and 2.75% for MRR@5 on average. Extensive case studies are investigated to show the validity of incorporating agent responses into the pre-training stage.
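A minimal sketch of the first pre-training task, Response Retrieval, as an in-batch contrastive objective; the encoders and temperature are placeholders, and the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def response_retrieval_loss(utt_emb, resp_emb, temperature=0.05):
    """utt_emb, resp_emb: (batch, dim) embeddings of paired utterances and
    agent responses. Each utterance should score its own response higher
    than every other response in the batch (in-batch negatives)."""
    utt = F.normalize(utt_emb, dim=-1)
    resp = F.normalize(resp_emb, dim=-1)
    logits = utt @ resp.T / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0))     # the diagonal is the true pair
    return F.cross_entropy(logits, targets)
```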
Revisiting Graph Meaning Representations through Decoupling Contextual Representation Learning and Structural Information Propagation
results: The results show that GMRs improve relation-extraction performance on three of the four datasets (two English and two Chinese), particularly favoring English thanks to highly accurate parsers. However, GMRs appear less effective in literary-domain datasets than in general-domain ones. These findings can better inform the design of GMRs and parsers for future relation extraction tasks.
Abstract
In the field of natural language understanding, the intersection of neural models and graph meaning representations (GMRs) remains a compelling area of research. Despite the growing interest, a critical gap persists in understanding the exact influence of GMRs, particularly concerning relation extraction tasks. Addressing this, we introduce DAGNN-plus, a simple and parameter-efficient neural architecture designed to decouple contextual representation learning from structural information propagation. Coupled with various sequence encoders and GMRs, this architecture provides a foundation for systematic experimentation on two English and two Chinese datasets. Our empirical analysis utilizes four different graph formalisms and nine parsers. The results yield a nuanced understanding of GMRs, showing improvements in three out of the four datasets, particularly favoring English over Chinese due to highly accurate parsers. Interestingly, GMRs appear less effective in literary-domain datasets compared to general-domain datasets. These findings lay the groundwork for better-informed design of GMRs and parsers to improve relation classification, which is expected to tangibly impact the future trajectory of natural language understanding research.
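A minimal sketch of the decoupling idea: contextual node features come from a sequence encoder, and structural information is then propagated with parameter-free steps over the graph; the normalization and step count are illustrative.

```python
import torch

def propagate(node_feats, adj, steps=3):
    """node_feats: (n, d) encoder outputs; adj: (n, n) adjacency with
    self-loops. Smooths features over the graph without new parameters,
    keeping representation learning and propagation decoupled."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
    norm_adj = adj / deg                  # row-normalized propagation matrix
    h = node_feats
    for _ in range(steps):
        h = norm_adj @ h                  # structure-only propagation step
    return h
```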
Large Language Model-Aware In-Context Learning for Code Generation
results: Experiments show that LAIL improves the code generation of LLMs, outperforming state-of-the-art baselines by 11.58%, 6.89%, and 5.07% with CodeGen, and by 4.38%, 2.85%, and 2.74% with GPT-3.5, in terms of Pass@1.
Abstract
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation. LLMs take a prompt consisting of requirement-code examples and a new requirement as input, and output new programs. Existing studies have found that ICL is highly dominated by the examples and thus arises research on example selection. However, existing approaches randomly select examples or only consider the textual similarity of requirements to retrieve, leading to sub-optimal performance. In this paper, we propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation. Given a candidate example, we exploit LLMs themselves to estimate it by considering the generation probabilities of ground-truth programs given a requirement and the example. We then label candidate examples as positive or negative through the probability feedback. Based on the labeled data, we import a contrastive learning objective to train an effective retriever that acquires the preference of LLMs in code generation. We apply LAIL to three LLMs and evaluate it on three representative datasets (e.g., MBJP, MBPP, and MBCPP). LAIL outperforms the state-of-the-art baselines by 11.58%, 6.89%, and 5.07% on CodeGen, and 4.38%, 2.85%, and 2.74% on GPT-3.5 in terms of Pass@1, respectively.
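A rough sketch of the example-scoring step, under simplifying assumptions: a candidate in-context example is scored by the LLM's log-probability of the ground-truth program given the example and the requirement. The prompt format and model checkpoint are illustrative; labeling examples positive or negative by thresholding these scores would follow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

def score_example(example, requirement, gold_program):
    """Sum of log P(gold token | prefix) over the gold program's tokens."""
    prompt = f"{example}\n{requirement}\n"
    ids = tok(prompt + gold_program, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.size(1)
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # predicts next token
    tgt = ids[0, 1:]
    token_logp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    return token_logp[n_prompt - 1:].sum().item()      # program tokens only
```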
Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining
results: The shared task received 31 submissions for Subtask-A and 21 submissions for Subtask-B from 9 different teams across 6 countries. The best submission achieved an F1-score of 0.8647 on Subtask-A and 0.5561 on Subtask-B.
Abstract
This paper presents an overview of the ImageArg shared task, the first multimodal Argument Mining shared task co-located with the 10th Workshop on Argument Mining at EMNLP 2023. The shared task comprises two classification subtasks - (1) Subtask-A: Argument Stance Classification; (2) Subtask-B: Image Persuasiveness Classification. The former determines the stance of a tweet containing an image and a piece of text toward a controversial topic (e.g., gun control and abortion). The latter determines whether the image makes the tweet text more persuasive. The shared task received 31 submissions for Subtask-A and 21 submissions for Subtask-B from 9 different teams across 6 countries. The top submission in Subtask-A achieved an F1-score of 0.8647 while the best submission in Subtask-B achieved an F1-score of 0.5561.
KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models
results: Extensive experiments show that LLMs achieve impressive performance on straightforward knowledge QA tasks, while settings requiring more complex reasoning or domain-specific knowledge remain challenging. The results suggest that KGQuiz can serve as a testbed for analyzing how LLMs' knowledge abilities and generalization vary across knowledge domains and task formats.
Abstract
Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.
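A toy sketch of turning knowledge triplets into quiz items of increasing complexity, in the spirit of the first two task formats; the templates, facts, and distractor sampling here are illustrative, not the benchmark's generation code.

```python
import random

TRIPLETS = [("Paris", "capital_of", "France"),
            ("Berlin", "capital_of", "Germany"),
            ("Rome", "capital_of", "Italy")]

def verbalize(head, rel, tail):
    return f"{head} is the {rel.replace('_', ' ')} {tail}."

def true_or_false(head, rel, tail):
    """Half the time, corrupt the tail with another entity for a negative."""
    if random.random() < 0.5:
        return verbalize(head, rel, tail), True
    wrong = random.choice([t for *_, t in TRIPLETS if t != tail])
    return verbalize(head, rel, wrong), False

def multiple_choice(head, rel, tail):
    """One correct answer plus distractor tails drawn from other triplets."""
    distractors = [t for *_, t in TRIPLETS if t != tail]
    options = random.sample(distractors, 2) + [tail]
    random.shuffle(options)
    return verbalize(head, rel, "____"), options, tail
```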
HiCL: Hierarchical Contrastive Learning of Unsupervised Sentence Embeddings
results: HiCL improves over the prior top-performing SNCSE model across seven extensively evaluated STS tasks, with average gains of +0.2% (BERT-large) and +0.44% (RoBERTa-large).
Abstract
In this paper, we propose a hierarchical contrastive learning framework, HiCL, which considers local segment-level and global sequence-level relationships to improve training efficiency and effectiveness. Traditional methods typically encode a sequence in its entirety for contrast with others, often neglecting local representation learning, leading to challenges in generalizing to shorter texts. Conversely, HiCL improves its effectiveness by dividing the sequence into several segments and employing both local and global contrastive learning to model segment-level and sequence-level relationships. Further, considering the quadratic time complexity of transformers over input tokens, HiCL boosts training efficiency by first encoding short segments and then aggregating them to obtain the sequence representation. Extensive experiments show that HiCL enhances the prior top-performing SNCSE model across seven extensively evaluated STS tasks, with an average increase of +0.2% observed on BERT-large and +0.44% on RoBERTa-large.
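A minimal sketch of the hierarchical encoding, with a toy encoder standing in for BERT/RoBERTa: segments are encoded independently (cheaper, since self-attention is quadratic in length) and then aggregated into the sequence representation; local and global contrastive losses would be applied to the two returned views.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):            # stands in for a real sentence encoder
    def __init__(self, d=32):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):               # x: (batch, seg_len, d) -> (batch, d)
        return self.proj(x).mean(dim=1)

def hierarchical_encode(token_embs, encoder, seg_len=64):
    """Encode fixed-length segments independently, then aggregate."""
    segments = token_embs.split(seg_len, dim=1)
    seg_embs = torch.stack([encoder(s) for s in segments], dim=1)  # (b, n_seg, d)
    return seg_embs.mean(dim=1), seg_embs   # global view, local views

emb = torch.randn(4, 256, 32)               # toy batch of token embeddings
global_emb, local_embs = hierarchical_encode(emb, ToyEncoder())
```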