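results: Across tasks involving math functions, knowledge graph relations, and complex real-world RESTful APIs, the experiments show that ToolDec eliminates syntax errors entirely and delivers better performance and speedups of up to 2x, without fine-tuning or in-context tool documentation. ToolDec can also select suitable tools it has never seen and generalizes better to new tools.
Abstract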
Large language models (LLMs) have shown promising capabilities in using external tools to solve complex problems. However, existing approaches either involve fine-tuning on tool demonstrations, which do not generalize to new tools without additional training, or providing tool documentation in context, limiting the number of tools. Both approaches often generate syntactically invalid tool calls. In this paper, we propose ToolDec, a finite-state machine-guided decoding algorithm for tool-augmented LLMs. ToolDec eliminates tool-related errors for any tool-augmented LLMs by ensuring valid tool names and type-conforming arguments. Furthermore, ToolDec enables LLM to effectively select tools using only the information contained in their names, with no need for fine-tuning or in-context documentation. We evaluated multiple prior methods and their ToolDec-enhanced versions on a variety of tasks involving tools like math functions, knowledge graph relations, and complex real-world RESTful APIs. Our experiments show that ToolDec reduces syntactic errors to zero, consequently achieving significantly better performance and as much as a 2x speedup. We also show that ToolDec achieves superior generalization performance on unseen tools, performing up to 8x better than the baselines.
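The abstract describes finite-state-machine-guided decoding only at a high level. The following is a minimal sketch of the idea for tool names, assuming character-level tokens and a trie as the state machine. `build_trie`, `constrained_decode`, `toy_scores`, and the `<end>` marker are hypothetical names for illustration, not ToolDec's actual implementation, and argument typing is not covered.

```python
from typing import Callable, Dict, List

END = "<end>"  # hypothetical marker: a complete tool name ends here

def build_trie(names: List[str]) -> Dict:
    """Character-level trie; every root-to-END path spells one valid tool name."""
    root: Dict = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def constrained_decode(scores_fn: Callable[[str], Dict[str, float]], trie: Dict) -> str:
    """Greedy decoding in which only tokens permitted by the current trie state can be chosen."""
    node, out = trie, []
    while True:
        scores = scores_fn("".join(out))
        # Tokens the model did not score are treated as -inf, but stay selectable
        # when they are the only valid continuation.
        best = max(node.keys(), key=lambda tok: scores.get(tok, float("-inf")))
        if best == END:
            return "".join(out)
        out.append(best)
        node = node[best]

def toy_scores(prefix: str) -> Dict[str, float]:
    """Stand-in for a language model that wants to emit the invalid name 'multiple'."""
    wanted = "multiple"
    i = len(prefix)
    return {wanted[i]: 1.0} if i < len(wanted) else {}

tools = ["add", "subtract", "multiply"]
print(constrained_decode(toy_scores, build_trie(tools)))
```

Left unconstrained, the toy model would spell a name that is not a tool. With the trie, its last character is forced onto the only valid continuation, so the output is always one of the registered names.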
results: On numerical reasoning and relational reasoning tasks, HtT improves existing prompting methods, with an absolute accuracy gain of 11-27%.
Abstract
When prompted with a few examples and intermediate steps, large language models (LLMs) have demonstrated impressive performance in various reasoning tasks. However, prompting methods that rely on implicit knowledge in an LLM often hallucinate incorrect answers when the implicit knowledge is wrong or inconsistent with the task. To tackle this problem, we present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with LLMs. HtT contains two stages, an induction stage and a deduction stage. In the induction stage, an LLM is first asked to generate and verify rules over a set of training examples. Rules that appear and lead to correct answers sufficiently often are collected to form a rule library. In the deduction stage, the LLM is then prompted to employ the learned rule library to perform reasoning to answer test questions. Experiments on both numerical reasoning and relational reasoning problems show that HtT improves existing prompting methods, with an absolute gain of 11-27% in accuracy. The learned rules are also transferable to different models and to different forms of the same problem.
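Summary
When given a few examples and intermediate steps, large language models (LLMs) perform impressively on a range of reasoning tasks. However, these prompting methods often hallucinate incorrect answers when the implicit knowledge in the LLM is wrong or inconsistent with the task. To address this, we propose Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with LLMs. HtT has two stages: an induction stage and a deduction stage. In the induction stage, the LLM is asked to generate and verify rules over a set of training examples, and the verified rules are collected into a rule library. In the deduction stage, the LLM is prompted to use the learned rule library to answer test questions. Experiments show that HtT improves existing prompting methods, with an accuracy gain of 11-27%. The learned rules also transfer to different models and to different forms of the same problem.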
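The induction stage keeps rules that "appear and lead to correct answers sufficiently often". A minimal sketch of that filtering step follows. The function name, the thresholds `min_count` and `min_accuracy`, and the toy kinship rules are all assumptions made for illustration; the abstract does not give the paper's actual criteria.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def build_rule_library(
    observations: Iterable[Tuple[str, bool]],
    min_count: int = 2,           # hypothetical threshold on how often a rule appears
    min_accuracy: float = 0.7,    # hypothetical threshold on how often it led to a correct answer
) -> List[str]:
    """Keep rules that occur often enough and usually lead to correct answers."""
    stats: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # rule -> [occurrences, correct]
    for rule, led_to_correct_answer in observations:
        stats[rule][0] += 1
        stats[rule][1] += int(led_to_correct_answer)
    return sorted(
        rule for rule, (n, c) in stats.items()
        if n >= min_count and c / n >= min_accuracy
    )

# Each pair: (rule the LLM generated, whether the final answer that used it was correct).
observations = [
    ("father's sister is aunt", True),
    ("father's sister is aunt", True),
    ("father's sister is mother", True),
    ("father's sister is mother", False),
    ("father's sister is mother", False),
    ("mother's brother is uncle", True),
]
print(build_rule_library(observations))
```

In the deduction stage, the resulting list would be placed in the prompt so that the model looks rules up rather than relying on its implicit knowledge.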
DKEC: Domain Knowledge Enhanced Multi-Label Classification for Electronic Health Records
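methods: The paper introduces two innovations: (1) a label-wise attention mechanism that captures the semantic relationships between medical entities using medical knowledge and domain ontologies, and (2) a simple yet effective group-wise training method that increases the training data for rare classes.
results: Experiments show that the method predicts medical diagnoses better than the previous state of the art, particularly for the few-shot (tail) classes. The paper also studies the applicability of DKEC to different language models and shows that DKEC can help smaller language models achieve performance comparable to large language models.
Abstract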
Multi-label text classification (MLTC) tasks in the medical domain often face long-tail label distribution, where rare classes have fewer training samples than frequent classes. Although previous works have explored different model architectures and hierarchical label structures to find important features, most of them neglect to incorporate the domain knowledge from medical guidelines. In this paper, we present DKEC, Domain Knowledge Enhanced Classifier for medical diagnosis prediction with two innovations: (1) a label-wise attention mechanism that incorporates a heterogeneous graph and domain ontologies to capture the semantic relationships between medical entities, (2) a simple yet effective group-wise training method based on similarity of labels to increase samples of rare classes. We evaluate DKEC on two real-world medical datasets: the RAA dataset, a collection of 4,417 patient care reports from emergency medical services (EMS) incidents, and a subset of 53,898 reports from the MIMIC-III dataset. Experimental results show that our method outperforms the state-of-the-art, particularly for the few-shot (tail) classes. More importantly, we study the applicability of DKEC to different language models and show that DKEC can help the smaller language models achieve comparable performance to large language models.
Computational Pathology at Health System Scale – Self-Supervised Foundation Models from Three Billion Images
paper_authors: Gabriele Campanella, Ricky Kwan, Eugene Fluder, Jennifer Zeng, Aryeh Stock, Brandon Veremis, Alexandros D. Polydorides, Cyrus Hedvat, Adam Schoenfeld, Chad Vanderbilt, Patricia Kovatch, Carlos Cordon-Cardo, Thomas J. Fuchs
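results: The results show that pre-training these self-supervised learning algorithms on large amounts of pathology data improves downstream task performance, and that the DINO algorithm performs better across all tasks.
Abstract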
Recent breakthroughs in self-supervised learning have enabled the use of large unlabeled datasets to train visual foundation models that can generalize to a variety of downstream tasks. While this training paradigm is well suited for the medical domain where annotations are scarce, large-scale pre-training in the medical domain, and in particular pathology, has not been extensively studied. Previous work in self-supervised learning in pathology has leveraged smaller datasets for both pre-training and evaluating downstream performance. The aim of this project is to train the largest academic foundation model and benchmark the most prominent self-supervised learning algorithms by pre-training and evaluating downstream performance on large clinical pathology datasets. We collected the largest pathology dataset to date, consisting of over 3 billion images from over 423 thousand microscopy slides. We compared pre-training of visual transformer models using the masked autoencoder (MAE) and DINO algorithms. We evaluated performance on six clinically relevant tasks from three anatomic sites and two institutions: breast cancer detection, inflammatory bowel disease detection, breast cancer estrogen receptor prediction, lung adenocarcinoma EGFR mutation prediction, and lung cancer immunotherapy response prediction. Our results demonstrate that pre-training on pathology data is beneficial for downstream performance compared to pre-training on natural images. Additionally, the DINO algorithm achieved better generalization performance across all tasks tested. The presented results signify a phase change in computational pathology research, paving the way into a new era of more performant models based on large-scale, parallel pre-training at the billion-image scale.
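Summary
Recent breakthroughs in self-supervised learning make it possible to train visual foundation models on large unlabeled datasets, and these models generalize to a variety of downstream tasks. This training approach suits the medical domain well because annotations are scarce. However, large-scale pre-training in the medical domain, and in pathology in particular, has not been studied extensively. Previous work on self-supervised learning in pathology has typically used small datasets for both pre-training and downstream evaluation. This project aims to train the largest academic foundation model and to benchmark the most prominent self-supervised learning algorithms by pre-training and evaluating downstream performance on large clinical pathology datasets. We collected the largest pathology dataset to date: over 3 billion images from more than 423,000 microscopy slides. We compare pre-training of vision transformer models with the MAE and DINO algorithms. We evaluate on six clinically relevant tasks from three anatomic sites and two institutions: breast cancer detection, inflammatory bowel disease detection, breast cancer estrogen receptor prediction, lung adenocarcinoma EGFR mutation prediction, and lung cancer immunotherapy response prediction. Our results show that pre-training on pathology data benefits downstream performance more than pre-training on natural images. In addition, the DINO algorithm achieved better generalization across all tasks. These results mark the start of a new era in computational pathology research, in which more performant models are built on large-scale, parallel pre-training at the billion-image scale.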
Facial Forgery-based Deepfake Detection using Fine-Grained Features
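results: Extensive experimental validation demonstrates the superiority of the method over published research in cross-dataset and cross-manipulation detection for the majority of scenarios.
Abstract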
Facial forgery by deepfakes has caused major security risks and raised severe societal concerns. As a countermeasure, a number of deepfake detection methods have been proposed. Most of them model deepfake detection as a binary classification problem using a backbone convolutional neural network (CNN) architecture pretrained for the task. These CNN-based methods have demonstrated very high efficacy in deepfake detection with the Area under the Curve (AUC) as high as $0.99$. However, the performance of these methods degrades significantly when evaluated across datasets and deepfake manipulation techniques. This draws our attention towards learning more subtle, local, and discriminative features for deepfake detection. In this paper, we formulate deepfake detection as a fine-grained classification problem and propose a new fine-grained solution to it. Specifically, our method is based on learning subtle and generalizable features by effectively suppressing background noise and learning discriminative features at various scales for deepfake detection. Through extensive experimental validation, we demonstrate the superiority of our method over the published research in cross-dataset and cross-manipulation generalization of deepfake detectors for the majority of the experimental scenarios.
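Summary
Facial forgery by deepfakes has created serious security risks and societal concerns. As a countermeasure, a number of deepfake detection methods have been proposed. Most of them model deepfake detection as a binary classification problem using a pretrained CNN backbone. These CNN-based methods have shown very high efficacy in deepfake detection, with an AUC as high as 0.99. However, their performance degrades significantly across datasets and manipulation techniques. This points to the need for learning more subtle, local, and discriminative features. In this paper, we formulate deepfake detection as a fine-grained classification problem and propose a new fine-grained solution. Specifically, our method learns subtle and generalizable features by effectively suppressing background noise and learning discriminative features at multiple scales. Through extensive experimental validation, we demonstrate the superiority of our method in cross-dataset and cross-manipulation generalization.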
NEWTON: Are Large Language Models Capable of Physical Reasoning?
paper_authors: Yi Ru Wang, Jiafei Duan, Dieter Fox, Siddhartha Srinivasa
for: The paper aims to evaluate the physical reasoning abilities of large language models (LLMs) and provide a benchmark for assessing their performance in this area.
methods: The paper introduces a new repository and benchmark called NEWTON, which includes a collection of object-attribute pairs and 160,000 question-answer pairs to test the physical reasoning capabilities of LLMs. The authors also present a pipeline for generating customized benchmarks for specific applications.
results: The authors find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans. The paper highlights the potential of the NEWTON platform for evaluating and enhancing language models for physically grounded settings, such as robotic manipulation.
Abstract
Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physics reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline to enable researchers to generate a variant of this benchmark that has been customized to the objects and attributes relevant for their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA questions, curated using the NEWTON repository to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings, such as robotic manipulation. Project site: https://newtonreasoning.github.io
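Summary
Large language models (LLMs), through their contextualized representations, have been empirically shown to capture syntactic, semantic, word-sense, and common-sense knowledge. However, their physical reasoning abilities remain little explored, particularly with respect to the key attributes of everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physical reasoning abilities of language models. We also provide a pipeline that lets researchers generate a variant of the NEWTON benchmark customized to their application domain. The NEWTON repository contains 2800 object-attribute pairs, which provide the basis for generating an unlimited number of assessment templates. The NEWTON benchmark consists of 160,000 questions, curated using the NEWTON repository, to probe the physical reasoning capabilities of language models. Our experiments show that LLMs such as GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but are considerably less consistent than humans in object-attribute reasoning (50% vs. 84%). The NEWTON platform also shows potential for evaluating and improving language models for physically grounded settings such as robotic manipulation. Project site: https://newtonreasoning.github.io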
Answer Candidate Type Selection: Text-to-Text Language Model for Closed Book Question Answering Meets Knowledge Graphs
results: The method improves the answer quality of the pre-trained question answering system, especially for questions with less popular entities.
Abstract
Pre-trained Text-to-Text Language Models (LMs), such as T5 or BART yield promising results in the Knowledge Graph Question Answering (KGQA) task. However, the capacity of the models is limited and the quality decreases for questions with less popular entities. In this paper, we present a novel approach which works on top of the pre-trained Text-to-Text QA system to address this issue. Our simple yet effective method performs filtering and re-ranking of generated candidates based on their types derived from Wikidata "instance_of" property.
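Summary
Pre-trained text-to-text language models (LMs), such as T5 or BART, yield promising results on the knowledge graph question answering task. However, model capacity is limited, and quality decreases for questions with less popular entities. In this paper, we propose a new method that works on top of a pre-trained text-to-text QA system and filters and re-ranks the generated answer candidates. Our simple yet effective method derives the candidates' types from the Wikidata "instance_of" property.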
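The abstract says candidates are filtered and re-ranked by their "instance_of" types but does not say how the expected type is chosen. The sketch below assumes one plausible reading: take the most common type among the generated candidates as the expected answer type and move matching candidates to the front. The function name, the `top_k_types` parameter, and the hard-coded type table are hypothetical; a real system would look types up in Wikidata.

```python
from collections import Counter
from typing import Dict, List, Sequence

def rerank_by_type(
    candidates: Sequence[str],
    types_of: Dict[str, List[str]],   # candidate -> its "instance_of" types (toy lookup table)
    top_k_types: int = 1,             # hypothetical: how many frequent types count as "expected"
) -> List[str]:
    """Put candidates whose type matches the most frequent candidate type(s) first."""
    counts = Counter(t for c in candidates for t in types_of.get(c, []))
    expected = {t for t, _ in counts.most_common(top_k_types)}
    matching = [c for c in candidates if expected & set(types_of.get(c, []))]
    rest = [c for c in candidates if c not in matching]
    # Generation order is preserved within each group; dropping `rest` gives pure filtering.
    return matching + rest

candidates = ["Paris (mythology)", "Paris", "Lyon", "Marseille"]
types_of = {
    "Paris (mythology)": ["mythological character"],
    "Paris": ["city"],
    "Lyon": ["city"],
    "Marseille": ["city"],
}
print(rerank_by_type(candidates, types_of))
```

Here the top-generated candidate has the wrong type for a question about a city, so it is demoted below the three candidates that share the majority type.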
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
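results: The experiments show that varying the generation method raises the model's misalignment rate to more than 95%, at 30x lower computational cost than previous attacks.
Abstract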
The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.
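Summary
The rapid progress of open-source large language models (LLMs) is advancing AI development. Before models are released, extensive effort goes into aligning their behavior with human values, with the primary goal of ensuring helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often called adversarial prompts. In this work, we propose the generation exploitation attack, a very simple approach that disrupts model alignment by manipulating only variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we raise the misalignment rate from 0% to more than 95% across 11 language models, including the LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks. Finally, we propose an effective alignment method that reduces the misalignment rate under our attack. Overall, our study shows a major failure in current safety evaluation and alignment procedures for open-source LLMs and strongly advocates more comprehensive red teaming before release, to ensure that models are helpful and harmless. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.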
On the Interpretability of Part-Prototype Based Classifiers: A Human Centric Analysis
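results: The experiments show that the framework can assess the interpretability of different kinds of part-prototype networks, which serve as an interpretable alternative to existing black-box image classifiers.
Abstract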
Part-prototype networks have recently become methods of interest as an interpretable alternative to many of the current black-box image classifiers. However, the interpretability of these methods from the perspective of human users has not been sufficiently explored. In this work, we have devised a framework for evaluating the interpretability of part-prototype-based models from a human perspective. The proposed framework consists of three actionable metrics and experiments. To demonstrate the usefulness of our framework, we performed an extensive set of experiments using Amazon Mechanical Turk. They not only show the capability of our framework in assessing the interpretability of various part-prototype-based models, but they also are, to the best of our knowledge, the most comprehensive work on evaluating such methods in a unified framework.
Summary
Part-prototype networks have become an interpretable alternative to many current black-box image classifiers. However, the interpretability of these methods for human users has not been sufficiently explored. In this work, we propose a framework for evaluating the interpretability of part-prototype-based models. The framework consists of three actionable metrics and experiments. To demonstrate its usefulness, we ran an extensive set of experiments on Amazon Mechanical Turk. These experiments not only show that our framework can assess the interpretability of various part-prototype-based models, but are also, to the best of our knowledge, the most comprehensive evaluation of such methods in a unified framework.
Sparse Fine-tuning for Inference Acceleration of Large Language Models
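results: The study shows that sparse LLMs run faster on both CPU and GPU runtimes while preserving accuracy. For memory-bound LLMs, sparsity can also be used to reduce memory bandwidth. The study presents end-to-end results on T5 (language translation), Whisper (speech translation), and open GPT-type models (text generation), and shows that for MPT text generation sparsity can reach 75% without affecting accuracy.
Abstract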
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based fine-tuning may fail to recover accuracy, especially at high sparsities. To address this, we perform a detailed study of distillation-type losses, determining an L2-based distillation approach we term SquareHead which enables accurate recovery even at higher sparsities, across all model types. On the practical efficiency side, we show that sparse LLMs can be executed with speedups by taking advantage of sparsity, for both CPU and GPU runtimes. While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth. We exhibit end-to-end results showing speedups due to sparsity, while recovering accuracy, on T5 (language translation), Whisper (speech translation), and open GPT-type (MPT for text generation). For MPT text generation, we show for the first time that sparse fine-tuning can reach 75% sparsity without accuracy drops, provide notable end-to-end speedups for both CPU and GPU inference, and highlight that sparsity is also compatible with quantization approaches. Models and software for reproducing our results are provided in Section 6.
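Summary
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks while inducing sparsity in their weights. On the accuracy side, we find that fine-tuning with a standard loss may fail to recover accuracy, especially at high sparsity. To address this, we carry out a detailed study and identify an L2-based distillation loss, which we call SquareHead, that enables accurate recovery even at high sparsity. On the practical efficiency side, we find that sparse LLMs can exploit sparsity to run faster on both CPU and GPU. While the standard approach uses sparsity to reduce computation, we find that in memory-bound LLMs sparsity can also reduce memory bandwidth. We present end-to-end results showing speedups due to sparsity, with accuracy preserved, on T5 (language translation), Whisper (speech translation), and open GPT-type models (MPT for text generation). For MPT text generation, we find that sparse fine-tuning can reach 75% sparsity without harming accuracy, with notable speedups for both CPU and GPU inference. We also find that sparsity is compatible with quantization approaches. Models and software for reproducing our results are provided in Section 6.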
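The abstract describes SquareHead only as an "L2-based distillation approach". The sketch below shows one way such a loss could look: a per-layer mean squared error between teacher and student features, normalized by the teacher's feature magnitude so that layers with large activations do not dominate. The normalization, the `eps` constant, and the function names are assumptions of this sketch and are not taken from the abstract.

```python
from typing import List, Sequence

def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def l2_feature_distillation(
    teacher_layers: List[List[float]],
    student_layers: List[List[float]],
    eps: float = 1e-6,   # hypothetical constant to avoid division by zero
) -> float:
    """Average over layers of MSE(teacher, student) / MSE(teacher, 0)."""
    total = 0.0
    for t, s in zip(teacher_layers, student_layers):
        zeros = [0.0] * len(t)
        total += mse(t, s) / (mse(t, zeros) + eps)
    return total / len(teacher_layers)

teacher = [[1.0, 2.0], [3.0, 4.0]]       # dense model features, one vector per layer
student = [[1.0, 1.0], [3.0, 3.0]]       # sparse model features at the same layers
print(l2_feature_distillation(teacher, student))
```

With this normalization, a student that matches the teacher exactly scores 0, and a student that outputs all zeros scores close to 1, regardless of the scale of each layer.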
PICProp: Physics-Informed Confidence Propagation for Uncertainty Quantification
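results: The paper demonstrates the effectiveness of the method through computational experiments and provides a theorem establishing its validity.
Abstract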
Standard approaches for uncertainty quantification in deep learning and physics-informed learning have persistent limitations. Indicatively, strong assumptions regarding the data likelihood are required, the performance highly depends on the selection of priors, and the posterior can be sampled only approximately, which leads to poor approximations because of the associated computational cost. This paper introduces and studies confidence interval (CI) estimation for deterministic partial differential equations as a novel problem. That is, to propagate confidence, in the form of CIs, from data locations to the entire domain with probabilistic guarantees. We propose a method, termed Physics-Informed Confidence Propagation (PICProp), based on bi-level optimization to compute a valid CI without making heavy assumptions. We provide a theorem regarding the validity of our method, and computational experiments, where the focus is on physics-informed learning.
Summary
Standard approaches to uncertainty quantification in deep learning and physics-informed learning have persistent limitations. For example, they require strong assumptions about the data likelihood, depend heavily on the choice of priors, and can sample the posterior only approximately, which leads to poor approximations because of the computational cost. This paper introduces and studies confidence interval (CI) estimation, that is, propagating confidence from data locations to the entire domain with probabilistic guarantees. The proposed method, called Physics-Informed Confidence Propagation (PICProp), is based on bi-level optimization and can compute a valid CI without heavy assumptions. The paper provides a theorem on the validity of the method and reports computational experiments focused on physics-informed learning.
Distributed Transfer Learning with 4th Gen Intel Xeon Processors
results: Using Intel Xeon processors, AMX, and distributed training, the researchers achieved near state-of-the-art accuracy on an image classification task.
Abstract
In this paper, we explore how transfer learning, coupled with Intel Xeon, specifically 4th Gen Intel Xeon scalable processor, defies the conventional belief that training is primarily GPU-dependent. We present a case study where we achieved near state-of-the-art accuracy for image classification on a publicly available Image Classification TensorFlow dataset using Intel Advanced Matrix Extensions(AMX) and distributed training with Horovod.
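Summary
In this paper, we explore how transfer learning, specifically on 4th Gen Intel Xeon Scalable processors, overturns the conventional belief that training is primarily GPU-dependent. We present a case study in which we achieved near state-of-the-art accuracy for image classification on a publicly available TensorFlow image classification dataset, using Intel Advanced Matrix Extensions (AMX) and distributed training with Horovod.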
Reinforcement Learning in a Safety-Embedded MDP with Trajectory Optimization
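results: The method performs strongly on Safety Gym tasks, achieving significantly higher rewards and near-zero safety violations during inference. Deployment on a real robot box-pushing task demonstrates its real-world applicability.
Abstract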
Safe Reinforcement Learning (RL) plays an important role in applying RL algorithms to safety-critical real-world applications, addressing the trade-off between maximizing rewards and adhering to safety constraints. This work introduces a novel approach that combines RL with trajectory optimization to manage this trade-off effectively. Our approach embeds safety constraints within the action space of a modified Markov Decision Process (MDP). The RL agent produces a sequence of actions that are transformed into safe trajectories by a trajectory optimizer, thereby effectively ensuring safety and increasing training stability. This novel approach excels in its performance on challenging Safety Gym tasks, achieving significantly higher rewards and near-zero safety violations during inference. The method's real-world applicability is demonstrated through a safe and effective deployment in a real robot task of box-pushing around obstacles.
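Summary
Safe reinforcement learning (RL) plays an important role in applying RL algorithms to safety-critical real-world applications, addressing the trade-off between maximizing rewards and adhering to safety constraints. This work introduces a new approach that combines RL with trajectory optimization to manage this trade-off. Our approach embeds safety constraints in the action space of a modified Markov Decision Process (MDP). The RL agent produces a sequence of actions, which a trajectory optimizer transforms into safe trajectories, thereby ensuring safety and improving training stability. The approach performs strongly on Safety Gym tasks, achieving significantly higher rewards and near-zero safety violations during inference. We also demonstrate its real-world applicability on a real robot task: pushing a box around obstacles.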
Scalable Semantic Non-Markovian Simulation Proxy for Reinforcement Learning
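results: Compared with two high-fidelity simulators, the approach runs up to three orders of magnitude faster while preserving the quality of the learned policy. It can also model and leverage non-Markovian dynamics and instantaneous actions, and it provides an explainable trace describing the outcomes of the agent's actions.
Abstract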
Recent advances in reinforcement learning (RL) have shown much promise across a variety of applications. However, issues such as scalability, explainability, and Markovian assumptions limit its applicability in certain domains. We observe that many of these shortcomings emanate from the simulator as opposed to the RL training algorithms themselves. As such, we propose a semantic proxy for simulation based on a temporal extension to annotated logic. In comparison with two high-fidelity simulators, we show up to three orders of magnitude speed-up while preserving the quality of policy learned. In addition, we show the ability to model and leverage non-Markovian dynamics and instantaneous actions while providing an explainable trace describing the outcomes of the agent actions.
paper_authors: Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
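results: Mistral 7B outperforms Llama 2 13B on all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. The instruction-tuned Mistral 7B -- Instruct model also surpasses Llama 2 13B -- Chat on both human and automated benchmarks.
Abstract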
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
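Summary
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B on all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. The model uses grouped-query attention (GQA) for faster inference, together with sliding window attention (SWA) to handle sequences of arbitrary length at a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, which surpasses the Llama 2 13B -- Chat model on both human and automated benchmarks. Our models are released under the Apache 2.0 license.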
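To make the sliding window attention idea concrete, the following is a minimal sketch of a causal attention mask restricted to a fixed window. It is not taken from the Mistral code base: the function name, the toy sizes, and the convention that the window includes the current token are all assumptions made for illustration.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumption for this sketch: attention is causal (j <= i) and the window
    counts the current token, so each query sees at most `window` keys.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask as a grid: 'x' = allowed, '.' = masked out.
    for row in sliding_window_mask(seq_len=5, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Because each row contains at most `window` allowed positions, the per-token attention cost in this sketch is bounded by the window size rather than by the full sequence length.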
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
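methods: The paper trains probes to detect whether an LLM is outputting true or false statements and draws on three lines of evidence to study how LLMs represent truth: 1. Visualizations of LLM representations of true/false statements, which reveal clear linear structure. 2. Transfer experiments, in which probes trained on one dataset are applied to different datasets. 3. Causal evidence, obtained by intervening in the LLM's forward pass so that it treats false statements as true and vice versa.
results: The paper finds that language models linearly represent the truth or falsehood of statements. It also introduces a new probing technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
Abstract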
Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
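As a rough illustration of the mass-mean probing idea, the sketch below takes the probe direction to be the difference between the mean activation of true statements and the mean activation of false statements. The abstract does not spell out the construction, so this reading, the midpoint used as a decision threshold, and the two-dimensional toy "activations" are all assumptions.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mass_mean_direction(true_acts, false_acts):
    """Probe direction: mean(true activations) - mean(false activations)."""
    mu_true, mu_false = mean(true_acts), mean(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]


def probe_score(direction, midpoint, x):
    """Signed score; positive means 'closer to the true-statement mean'."""
    return sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))


# Hypothetical 2-D activations, invented purely for illustration.
true_acts = [[1.0, 0.2], [0.8, -0.1]]
false_acts = [[-1.0, 0.1], [-0.9, 0.0]]

direction = mass_mean_direction(true_acts, false_acts)
# Assumption: threshold halfway between the two class means.
midpoint = [(t + f) / 2 for t, f in zip(mean(true_acts), mean(false_acts))]

print(probe_score(direction, midpoint, [0.7, 0.3]) > 0)   # treated as "true"
print(probe_score(direction, midpoint, [-0.6, 0.3]) > 0)  # treated as "false"
```

No training loop is involved in this sketch: the direction comes directly from two class means, which is what distinguishes it from a probe fitted by gradient descent.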
results: Experiments show that NECO achieves state-of-the-art results on both small- and large-scale OOD detection tasks and generalizes well across different network architectures.
Abstract
Detecting out-of-distribution (OOD) data is a critical challenge in machine learning due to model overconfidence, often without awareness of their epistemological limits. We hypothesize that ``neural collapse'', a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of ``neural collapse'' and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. We plan to release the code after the anonymity period.
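Summary
Detecting out-of-distribution (OOD) data is a critical challenge in machine learning because models are overconfident and often unaware of their epistemological limits. We hypothesize that "neural collapse", a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a new post-hoc method for OOD detection that uses the geometric properties of "neural collapse" and of principal component spaces to identify OOD data. Our extensive experiments show that NECO achieves state-of-the-art results on both small- and large-scale OOD detection tasks and generalizes well across different network architectures. We also provide a theoretical explanation for the effectiveness of the method in OOD detection. We plan to release the code after the anonymity period.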
Advancing Transformer’s Capabilities in Commonsense Reasoning
methods: Introduces current ML-based methods, including knowledge transfer, model ensembling, and an additional pairwise contrastive objective.
results: The best model outperforms the strongest previous works by ~15% absolute in Pairwise Accuracy and ~8.7% absolute in Standard Accuracy.
Abstract
Recent advances in general purpose pre-trained language models have shown great potential in commonsense reasoning. However, current works still perform poorly on standard commonsense reasoning benchmarks including the Com2Sense Dataset. We argue that this is due to a disconnect with current cutting-edge machine learning methods. In this work, we aim to bridge the gap by introducing current ML-based methods to improve general purpose pre-trained language models in the task of commonsense reasoning. Specifically, we experiment with and systematically evaluate methods including knowledge transfer, model ensemble, and introducing an additional pairwise contrastive objective. Our best model outperforms the strongest previous works by ~15\% absolute gains in Pairwise Accuracy and ~8.7\% absolute gains in Standard Accuracy.
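Summary
Recent advances in general-purpose pre-trained language models have shown great potential in commonsense reasoning, but current works still perform poorly on standard commonsense reasoning benchmarks. We argue that this is due to a disconnect with current cutting-edge machine learning methods. In this work, we aim to bridge the gap by introducing current ML-based methods to improve general-purpose pre-trained language models on commonsense reasoning. Specifically, we experiment with and systematically evaluate knowledge transfer, model ensembling, and an additional pairwise contrastive objective. Our best model outperforms the strongest previous works by about 15% absolute in Pairwise Accuracy and about 8.7% absolute in Standard Accuracy.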
$f$-Policy Gradients: A General Framework for Goal Conditioned RL using $f$-Divergences
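methods: The paper proposes a new way to encourage exploration called $f$-Policy Gradients ($f$-PG), which minimizes the f-divergence between the agent's state visitation distribution and the goal. The authors derive gradients for various f-divergences to optimize this objective.
results: Experiments show that $f$-PG outperforms standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments.
Abstract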
Goal-Conditioned Reinforcement Learning (RL) problems often have access to sparse rewards where the agent receives a reward signal only when it has achieved the goal, making policy optimization a difficult problem. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent works have demonstrated that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG minimizes the f-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various f-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse reward settings. We further introduce an entropy-regularized policy optimization objective, that we call $state$-MaxEnt RL (or $s$-MaxEnt RL) as a special case of our objective. We show that several metric-based shaping rewards like L2 can be used with $s$-MaxEnt RL, providing a common ground to study such metric-based shaping rewards with efficient exploration. We find that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments. More information on our website https://agarwalsiddhant10.github.io/projects/fpg.html.
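Summary
Goal-conditioned reinforcement learning (RL) problems often have sparse rewards: the agent receives a reward signal only when it has achieved the goal, which makes policy optimization difficult. Several works augment the sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent work has shown that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a new way to encourage exploration called $f$-Policy Gradients ($f$-PG). $f$-PG minimizes the f-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various f-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse-reward settings. We further introduce an entropy-regularized policy optimization objective, called $state$-MaxEnt RL (or $s$-MaxEnt RL), as a special case of our objective. We show that several metric-based shaping rewards such as L2 can be used with $s$-MaxEnt RL, providing a common ground for studying such rewards with efficient exploration. We find that $f$-PG performs better than standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments. More information is available on our website: https://agarwalsiddhant10.github.io/projects/fpg.html.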
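For readers unfamiliar with f-divergences, the sketch below evaluates the standard definition $D_f(P \| Q) = \sum_x q(x)\, f(p(x)/q(x))$ on two small discrete distributions. It only illustrates the quantity being minimized, not the paper's gradient estimator; the three-state "visitation" and "goal" distributions are invented, and the sketch assumes the goal distribution is strictly positive.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(t):
    """f(t) = t log t, which yields KL(P || Q); f(0) is taken as 0."""
    return t * math.log(t) if t > 0 else 0.0


def tv_generator(t):
    """f(t) = |t - 1| / 2, which yields the total variation distance."""
    return 0.5 * abs(t - 1.0)


# Hypothetical distributions over three states (illustration only).
visitation = [0.7, 0.2, 0.1]   # where the agent currently spends its time
goal = [0.1, 0.2, 0.7]         # a smoothed distribution around the goal state

print(f_divergence(visitation, goal, kl_generator))
print(f_divergence(visitation, goal, tv_generator))
```

Both divergences are zero only when the visitation distribution matches the goal distribution, which is why driving them down gives a dense signal even when the environment reward is sparse.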
OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
results: 1.4B-parameter language models trained on OpenWebMath outperform models trained on over 20x as much general language data.
Abstract
There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.
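Summary
There is growing evidence that pretraining on high-quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model fine-tuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open-source web datasets use preprocessing that does not faithfully preserve mathematical notation, the benefits of large-scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works, containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. In small-scale experiments, models trained on 14.7B tokens of our dataset surpass models trained on far more general language data. We hope that the dataset, openly released on the Hugging Face Hub, will help advance the reasoning abilities of large language models.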
A Supervised Embedding and Clustering Anomaly Detection method for classification of Mobile Network Faults
paper_authors: R. Mosayebi, H. Kia, A. Kianpour Raki
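for: The paper aims to efficiently identify faulty alarm logs in a mobile network, easing the manual monitoring burden caused by the growing volume of alarm logs and helping operators find and resolve problems faster.
methods: Supervised Embedding and Clustering Anomaly Detection (SEMC-AD) uses historical alarm logs and their labels to extract a numerical representation of each log, addressing the imbalanced classification caused by the small proportion of anomalies in the dataset without using one-hot encoding.
results: SEMC-AD detects 99% of anomalies, whereas random forest and XGBoost detect only 86% and 81%, respectively. SEMC-AD is more efficient on datasets with numerous categorical features, which reduces the burden on network operators.
Abstract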
The paper introduces Supervised Embedding and Clustering Anomaly Detection (SEMC-AD), a method designed to efficiently identify faulty alarm logs in a mobile network and alleviate the challenges of manual monitoring caused by the growing volume of alarm logs. SEMC-AD employs a supervised embedding approach based on deep neural networks, utilizing historical alarm logs and their labels to extract numerical representations for each log, effectively addressing the issue of imbalanced classification due to a small proportion of anomalies in the dataset without employing one-hot encoding. The robustness of the embedding is evaluated by plotting the two most significant principal components of the embedded alarm logs, revealing that anomalies form distinct clusters with similar embeddings. Multivariate Gaussian clustering is then applied to these components, identifying clusters with a high ratio of anomalies to normal alarms (above 90%) and labeling them as the anomaly group. To classify new alarm logs, we check if their embedded vectors' two most significant principal components fall within the anomaly-labeled clusters. If so, the log is classified as an anomaly. Performance evaluation demonstrates that SEMC-AD outperforms conventional random forest and gradient boosting methods without embedding. SEMC-AD achieves 99% anomaly detection, whereas random forest and XGBoost only detect 86% and 81% of anomalies, respectively. While supervised classification methods may excel in labeled datasets, the results demonstrate that SEMC-AD is more efficient in classifying anomalies in datasets with numerous categorical features, significantly enhancing anomaly detection, reducing operator burden, and improving network maintenance.
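Summary
The paper introduces Supervised Embedding and Clustering Anomaly Detection (SEMC-AD), a method for efficiently identifying faulty alarm logs in a mobile network and easing the manual monitoring burden caused by the growing volume of alarm logs. SEMC-AD uses a supervised embedding approach based on deep neural networks: historical alarm logs and their labels are used to extract a numerical representation of each log, which addresses the imbalanced classification caused by the small proportion of anomalies in the dataset without using one-hot encoding. Multivariate Gaussian clustering is then applied to the two most significant principal components of the embeddings to identify the clusters of anomalous alarm logs. To classify a new alarm log, we check whether the two most significant principal components of its embedded vector fall within the anomaly-labeled clusters; if so, the log is classified as an anomaly. Performance evaluation shows that SEMC-AD outperforms random forest and XGBoost without embedding: SEMC-AD detects 99% of anomalies, whereas random forest and XGBoost detect only 86% and 81%, respectively. While supervised classification methods may excel on labeled datasets, the results show that SEMC-AD classifies anomalies more efficiently in datasets with numerous categorical features, improving anomaly detection, reducing operator burden, and improving network maintenance.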
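The final classification step can be sketched as follows. This is only an illustration under stated assumptions: the embedding network and the PCA projection are taken as given, each cluster is summarized by a 2-D mean and covariance, "falls within a cluster" is interpreted as a squared Mahalanobis distance below a hypothetical threshold, and the 90% rule is read as the fraction of anomalies among the alarms in a cluster. None of these specifics come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    mean: tuple          # (pc1, pc2) centre of the cluster
    cov: tuple           # 2x2 covariance as ((a, b), (c, d))
    n_anomalies: int     # labelled anomalies assigned to this cluster
    n_normal: int        # labelled normal alarms assigned to this cluster

    @property
    def is_anomaly_cluster(self) -> bool:
        # Assumed reading of the "above 90%" rule from the abstract.
        total = self.n_anomalies + self.n_normal
        return total > 0 and self.n_anomalies / total > 0.9


def mahalanobis_sq(point, cluster):
    """Squared Mahalanobis distance of a 2-D point from a cluster."""
    (a, b), (c, d) = cluster.cov
    det = a * d - b * c
    dx = point[0] - cluster.mean[0]
    dy = point[1] - cluster.mean[1]
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def classify(point, clusters, max_dist_sq=9.0):
    """Label a new log's (pc1, pc2) as 'anomaly' if it lies in an anomaly cluster."""
    for cluster in clusters:
        if cluster.is_anomaly_cluster and mahalanobis_sq(point, cluster) <= max_dist_sq:
            return "anomaly"
    return "normal"


identity = ((1.0, 0.0), (0.0, 1.0))
clusters = [
    Cluster(mean=(5.0, 5.0), cov=identity, n_anomalies=95, n_normal=5),
    Cluster(mean=(0.0, 0.0), cov=identity, n_anomalies=2, n_normal=98),
]
print(classify((5.5, 4.5), clusters))
print(classify((0.1, 0.2), clusters))
```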
Correlated Noise Provably Beats Independent Noise for Differentially Private Learning
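results: Compared with vanilla DP-SGD, correlated noise provably improves learning utility, and the analytical expression for the near-optimal correlation function avoids the cubic complexity of earlier approaches. Experiments validate the theory.
Abstract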
Differentially private learning algorithms inject noise into the learning process. While the most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration, recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve their utility. We characterize the asymptotic learning utility for any choice of the correlation function, giving precise analytical bounds for linear regression and as the solution to a convex program for general convex functions. We show, using these bounds, how correlated noise provably improves upon vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function circumvents the cubic complexity of the semi-definite program used to optimize the noise correlation matrix in previous work. We validate our theory with experiments on private deep learning. Our work matches or outperforms prior work while being efficient both in terms of compute and memory.
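Summary
Differentially private learning algorithms inject noise into the learning process. The most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration, but recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve utility. We characterize the asymptotic learning utility for any choice of the correlation function, giving precise analytical bounds for linear regression and, for general convex functions, bounds expressed as the solution to a convex program. Using these bounds, we show how correlated noise provably improves on vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function avoids the cubic complexity of the semi-definite program used to optimize the noise correlation matrix in previous work. We validate our theory with experiments on private deep learning. Our work matches or outperforms prior work while being efficient in both compute and memory.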
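The difference between independent and correlated noise can be illustrated with a small sketch in which the noise injected at step t is a fixed linear combination of the current and earlier independent Gaussian draws. The coefficient values below are hypothetical and are not the near-optimal correlation function derived in the paper; the sketch also says nothing about calibrating the noise scale to a privacy budget.

```python
import random


def correlated_noise(z, beta):
    """Mix i.i.d. draws z_0..z_{T-1} into w_t = sum_k beta[k] * z[t - k].

    With beta == [1.0] the output equals the input, i.e. independent noise
    as in vanilla DP-SGD. Longer beta introduces correlation across steps.
    """
    out = []
    for t in range(len(z)):
        last = min(t, len(beta) - 1)
        out.append(sum(beta[k] * z[t - k] for k in range(last + 1)))
    return out


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(5)]   # independent Gaussian draws

independent = correlated_noise(z, beta=[1.0])
anticorrelated = correlated_noise(z, beta=[1.0, -0.5])  # hypothetical coefficients

print(independent)
print(anticorrelated)
```

With a negative second coefficient, part of the noise added at one step is subtracted again at the next, so the accumulated noise in the iterates can grow more slowly than with independent draws.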
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
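results: Given a codebase and an issue description, both state-of-the-art proprietary models and the authors' fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve only 4.8% and 1.7% of instances, respectively.
Abstract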
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere $4.8$% and $1.7$% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
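Summary
Language models have outpaced our ability to evaluate them effectively, but studying the frontier of their capabilities is essential for their future development. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase and a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and files, which calls for models to interact with execution environments, process extremely long contexts, and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve only 4.8% and 1.7% of instances, respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.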
results: Extensive experiments show that $\mathbf{FABind}$ predicts protein-ligand binding structures more accurately and faster than existing methods.
Abstract
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed $\mathbf{FABind}$ demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at $\href{https://github.com/QizhiPei/FABind}{Github}$.
Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory
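results: Functional equivalence is widespread among the different features learned by the same neural network, and the Iterative Feature Merging (IFM) algorithm can reduce the number of network parameters without affecting performance. The paper also reports several interesting empirical findings, such as the relationship between feature complexity and network performance.
Abstract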
The behavior of neural networks remains opaque, and a recently widely noted phenomenon is that networks often achieve similar performance when initialized with different random parameters. This phenomenon has attracted significant attention in measuring the similarity between features learned by distinct networks. However, feature similarity could be vague in describing the same feature since equivalent features hardly exist. In this paper, we expand the concept of equivalent feature and provide the definition of what we call functionally equivalent features. These features produce equivalent output under certain transformations. Using this definition, we aim to derive a more intrinsic metric for the so-called feature complexity regarding the redundancy of features learned by a neural network at each layer. We offer a formal interpretation of our approach through the lens of category theory, a well-developed area in mathematics. To quantify the feature complexity, we further propose an efficient algorithm named Iterative Feature Merging (IFM). Our experimental results validate our ideas and theories from various perspectives. We empirically demonstrate that functional equivalence widely exists among different features learned by the same neural network and that we can reduce the number of parameters of the network without affecting the performance. IFM shows great potential as a data-agnostic model pruning method. We have also drawn several interesting empirical findings regarding the defined feature complexity.
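Summary
The behavior of neural networks remains opaque, and a recently widely noted phenomenon is that networks often achieve similar performance when initialized with different random parameters. This phenomenon has drawn significant attention to measuring the similarity between features learned by distinct networks. However, feature similarity can be vague as a description of the same feature, since equivalent features hardly exist. In this paper, we expand the concept of equivalent features and define what we call functionally equivalent features, which produce equivalent output under certain transformations. Using this definition, we aim to derive a more intrinsic metric, feature complexity, for the redundancy of the features learned by a neural network at each layer. We further propose an efficient algorithm, Iterative Feature Merging (IFM), to quantify it. Our experimental results validate our ideas and theories from various perspectives. We show empirically that functional equivalence is widespread among the different features learned by the same neural network and that IFM can reduce the number of network parameters without affecting performance. We also report several interesting empirical findings about the defined feature complexity.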
Comparing AI Algorithms for Optimizing Elliptic Curve Cryptography Parameters in Third-Party E-Commerce Integrations: A Pre-Quantum Era Analysis
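results: GA and PSO show different strengths in ECC parameter optimization: GA performs better on precision, while PSO performs better on stability. In a simulated e-commerce environment, the parameters optimized by GA and PSO are compared against secp256k1, demonstrating the effectiveness of both algorithms for ECC parameter optimization.
Abstract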
This paper presents a comparative analysis between the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), two vital artificial intelligence algorithms, focusing on optimizing Elliptic Curve Cryptography (ECC) parameters. These encompass the elliptic curve coefficients, prime number, generator point, group order, and cofactor. The study provides insights into which of the bio-inspired algorithms yields better optimization results for ECC configurations, examining performances under the same fitness function. This function incorporates methods to ensure robust ECC parameters, including assessing for singular or anomalous curves and applying Pollard's rho attack and Hasse's theorem for optimization precision. The optimized parameters generated by GA and PSO are tested in a simulated e-commerce environment, contrasting with well-known curves like secp256k1 during the transmission of order messages using Elliptic Curve-Diffie Hellman (ECDH) and Hash-based Message Authentication Code (HMAC). Focusing on traditional computing in the pre-quantum era, this research highlights the efficacy of GA and PSO in ECC optimization, with implications for enhancing cybersecurity in third-party e-commerce integrations. We recommend the immediate consideration of these findings before quantum computing's widespread adoption.
Summary
The fitness function incorporates methods to ensure robust ECC parameters, such as assessing for singular or anomalous curves and applying Pollard's rho attack and Hasse's theorem for optimization precision. The optimized parameters generated by GA and PSO are tested in a simulated e-commerce environment, and compared with well-known curves like secp256k1. The study focuses on traditional computing in the pre-quantum era, and highlights the efficacy of GA and PSO in ECC optimization, with implications for enhancing cybersecurity in third-party e-commerce integrations. The findings of this research are recommended for immediate consideration before the widespread adoption of quantum computing.
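Some of the sanity checks mentioned for the fitness function can be sketched for a short Weierstrass curve $y^2 = x^3 + ax + b$ over $\mathbb{F}_p$. The sketch below is not the paper's fitness function: it only shows the textbook singularity test ($4a^3 + 27b^2 \equiv 0 \pmod p$), the anomalous-curve test (group order equal to $p$), and the Hasse bound $|\#E - (p + 1)| \le 2\sqrt{p}$. The brute-force point counter and the toy prime 17 are illustration-only choices that would be useless at cryptographic sizes.

```python
def is_singular(a: int, b: int, p: int) -> bool:
    """A curve y^2 = x^3 + ax + b over F_p is singular when 4a^3 + 27b^2 = 0 mod p."""
    return (4 * pow(a, 3, p) + 27 * pow(b, 2, p)) % p == 0


def satisfies_hasse(order: int, p: int) -> bool:
    """Hasse bound |order - (p + 1)| <= 2*sqrt(p), checked with integers only."""
    return (order - p - 1) ** 2 <= 4 * p


def is_anomalous(order: int, p: int) -> bool:
    """Anomalous curves have exactly p points and are weak against known attacks."""
    return order == p


def count_points(a: int, b: int, p: int) -> int:
    """Brute-force point count (including the point at infinity); toy primes only."""
    count = 1  # the point at infinity
    for x in range(p):
        rhs = (x ** 3 + a * x + b) % p
        count += sum(1 for y in range(p) if (y * y) % p == rhs)
    return count


# Toy curve with the same shape as secp256k1 (a = 0, b = 7) over a tiny field.
a, b, p = 0, 7, 17
order = count_points(a, b, p)
print(order, is_singular(a, b, p), is_anomalous(order, p), satisfies_hasse(order, p))
```

A candidate produced by an optimizer such as GA or PSO would be rejected as soon as any of these checks fails, before more expensive tests are run.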
Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks
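results: Combining spherical harmonics with sinusoidal representation networks yields effective representations of geographic space and achieves state-of-the-art performance across classification and regression tasks.
Abstract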
Learning feature representations of geographical space is vital for any machine learning model that integrates geolocated data, spanning application domains such as remote sensing, ecology, or epidemiology. Recent work mostly embeds coordinates using sine and cosine projections based on Double Fourier Sphere (DFS) features -- these embeddings assume a rectangular data domain even on global data, which can lead to artifacts, especially at the poles. At the same time, relatively little attention has been paid to the exact design of the neural network architectures these functional embeddings are combined with. This work proposes a novel location encoder for globally distributed geographic data that combines spherical harmonic basis functions, natively defined on spherical surfaces, with sinusoidal representation networks (SirenNets) that can be interpreted as learned Double Fourier Sphere embedding. We systematically evaluate the cross-product of positional embeddings and neural network architectures across various classification and regression benchmarks and synthetic evaluation datasets. In contrast to previous approaches that require the combination of both positional encoding and neural networks to learn meaningful representations, we show that both spherical harmonics and sinusoidal representation networks are competitive on their own but set state-of-the-art performances across tasks when combined. We provide source code at www.github.com/marccoru/locationencoder
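Summary
Learning feature representations of geographical space is vital for any machine learning model that integrates geolocated data, in application domains such as remote sensing, ecology, and epidemiology. Most recent work embeds coordinates using sine and cosine projections based on Double Fourier Sphere (DFS) features. These embeddings assume a rectangular data domain even on global data, which can lead to artifacts, especially at the poles. At the same time, relatively little attention has been paid to the exact design of the neural network architectures that these embeddings are combined with. This work proposes a new location encoder for globally distributed geographic data that combines spherical harmonic basis functions with sinusoidal representation networks (SirenNets), which can be interpreted as a learned Double Fourier Sphere embedding. We systematically evaluate the cross-product of positional embeddings and neural network architectures across classification and regression benchmarks and synthetic evaluation datasets. In contrast to previous approaches, we show that spherical harmonics and sinusoidal representation networks are each competitive on their own and set state-of-the-art performance when combined. We provide source code at www.github.com/marccoru/locationencoder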
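The pole artifact mentioned above can be reproduced with a small sketch. The encoding below is a simplified sine/cosine embedding of longitude and latitude treated as a flat rectangle; it is not the exact DFS feature set, and the function names and frequency choices are assumptions. At the North Pole every longitude denotes the same physical point, yet the rectangular encoding assigns different feature vectors, whereas coordinates defined on the sphere agree.

```python
import math


def sincos_encoding(lon_deg: float, lat_deg: float, n_freqs: int = 3) -> list:
    """Simplified sine/cosine embedding that treats (lon, lat) as a rectangle."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    feats = []
    for k in range(n_freqs):
        scale = 2 ** k
        feats += [
            math.sin(scale * lon), math.cos(scale * lon),
            math.sin(scale * lat), math.cos(scale * lat),
        ]
    return feats


def unit_sphere_xyz(lon_deg: float, lat_deg: float) -> tuple:
    """Cartesian coordinates of the same location on the unit sphere."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def max_abs_diff(u, v) -> float:
    return max(abs(a - b) for a, b in zip(u, v))


# Two descriptions of the North Pole that differ only in longitude.
pole_a, pole_b = (0.0, 90.0), (90.0, 90.0)
print(max_abs_diff(sincos_encoding(*pole_a), sincos_encoding(*pole_b)))  # large gap
print(max_abs_diff(unit_sphere_xyz(*pole_a), unit_sphere_xyz(*pole_b)))  # essentially zero
```

Basis functions that are defined natively on the sphere, such as spherical harmonics, avoid this discontinuity by construction.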
Exploring Memorization in Fine-tuned Language Models
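results: Memorization in fine-tuned LMs differs strongly across tasks, and memorization correlates strongly with the attention score distribution. Multi-task fine-tuning is found to reduce fine-tuned memorization.
Abstract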
LLMs have shown great capabilities in various tasks but also exhibited memorization of training data, thus raising tremendous privacy and copyright concerns. While prior work has studied memorization during pre-training, the exploration of memorization during fine-tuning is rather limited. Compared with pre-training, fine-tuning typically involves sensitive data and diverse objectives, thus may bring unique memorization behaviors and distinct privacy risks. In this work, we conduct the first comprehensive analysis to explore LMs' memorization during fine-tuning across tasks. Our studies with open-sourced and our own fine-tuned LMs across various tasks indicate that fine-tuned memorization presents a strong disparity among tasks. We provide an understanding of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution. By investigating its memorization behavior, multi-task fine-tuning paves a potential strategy to mitigate fine-tuned memorization.
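Summary
LLMs have shown great capabilities in various tasks but also memorize training data, which raises serious privacy and copyright concerns. Prior work has studied memorization during pre-training, but memorization during fine-tuning has been explored much less. Compared with pre-training, fine-tuning typically involves sensitive data and diverse objectives, and may therefore bring unique memorization behaviors and distinct privacy risks. In this work, we conduct the first comprehensive analysis of LMs' memorization during fine-tuning across tasks. Our studies with open-source LMs and our own fine-tuned LMs on various tasks indicate that fine-tuned memorization differs strongly among tasks. We explain this task disparity via sparse coding theory and reveal a strong correlation between memorization and the attention score distribution. We also find that multi-task fine-tuning is a potential strategy for mitigating fine-tuned memorization.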
Quality Control at Your Fingertips: Quality-Aware Translation Models
paper_authors: Christian Tomani, David Vilar, Markus Freitag, Colin Cherry, Subhajit Naskar, Mara Finkelstein, Daniel Cremers
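for: Improving the translation quality of neural machine translation (NMT) models.
methods: The NMT model is trained to estimate the quality of its own output, and this quality signal is used as a prompt during MAP decoding.
results: Using the internal quality estimate to prune the hypothesis space during MBR decoding further improves translation quality while reducing inference cost by two orders of magnitude.
Abstract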
Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations being more likely. However, research has shown that this assumption does not always hold, and decoding strategies which directly optimize a utility function, like Minimum Bayes Risk (MBR) or Quality-Aware decoding can significantly improve translation quality over standard MAP decoding. The main disadvantage of these methods is that they require an additional model to predict the utility, and additional steps during decoding, which makes the entire process computationally demanding. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. During decoding, we can use the model's own quality estimates to guide the generation process and produce the highest-quality translations possible. We demonstrate that the model can self-evaluate its own output during translation, eliminating the need for a separate quality estimation model. Moreover, we show that using this quality signal as a prompt during MAP decoding can significantly improve translation quality. When using the internal quality estimate to prune the hypothesis space during MBR decoding, we can not only further improve translation quality, but also reduce inference speed by two orders of magnitude.
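Summary
Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, so that better translations are more likely. However, research has shown that this assumption does not always hold, and decoding strategies that directly optimize a utility function, such as Minimum Bayes Risk (MBR) or Quality-Aware decoding, can significantly improve translation quality. The main disadvantage of these methods is that they require an additional model to predict the utility and additional steps during decoding, which makes the whole process computationally demanding. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. During decoding, the model's own quality estimates guide the generation process toward the highest-quality translations. We demonstrate that the model can evaluate its own output, eliminating the need for a separate quality estimation model. Moreover, using this quality signal as a prompt during MAP decoding significantly improves translation quality. When the internal quality estimate is used to prune the hypothesis space during MBR decoding, translation quality improves further and inference cost drops by two orders of magnitude.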
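The interaction between pruning and MBR decoding can be sketched as follows. MBR picks the candidate with the highest average utility against the other candidates, so its cost grows quadratically with the number of candidates; pruning the list first with a per-candidate quality score shrinks that cost. The token-overlap utility and the quality scores below are stand-ins invented for illustration, not the learned metric or the model's quality estimates from the paper.

```python
def token_overlap(hyp: str, ref: str) -> float:
    """Toy utility: Jaccard overlap between the token sets of two sentences."""
    a, b = set(hyp.split()), set(ref.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def prune_by_quality(candidates, quality, keep):
    """Keep the `keep` candidates with the highest (assumed) quality estimates."""
    ranked = sorted(zip(candidates, quality), key=lambda cq: cq[1], reverse=True)
    return [cand for cand, _ in ranked[:keep]]


def mbr_select(candidates, utility=token_overlap):
    """Return the candidate with the highest mean utility against the others."""
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], ref) for ref in others) / max(len(others), 1)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat is on the mat",
    "dogs bark loudly",
]
quality = [0.9, 0.8, 0.7, 0.1]   # hypothetical self-estimated quality scores

shortlist = prune_by_quality(candidates, quality, keep=3)
print(mbr_select(shortlist))
```

With N candidates, plain MBR needs on the order of N squared utility evaluations; pruning to k candidates first reduces that to roughly k squared.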
DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection
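results: LSH and DeepLSH reduce the time needed for similarity search over crash reports while preserving the accuracy of the search results. The authors also release an original dataset so the results can be validated further.
Abstract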
Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists in grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question: What are the most similar bugs to a new one?, that is, efficiently find near-duplicates. It is thus natural to consider nearest neighbors search to tackle this problem and especially the well-known locality-sensitive hashing (LSH) to deal with large datasets due to its sublinear performance and theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function, that perfectly approximates the locality-sensitivity property even for Jaccard and Cosine metrics for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available.
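Summary
Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists of grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems must quickly answer the question "What are the most similar bugs to a new one?", that is, efficiently find near-duplicates. Nearest-neighbor search is therefore a natural choice, especially the well-known locality-sensitive hashing (LSH), which handles large datasets thanks to its sublinear performance and theoretical guarantees on similarity search accuracy. LSH has not previously been considered in the crash bucketing literature, and in this paper we study how to use it for this task. To cover the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function that approximates the locality-sensitivity property, including for the Jaccard and Cosine metrics. We support this claim with a series of experiments on an original dataset, which we make available.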
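For context, the sketch below shows the classical random-hyperplane LSH scheme for cosine similarity, one of the metrics for which an exact LSH family is known. It is not DeepLSH, which learns its hash functions with a Siamese network; the vectors, the number of bits, and the seed are arbitrary choices for illustration.

```python
import random


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> list:
    """Draw n_bits random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes) -> tuple:
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, vec)) >= 0) for plane in planes
    )


def hamming(a, b) -> int:
    """Number of differing bits; small values suggest high cosine similarity."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, n_bits=16)
report = [1.0, 0.5, 0.0, 2.0]          # e.g. a feature vector of a stack trace
near_duplicate = [1.1, 0.4, 0.1, 2.1]  # slightly perturbed copy
unrelated = [-2.0, 1.0, 3.0, -1.0]

print(hamming(signature(report, planes), signature(near_duplicate, planes)))
print(hamming(signature(report, planes), signature(unrelated, planes)))
```

Reports whose signatures collide in the same hash bucket become the candidate near-duplicates, so only those need to be compared with the exact metric.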
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
paper_authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
for: This paper aims to develop a cost-effective approach for building smaller language models (LLMs) from pre-trained, larger models.
methods: The approach uses two key techniques: targeted structured pruning and dynamic batch loading. Targeted structured pruning prunes the larger model to a specified target shape, while dynamic batch loading dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
results: The Sheared-LLaMA series, pruned from the LLaMA2-7B model, outperforms state-of-the-art open-source models of equivalent sizes on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of the compute required to train such models from scratch.
Abstract
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building smaller LLMs.
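Summary
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) shows the potential of building smaller yet powerful LLMs. However, training such models from scratch remains costly. In this work, we study structured pruning as an effective way to develop smaller LLMs. Our method uses two key techniques: (1) targeted structured pruning, which prunes a larger model end-to-end to a specified target shape by removing layers, heads, and intermediate and hidden dimensions; and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch according to the varying losses across domains. We present the Sheared-LLaMA series, obtained by pruning LLaMA2-7B down to 1.3B and 2.7B parameters. Sheared-LLaMA models perform strongly on a wide range of downstream and instruction tuning evaluations while requiring only 3% of the compute needed to train such models from scratch. This work provides evidence that structured pruning of existing LLMs is a more cost-effective way to build smaller LLMs.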
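The dynamic batch loading idea can be sketched as a reweighting step over data domains. The abstract does not give the exact update rule, so the exponentiated update below, the use of per-domain reference losses, and the domain names are assumptions for illustration only.

```python
import math


def update_domain_weights(weights, losses, reference_losses):
    """Shift sampling weight toward domains whose loss lags its reference.

    Assumed rule: w_new[d] is proportional to w[d] * exp(max(loss[d] - ref[d], 0)).
    """
    gaps = {d: max(losses[d] - reference_losses[d], 0.0) for d in weights}
    scaled = {d: weights[d] * math.exp(gaps[d]) for d in weights}
    total = sum(scaled.values())
    return {d: value / total for d, value in scaled.items()}


# Hypothetical two-domain mixture: "code" is further from its reference loss.
weights = {"web": 0.5, "code": 0.5}
losses = {"web": 2.0, "code": 3.0}
reference = {"web": 2.0, "code": 2.5}
print(update_domain_weights(weights, losses, reference))
```

Under this rule a domain that already meets its reference loss keeps its relative share, while lagging domains are sampled more often in the next batches.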
Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
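results: Experiments show that Meta-CoT achieves strong performance on ten public benchmark reasoning tasks together with good generalization. In particular, Meta-CoT reaches a state-of-the-art 93.7% on SVAMP without any program-aided methods.
Abstract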
Large language models (LLMs) have unveiled remarkable reasoning capabilities by exploiting chain-of-thought (CoT) prompting, which generates intermediate reasoning chains to serve as the rationale for deriving the answer. However, current CoT methods either simply employ general prompts such as Let's think step by step, or heavily rely on handcrafted task-specific demonstrations to attain preferable performances, thereby engendering an inescapable gap between performance and generalization. To bridge this gap, we propose Meta-CoT, a generalizable CoT prompting method in mixed-task scenarios where the type of input questions is unknown. Meta-CoT firstly categorizes the scenario based on the input question and subsequently constructs diverse demonstrations from the corresponding data pool in an automatic pattern. Meta-CoT simultaneously enjoys remarkable performances on ten public benchmark reasoning tasks and superior generalization capabilities. Notably, Meta-CoT achieves the state-of-the-art result on SVAMP (93.7%) without any additional program-aided methods. Our further experiments on five out-of-distribution datasets verify the stability and generality of Meta-CoT.
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach
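results: The results show that the approach provides insights that help end-users understand LLM predictions, and that the quality of LLM-generated code can be improved by adjusting the prompt.
Abstract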
While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where an LLM, deemed a complex and powerful black-box model, is instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and the lack of transparency. Inspired by the recent progress in causality analysis and its application in software engineering, this paper launches a causality analysis-based approach to systematically analyze the causal relations between the LLM input prompts and the generated code. To handle various technical challenges in this study, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The formed causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness, and aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.
Summary
Code generation is widely used in software development, but the quality of the generated code is not guaranteed. In LLM-based code generation, the LLM is regarded as a complex and powerful black-box model that produces code from a high-level natural language specification, the prompt. Evaluating and explaining the code generation capability of LLMs is very difficult because of their complexity and lack of transparency. Inspired by recent progress in causality analysis and its application in software engineering, this paper proposes a causality analysis-based approach that systematically analyzes the causal relations between LLM input prompts and the generated code. We first propose a causal graph-based representation of the prompt and the generated code, built over fine-grained, human-understandable concepts in the input prompts, and then use the graph to identify the causal relations between the prompt and the derived code. We study over 3 popular LLMs with over 12 prompt adjustment strategies. The results illustrate the potential of the technique to provide insights into LLM effectiveness, help end-users understand predictions, and improve the quality of LLM-generated code by properly calibrating the prompt.
Unlock the Potential of Counterfactually-Augmented Data in Out-Of-Distribution Generalization
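methods: Uses Counterfactually-Augmented Data (CAD), analyzes the myopia phenomenon in feature space, and introduces two constraints based on CAD's structural properties to help language models extract causal features more completely.
results: On Sentiment Analysis and Natural Language Inference, experiments show that the method improves the OOD generalization performance of language models by 1.0% to 5.9%.
Abstract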
Counterfactually-Augmented Data (CAD) -- minimal editing of sentences to flip the corresponding labels -- has the potential to improve the Out-Of-Distribution (OOD) generalization capability of language models, as CAD induces language models to exploit domain-independent causal features and exclude spurious correlations. However, the empirical results of CAD's OOD generalization are not as efficient as anticipated. In this study, we attribute the inefficiency to the myopia phenomenon caused by CAD: language models only focus on causal features that are edited in the augmentation operation and exclude other non-edited causal features. Therefore, the potential of CAD is not fully exploited. To address this issue, we analyze the myopia phenomenon in feature space from the perspective of Fisher's Linear Discriminant, then we introduce two additional constraints based on CAD's structural properties (dataset-level and sentence-level) to help language models extract more complete causal features in CAD, thereby mitigating the myopia phenomenon and improving OOD generalization capability. We evaluate our method on two tasks: Sentiment Analysis and Natural Language Inference, and the experimental results demonstrate that our method could unlock the potential of CAD and improve the OOD generalization performance of language models by 1.0% to 5.9%.
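Summary
Counterfactually-Augmented Data (CAD), which minimally edits sentences to flip their labels, has the potential to improve the Out-Of-Distribution (OOD) generalization of language models, because CAD induces models to exploit domain-independent causal features and exclude spurious correlations. In practice, however, CAD's OOD generalization falls short of expectations. We attribute this to the myopia phenomenon caused by CAD: language models focus only on the causal features edited during augmentation and ignore other, non-edited causal features, so the potential of CAD is not fully exploited. To address this, we analyze the myopia phenomenon in feature space from the perspective of Fisher's Linear Discriminant, then introduce two additional constraints based on CAD's structural properties (dataset-level and sentence-level) to help language models extract more complete causal features from CAD, mitigating myopia and improving OOD generalization. Experiments on Sentiment Analysis and Natural Language Inference show that our method improves OOD generalization performance by 1.0% to 5.9%.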
Assessing the Impact of a Supervised Classification Filter on Flow-based Hybrid Network Anomaly Detection
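results: Experiments show that the hybrid approach raises the detection rate of known attacks while retaining the ability to detect zero-day attacks. Adding a supervised binary prefilter increases the AUC metric by over 11% and detects 30% more attacks, while keeping the number of false positives approximately the same.
Abstract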
Constant evolution and the emergence of new cyberattacks require the development of advanced techniques for defense. This paper aims to measure the impact of a supervised filter (classifier) in network anomaly detection. We perform our experiments by employing a hybrid anomaly detection approach in network flow data. For this purpose, we extended a state-of-the-art autoencoder-based anomaly detection method by prepending a binary classifier acting as a prefilter for the anomaly detector. The method was evaluated on the publicly available real-world dataset UGR'16. Our empirical results indicate that the hybrid approach does offer a higher detection rate of known attacks than a standalone anomaly detector while still retaining the ability to detect zero-day attacks. Employing a supervised binary prefilter has increased the AUC metric by over 11%, detecting 30% more attacks while keeping the number of false positives approximately the same.
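Summary
The constant evolution and emergence of new cyberattacks call for advanced defense techniques. This paper measures the impact of a supervised filter (classifier) on network anomaly detection. We run experiments with a hybrid anomaly detection approach on network flow data, extending a state-of-the-art autoencoder-based anomaly detection method with a binary classifier that acts as a prefilter. The method is evaluated on the publicly available real-world dataset UGR'16. The results show that the hybrid approach improves the detection rate of known attacks while retaining the ability to detect zero-day attacks. With the supervised binary prefilter, the AUC metric increases by over 11% and 30% more attacks are detected, with roughly the same number of false positives.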
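The hybrid pipeline described above can be sketched as a two-stage decision. This is only an illustration of the routing idea: the callables `is_known_attack` and `reconstruction_error`, the threshold, and the way flagged flows bypass the anomaly detector are assumptions, not details taken from the paper.

```python
def hybrid_detect(flows, is_known_attack, reconstruction_error, threshold):
    """Label each flow with a supervised prefilter followed by an anomaly detector.

    is_known_attack: hypothetical binary classifier trained on labeled attacks.
    reconstruction_error: hypothetical autoencoder score; high means unusual.
    """
    labels = []
    for flow in flows:
        if is_known_attack(flow):
            # The prefilter catches attacks it was trained on.
            labels.append("known_attack")
        elif reconstruction_error(flow) > threshold:
            # The unsupervised detector can still flag unseen (zero-day) behavior.
            labels.append("anomaly")
        else:
            labels.append("normal")
    return labels


# Toy flows with precomputed scores standing in for real models.
flows = [
    {"signature_hit": True, "error": 0.1},
    {"signature_hit": False, "error": 0.9},
    {"signature_hit": False, "error": 0.2},
]
print(hybrid_detect(flows, lambda f: f["signature_hit"], lambda f: f["error"], 0.5))
```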
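methods: The paper proposes Diversity from Human Feedback (DivHF), which learns a behavior descriptor by querying human feedback and combines it with any distance measure to define a diversity measure.
results: Experiments show that DivHF learns a behavior space that aligns better with human preference, and that human feedback improves the outcome of diversity optimization.
Abstract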
Diversity plays a significant role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. How to define the diversity measure is a longstanding problem. Many methods rely on expert experience to define a proper behavior space and then obtain the diversity measure, which is, however, challenging in many scenarios. In this paper, we propose the problem of learning a behavior space from human feedback and present a general method called Diversity from Human Feedback (DivHF) to solve it. DivHF learns a behavior descriptor consistent with human preference by querying human feedback. The learned behavior descriptor can be combined with any distance measure to define a diversity measure. We demonstrate the effectiveness of DivHF by integrating it with the Quality-Diversity optimization algorithm MAP-Elites and conducting experiments on the QDax suite. The results show that DivHF learns a behavior space that aligns better with human requirements compared to direct data-driven approaches and leads to more diverse solutions under human preference. Our contributions include formulating the problem, proposing the DivHF method, and demonstrating its effectiveness through experiments.
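Summary
Diversity plays an important role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. How to define a diversity measure is a longstanding problem. Many methods rely on expert experience to define a suitable behavior space and then derive the diversity measure, which is difficult in many scenarios. In this paper, we pose the problem of learning a behavior space from human feedback and present a general method, Diversity from Human Feedback (DivHF), to solve it. DivHF learns a behavior descriptor consistent with human preference by querying human feedback, and the learned descriptor can be combined with any distance measure to define a diversity measure. We integrate DivHF with the MAP-Elites algorithm and run experiments on the QDax suite, showing that DivHF learns a behavior space that aligns better with human requirements and leads to more diverse solutions. Our contributions are formulating the problem, proposing DivHF, and demonstrating its effectiveness experimentally.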
Topic-DPR: Topic-based Prompts for Dense Passage Retrieval
results: Experiments on two datasets show that the method surpasses previous state-of-the-art retrieval techniques.
Abstract
Prompt-based learning's efficacy across numerous natural language processing tasks has led to its integration into dense passage retrieval. Prior research has mainly focused on enhancing the semantic understanding of pre-trained language models by optimizing a single vector as a continuous prompt. This approach, however, leads to a semantic space collapse; identical semantic information seeps into all representations, causing their distributions to converge in a restricted region. This hinders differentiation between relevant and irrelevant passages during dense retrieval. To tackle this issue, we present Topic-DPR, a dense passage retrieval model that uses topic-based prompts. Unlike the single prompt method, multiple topic-based prompts are established over a probabilistic simplex and optimized simultaneously through contrastive learning. This encourages representations to align with their topic distributions, improving space uniformity. Furthermore, we introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency. Experimental results from two datasets affirm that our method surpasses previous state-of-the-art retrieval techniques.
results: SOTA results on the DDBP2 problem (estimating the number of tricks for two given hands).
Abstract
Contract bridge is a game characterized by incomplete information, posing an exciting challenge for artificial intelligence methods. This paper proposes the BridgeHand2Vec approach, which leverages a neural network to embed a bridge player's hand (consisting of 13 cards) into a vector space. The resulting representation reflects the strength of the hand in the game and enables interpretable distances to be determined between different hands. This representation is derived by training a neural network to estimate the number of tricks that a pair of players can take. In the remainder of this paper, we analyze the properties of the resulting vector space and provide examples of its application in reinforcement learning, and opening bid classification. Although this was not our main goal, the neural network used for the vectorization achieves SOTA results on the DDBP2 problem (estimating the number of tricks for two given hands).
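Summary
Contract bridge is a game of incomplete information and an exciting challenge for artificial intelligence. This paper proposes BridgeHand2Vec, which uses a neural network to embed a bridge player's hand of 13 cards into a vector space. The representation reflects the strength of the hand in play and allows interpretable distances between different hands to be determined. It is obtained by training a neural network to estimate the number of tricks a pair of players can take. We analyze the properties of the resulting vector space and give examples of its application to reinforcement learning and opening bid classification. Although this was not our main goal, the network used for vectorization achieves state-of-the-art results on the DDBP2 problem.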
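Before a network can embed a hand, the 13 cards need a fixed-size numeric input. The sketch below shows one common choice, a 52-dimensional binary vector; the abstract does not state the input encoding BridgeHand2Vec actually uses, so the card notation, the ordering of suits and ranks, and the function name are assumptions.

```python
SUITS = "SHDC"           # spades, hearts, diamonds, clubs (assumed order)
RANKS = "AKQJT98765432"  # ace high down to two (assumed order)


def encode_hand(cards):
    """Encode a 13-card hand such as ["SA", "HK", ...] as a 52-dim 0/1 vector."""
    if len(cards) != 13 or len(set(cards)) != 13:
        raise ValueError("a bridge hand has exactly 13 distinct cards")
    vector = [0] * 52
    for card in cards:
        suit, rank = card[0], card[1]
        vector[SUITS.index(suit) * 13 + RANKS.index(rank)] = 1
    return vector


# A hand holding every spade sets the first 13 positions.
all_spades = ["S" + rank for rank in RANKS]
print(encode_hand(all_spades))
```

A learned embedding would then map this sparse vector to a dense one in which distances reflect playing strength, which the raw 0/1 encoding does not.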
V2X-AHD: Vehicle-to-Everything Cooperation Perception via Asymmetric Heterogenous Distillation Network
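results: Applied to the large-scale open dataset V2Xset, the algorithm achieves state-of-the-art results. V2X-AHD effectively improves 3D object detection accuracy and reduces the number of network parameters, and the results can serve as a benchmark for cooperative perception.
Abstract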
Object detection is the central issue of intelligent traffic systems, and recent advancements in single-vehicle lidar-based 3D detection indicate that it can provide accurate position information for intelligent agents to make decisions and plan. Compared with single-vehicle perception, multi-view vehicle-road cooperation perception has fundamental advantages, such as the elimination of blind spots and a broader range of perception, and has become a research hotspot. However, the current perception of cooperation focuses on improving the complexity of fusion while ignoring the fundamental problems caused by the absence of single-view outlines. We propose a multi-view vehicle-road cooperation perception system, vehicle-to-everything cooperative perception (V2X-AHD), in order to enhance the identification capability, particularly for predicting the vehicle's shape. At first, we propose an asymmetric heterogeneous distillation network fed with different training data to improve the accuracy of contour recognition, with multi-view teacher features transferring to single-view student features. While the point cloud data are sparse, we propose Spara Pillar, a spare convolutional-based plug-in feature extraction backbone, to reduce the number of parameters and improve and enhance feature extraction capabilities. Moreover, we leverage the multi-head self-attention (MSA) to fuse the single-view feature, and the lightweight design makes the fusion feature a smooth expression. The results of applying our algorithm to the massive open dataset V2Xset demonstrate that our method achieves the state-of-the-art result. The V2X-AHD can effectively improve the accuracy of 3D object detection and reduce the number of network parameters, according to this study, which serves as a benchmark for cooperative perception. The code for this article is available at https://github.com/feeling0414-lab/V2X-AHD.
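Summary
For object detection in intelligent traffic systems, recent advances show that single-vehicle lidar-based 3D detection can provide accurate position information that helps intelligent agents make decisions and plan. Compared with single-vehicle perception, multi-view vehicle-road cooperation perception has fundamental advantages, such as eliminating blind spots and widening the range of perception, and it has become a research hotspot. Existing cooperative perception methods, however, concentrate on more complex fusion and ignore the basic problems caused by missing single-view outlines. We propose a multi-view vehicle-road cooperation perception system, V2X-AHD, to strengthen identification capability, particularly for predicting vehicle shape. We propose an asymmetric heterogeneous distillation network fed with different training data to improve contour recognition accuracy, transferring multi-view teacher features to single-view student features. Because point cloud data are sparse, we propose Spara Pillar, a plug-in feature extraction backbone that reduces the number of parameters. We also use multi-head self-attention (MSA) to fuse the single-view features with a lightweight design. Applied to the large open dataset V2Xset, the method achieves state-of-the-art results. V2X-AHD improves the accuracy of 3D object detection and reduces the number of network parameters, and this study can serve as a benchmark for cooperative perception. The code is available at https://github.com/feeling0414-lab/V2X-AHD.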
A Black-Box Physics-Informed Estimator based on Gaussian Process Regression for Robot Inverse Dynamics Identification
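results: Experiments in simulation and on two real robotic manipulators (Franka Emika Panda and MELFA RV4FL) show that the proposed model outperforms state-of-the-art black-box estimators, including Gaussian Processes and neural networks, in accuracy, generality, and data efficiency. Experiments on the MELFA robot further show that the approach achieves performance comparable to fine-tuned model-based estimators while requiring less prior information.
Abstract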
In this paper, we propose a black-box model based on Gaussian process regression for the identification of the inverse dynamics of robotic manipulators. The proposed model relies on a novel multidimensional kernel, called Lagrangian Inspired Polynomial (LIP) kernel. The LIP kernel is based on two main ideas. First, instead of directly modeling the inverse dynamics components, we model as GPs the kinetic and potential energy of the system. The GP prior on the inverse dynamics components is derived from those on the energies by applying the properties of GPs under linear operators. Second, as regards the energy prior definition, we prove a polynomial structure of the kinetic and potential energy, and we derive a polynomial kernel that encodes this property. As a consequence, the proposed model allows also to estimate the kinetic and potential energy without requiring any label on these quantities. Results on simulation and on two real robotic manipulators, namely a 7 DOF Franka Emika Panda and a 6 DOF MELFA RV4FL, show that the proposed model outperforms state-of-the-art black-box estimators based both on Gaussian Processes and Neural Networks in terms of accuracy, generality and data efficiency. The experiments on the MELFA robot also demonstrate that our approach achieves performance comparable to fine-tuned model-based estimators, despite requiring less prior information.
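Summary
In this paper, we propose a black-box model based on Gaussian process regression for identifying the inverse dynamics of robotic manipulators. The model relies on a new multidimensional kernel, the Lagrangian Inspired Polynomial (LIP) kernel, which rests on two main ideas. First, instead of modeling the inverse dynamics components directly, we model the kinetic and potential energy of the system as Gaussian processes (GPs); the prior on the inverse dynamics components then follows from the properties of GPs under linear operators. Second, we prove that the kinetic and potential energy have a polynomial structure and derive a polynomial kernel that encodes this property. As a result, the model can estimate not only the inverse dynamics components but also the kinetic and potential energy, without any labels on these quantities. We run experiments on two real robotic manipulators, a 7 DOF Franka Emika Panda and a 6 DOF MELFA RV4FL, and compare against existing black-box estimators based on Gaussian Processes and Neural Networks. The model performs better in accuracy, generality, and data efficiency, and the MELFA experiments show that it is comparable to fine-tuned model-based estimators while requiring less prior information.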
paper_authors: Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman
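for: The paper studies the development of temporal vocabulary in emergent communication.
methods: The paper uses a new agent architecture to test whether temporal references can emerge naturally in emergent communication.
results: Experiments show that the new agent architecture is sufficient for temporal references to emerge naturally, with no additional losses required. The findings provide a basis for incorporating temporal referencing into other emergent communication environments.
Abstract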
As humans, we use linguistic elements referencing time, such as before or tomorrow, to easily share past experiences and future predictions. While temporal aspects of the language have been considered in computational linguistics, no such exploration has been done within the field of emergent communication. We research this gap, providing the first reported temporal vocabulary within emergent communication literature. Our experimental analysis shows that a different agent architecture is sufficient for the natural emergence of temporal references, and that no additional losses are necessary. Our readily transferable architectural insights provide the basis for the incorporation of temporal referencing into other emergent communication environments.
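Summary
Humans use linguistic elements that reference time, such as before or tomorrow, to share past experiences and future predictions with ease. Temporal aspects of language have been considered in computational linguistics, but they have not yet been explored in emergent communication. We study this gap and report the first temporal vocabulary in the emergent communication literature. Our experimental analysis shows that a different agent architecture is sufficient for temporal references to emerge naturally, with no additional losses. Our readily transferable architectural insights provide a basis for incorporating temporal referencing into other emergent communication environments.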
Automated clinical coding using off-the-shelf large language models
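methods: The method uses off-the-shelf pre-trained large language models (LLMs) and generates ICD codes through information extraction and a hierarchical search. In a second stage, GPT-4 performs meta-refinement, selecting a subset of the relevant labels as predictions.
results: Tested on the CodiEsp dataset, the method achieves state-of-the-art performance on rarer classes, with the best macro-F1 of 0.225 against 0.216 for PLM-ICD, and a lower micro-F1 of 0.157 against 0.219 for PLM-ICD. The method requires no task-specific learning.
Abstract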
The task of assigning diagnostic ICD codes to patient hospital admissions is typically performed by expert human coders. Efforts towards automated ICD coding are dominated by supervised deep learning models. However, difficulties in learning to predict the large number of rare codes remain a barrier to adoption in clinical practice. In this work, we leverage off-the-shelf pre-trained generative large language models (LLMs) to develop a practical solution that is suitable for zero-shot and few-shot code assignment. Unsupervised pre-training alone does not guarantee precise knowledge of the ICD ontology and specialist clinical coding task, therefore we frame the task as information extraction, providing a description of each coded concept and asking the model to retrieve related mentions. For efficiency, rather than iterating over all codes, we leverage the hierarchical nature of the ICD ontology to sparsely search for relevant codes. Then, in a second stage, which we term 'meta-refinement', we utilise GPT-4 to select a subset of the relevant labels as predictions. We validate our method using Llama-2, GPT-3.5 and GPT-4 on the CodiEsp dataset of ICD-coded clinical case documents. Our tree-search method achieves state-of-the-art performance on rarer classes, achieving the best macro-F1 of 0.225, whilst achieving slightly lower micro-F1 of 0.157, compared to 0.216 and 0.219 respectively from PLM-ICD. To the best of our knowledge, this is the first method for automated ICD coding requiring no task-specific learning.
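Summary
Assigning diagnostic ICD codes is usually done by expert human coders, and attempts at automated ICD coding are held back by the difficulty of predicting rare codes. We use off-the-shelf pre-trained large language models (LLMs) to build a practical solution suited to zero-shot and few-shot code assignment. Rather than training a supervised model, we frame the task as information extraction and ask the model to retrieve mentions related to each coded concept. For efficiency, we use the hierarchical structure of the ICD ontology to restrict the search to relevant codes. In a second stage, we use GPT-4 to select a subset of the relevant labels as predictions. We validate the method with Llama-2, GPT-3.5, and GPT-4 on the CodiEsp dataset. It achieves state-of-the-art performance on rarer classes, with a macro-F1 of 0.225 against 0.216 for PLM-ICD, while its micro-F1 of 0.157 is lower than the 0.219 of PLM-ICD. To our knowledge, this is the first method for automated ICD coding that requires no task-specific learning.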
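The sparse hierarchical search can be sketched as a pruned tree traversal: a branch of the ontology is expanded only if its parent concept is judged relevant. In the sketch below, `is_relevant` stands in for the LLM query, and the toy ontology with its concept names is invented for illustration; neither is taken from the paper or from the real ICD hierarchy.

```python
def tree_search(node, children, is_relevant):
    """Collect leaf concepts reachable through relevant branches only.

    children: dict mapping a concept to its child concepts (toy ontology).
    is_relevant: hypothetical stand-in for asking an LLM whether the
    clinical note mentions the concept.
    """
    found = []
    for child in children.get(node, []):
        if not is_relevant(child):
            continue  # prune the whole subtree without querying its members
        if children.get(child):
            found.extend(tree_search(child, children, is_relevant))
        else:
            found.append(child)
    return found


# Toy ontology and a recorded "LLM" that only confirms two concepts.
ontology = {
    "root": ["respiratory", "circulatory"],
    "respiratory": ["asthma", "pneumonia"],
    "circulatory": ["hypertension"],
}
queried = []


def fake_llm(concept):
    queried.append(concept)
    return concept in {"respiratory", "asthma"}


print(tree_search("root", ontology, fake_llm))
print(queried)
```

In this toy run the concept under the rejected branch is never queried, which is the saving the hierarchy buys over iterating across all codes.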
Rationale-Enhanced Language Models are Better Continual Relation Learners
results: Experiments on two standard benchmarks show that the method outperforms state-of-the-art CRE models.
Abstract
Continual relation extraction (CRE) aims to solve the problem of catastrophic forgetting when learning a sequence of newly emerging relations. Recent CRE studies have found that catastrophic forgetting arises from the model's lack of robustness against future analogous relations. To address the issue, we introduce rationale, i.e., the explanations of relation classification results generated by large language models (LLM), into CRE task. Specifically, we design the multi-task rationale tuning strategy to help the model learn current relations robustly. We also conduct contrastive rationale replay to further distinguish analogous relations. Experimental results on two standard benchmarks demonstrate that our method outperforms the state-of-the-art CRE models.
Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach
paper_authors: Gyu Seon Kim, JaeHyun Chung, Soohyun Park
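for: The paper explores applying quantum reinforcement learning to the control systems of reusable rockets.
methods: The paper integrates quantum reinforcement learning into the control system so that it can adapt to changes in the rocket's dynamics.
results: The authors find that quantum reinforcement learning offers higher computational efficiency, lower memory requirements, and more stable performance, which makes it a good fit for reusable rocket control systems.
Abstract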
The advent of reusable rockets has heralded a new era in space exploration, reducing the costs of launching satellites by a significant factor. Traditional rockets were disposable, but the design of reusable rockets for repeated use has revolutionized the financial dynamics of space missions. The most critical phase of reusable rockets is the landing stage, which involves managing the tremendous speed and attitude for safe recovery. The complexity of this task presents new challenges for control systems, specifically in terms of precision and adaptability. Classical control systems like the proportional-integral-derivative (PID) controller lack the flexibility to adapt to dynamic system changes, so the controller must be redesigned at significant cost and time. This paper explores the integration of quantum reinforcement learning into the control systems of reusable rockets as a promising alternative. Unlike classical reinforcement learning, quantum reinforcement learning uses quantum bits that can exist in superposition, allowing for more efficient information encoding and reducing the number of parameters required. This leads to increased computational efficiency, reduced memory requirements, and more stable and predictable performance. Due to the nature of reusable rockets, which must be light, heavy computers cannot fit into them. In the reusable rocket scenario, quantum reinforcement learning, which has reduced memory requirements due to fewer parameters, is a good solution.
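Summary
Reusable rockets have sharply reduced the cost of launching satellites and opened a new era of space exploration. Traditional rockets were disposable, whereas reusable rockets are designed for repeated use, which has transformed the financial dynamics of space missions. The most critical phase for a reusable rocket is landing, which requires controlling very high speed and attitude for a safe recovery. The complexity of the task poses new challenges for control systems, especially in precision and adaptability. Classical control systems such as the proportional-integral-derivative (PID) controller lack adaptability, and redesigning the controller costs time and money. This paper explores integrating quantum reinforcement learning into the control system as a possible alternative. Unlike classical reinforcement learning, quantum reinforcement learning uses quantum bits that can exist in superposition, which allows more efficient information encoding and fewer parameters. This leads to higher computational efficiency, lower memory requirements, and more stable and predictable performance. Because reusable rockets must be light, quantum reinforcement learning, with its smaller parameter count, is a good solution in this scenario.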
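For reference, the classical baseline the abstract contrasts against can be written in a few lines. The sketch below is a textbook discrete-time PID controller; the gains and the example error values are arbitrary and are not taken from the paper. The fixed gains `kp`, `ki`, `kd` are what would have to be retuned when the system dynamics change.

```python
class PID:
    """Textbook discrete-time PID controller with fixed gains."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the control output for the current error and time step."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Arbitrary gains and a shrinking altitude error, for illustration only.
controller = PID(kp=2.0, ki=0.5, kd=0.1)
for error in (10.0, 6.0, 3.0):
    print(controller.step(error, dt=0.1))
```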
A Novel Contrastive Learning Method for Clickbait Detection on RoCliCo: A Romanian Clickbait Corpus of News Articles
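results: The authors manually annotate 8,313 news samples and run experiments with four machine learning methods to establish a line-up of competitive baselines. They also propose a BERT-based contrastive learning model that encodes news titles and contents into a deep metric space in order to separate clickbait from non-clickbait news.
Abstract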
To increase revenue, news websites often resort to using deceptive news titles, luring users into clicking on the title and reading the full news. Clickbait detection is the task that aims to automatically detect this form of false advertisement and avoid wasting the precious time of online users. Despite the importance of the task, to the best of our knowledge, there is no publicly available clickbait corpus for the Romanian language. To this end, we introduce a novel Romanian Clickbait Corpus (RoCliCo) comprising 8,313 news samples which are manually annotated with clickbait and non-clickbait labels. Furthermore, we conduct experiments with four machine learning methods, ranging from handcrafted models to recurrent and transformer-based neural networks, to establish a line-up of competitive baselines. We also carry out experiments with a weighted voting ensemble. Among the considered baselines, we propose a novel BERT-based contrastive learning model that learns to encode news titles and contents into a deep metric space such that titles and contents of non-clickbait news have high cosine similarity, while titles and contents of clickbait news have low cosine similarity. Our data set and code to reproduce the baselines are publicly available for download at https://github.com/dariabroscoteanu/RoCliCo.
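Summary
To increase revenue, news websites often use deceptive titles that lure users into clicking and reading the full article. Clickbait detection is the task of automatically detecting this form of false advertisement so that online users do not waste their time. To the best of our knowledge, there is no publicly available clickbait corpus for the Romanian language. We therefore introduce the Romanian Clickbait Corpus (RoCliCo), which contains 8,313 news samples manually annotated with clickbait and non-clickbait labels. We run experiments with four machine learning methods, ranging from handcrafted models to recurrent and transformer-based neural networks, to establish a line-up of competitive baselines, and we also evaluate a weighted voting ensemble. Among the baselines, we propose a BERT-based contrastive learning model that encodes news titles and contents into a deep metric space in which titles and contents of non-clickbait news have high cosine similarity, while those of clickbait news have low cosine similarity. Our dataset and the code to reproduce the baselines are publicly available at https://github.com/dariabroscoteanu/RoCliCo.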
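The contrastive objective described above can be illustrated with a cosine-based loss on a title embedding and a content embedding. The abstract does not give the actual loss, so the form below (pull non-clickbait pairs toward similarity 1, push clickbait pairs below a margin) and the `margin` parameter are assumptions; the vectors are toy values, not BERT outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(title_vec, content_vec, is_clickbait, margin=0.0):
    """Assumed loss: reward high similarity for non-clickbait, low for clickbait."""
    similarity = cosine(title_vec, content_vec)
    if is_clickbait:
        # Penalize only the similarity that exceeds the margin.
        return max(0.0, similarity - margin)
    return 1.0 - similarity


# Toy embeddings: an aligned pair and an orthogonal pair.
print(contrastive_loss([3.0, 4.0], [3.0, 4.0], is_clickbait=False))
print(contrastive_loss([1.0, 0.0], [0.0, 1.0], is_clickbait=True))
```

At inference time, a low title-content similarity would then be read as a sign of clickbait, with the decision threshold chosen on validation data.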
Accelerating Monte Carlo Tree Search with Probability Tree State Abstraction
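results: Integrated with state-of-the-art MCTS-based algorithms such as Sampled MuZero and Gumbel MuZero, PTSA reduces the search space by 10%-45% on different tasks and accelerates the training of those algorithms.
Abstract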
Monte Carlo Tree Search (MCTS) algorithms such as AlphaGo and MuZero have achieved superhuman performance in many challenging tasks. However, the computational complexity of MCTS-based algorithms is influenced by the size of the search space. To address this issue, we propose a novel probability tree state abstraction (PTSA) algorithm to improve the search efficiency of MCTS. A general tree state abstraction with path transitivity is defined. In addition, the probability tree state abstraction is proposed for fewer mistakes during the aggregation step. Furthermore, the theoretical guarantees of the transitivity and aggregation error bound are justified. To evaluate the effectiveness of the PTSA algorithm, we integrate it with state-of-the-art MCTS-based algorithms, such as Sampled MuZero and Gumbel MuZero. Experimental results on different tasks demonstrate that our method can accelerate the training process of state-of-the-art algorithms with 10%-45% search space reduction.
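Summary
Monte Carlo Tree Search (MCTS) algorithms such as AlphaGo and MuZero have achieved superhuman performance in many challenging tasks. However, the computational complexity of MCTS-based algorithms depends on the size of the search space. To address this, we propose a probability tree state abstraction (PTSA) algorithm that improves the search efficiency of MCTS. We define a general tree state abstraction with path transitivity and propose the probability tree state abstraction to reduce mistakes during the aggregation step. We also provide theoretical guarantees on transitivity and on the aggregation error bound. To evaluate PTSA, we integrate it with state-of-the-art MCTS-based algorithms such as Sampled MuZero and Gumbel MuZero. Experiments show that the method reduces the search space by 10%-45% and accelerates the training of these algorithms.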
RK-core: An Established Methodology for Exploring the Hierarchical Structure within Datasets
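results: Across several benchmark datasets, samples with low coreness values appear less representative of their categories, while samples with high coreness values contribute more to performance. The study also finds that a high-quality coreset should exhibit hierarchical diversity rather than only selecting representative samples.
Abstract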
Recently, the field of machine learning has undergone a transition from model-centric to data-centric. The advancements in diverse learning tasks have been propelled by the accumulation of more extensive datasets, subsequently facilitating the training of larger models on these datasets. However, these datasets remain relatively under-explored. To this end, we introduce a pioneering approach known as RK-core, to empower gaining a deeper understanding of the intricate hierarchical structure within datasets. Across several benchmark datasets, we find that samples with low coreness values appear less representative of their respective categories, and conversely, those with high coreness values exhibit greater representativeness. Correspondingly, samples with high coreness values make a more substantial contribution to the performance in comparison to those with low coreness values. Building upon this, we further employ RK-core to analyze the hierarchical structure of samples with different coreset selection methods. Remarkably, we find that a high-quality coreset should exhibit hierarchical diversity instead of solely opting for representative samples. The code is available at https://github.com/yaolu-zjut/Kcore.
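Summary
Machine learning has recently shifted from a model-centric to a data-centric view, with progress across learning tasks driven by ever larger datasets. These datasets, however, remain under-explored. We therefore propose RK-core, a method for gaining a deeper understanding of the hierarchical structure within datasets. On several benchmark datasets, we find that samples with low coreness values are less representative of their categories, while samples with high coreness values are more representative and contribute more to performance. Building on this, we use RK-core to analyze the hierarchical structure of samples chosen by different coreset selection methods, and find that a high-quality coreset should exhibit hierarchical diversity rather than only containing the most representative samples. The code is available at https://github.com/yaolu-zjut/Kcore.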
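The abstract above ranks samples by their "coreness". As a rough illustration only, the following sketch computes the classical k-core number of each node in an undirected graph by repeatedly peeling the lowest-degree node. This is an assumption about the general idea: the abstract does not describe how RK-core builds its graph from a dataset or how it differs from plain k-core, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict


def coreness(edges):
    """Return the classical k-core number of every node in an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {node: len(neigh) for node, neigh in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        # The core number never decreases along the peeling order.
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neigh in adj[node]:
            if neigh in remaining:
                degree[neigh] -= 1
    return core


# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_core = coreness(toy_edges)
```

In this toy graph the triangle nodes end up with a higher core number than the pendant node, which mirrors the intuition that densely connected samples sit deeper in the hierarchy.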
Evaluation of ChatGPT Feedback on ELL Writers’ Coherence and Cohesion
paper_authors: Su-Youn Yoon, Eva Miszoglad, Lisa R. Pierce
for: The paper evaluates how effectively ChatGPT provides feedback on the coherence and cohesion of essays written by English Language Learner (ELL) students.
methods: The paper evaluates the ChatGPT-generated feedback in two steps: classifying each sentence into subtypes based on its function (e.g., positive reinforcement, problem statement), then evaluating its accuracy and usability.
results: The paper finds that most feedback sentences generated by ChatGPT are highly abstract and generic, failing to provide concrete suggestions for improvement. The accuracy in detecting major problems, such as repetitive ideas and the inaccurate use of cohesive devices, depends on superficial linguistic features and is often incorrect, indicating that ChatGPT, without specific training for the feedback generation task, does not offer effective feedback on ELL students' coherence and cohesion.
Abstract
Since its launch in November 2022, ChatGPT has had a transformative effect on education where students are using it to help with homework assignments and teachers are actively employing it in their teaching practices. This includes using ChatGPT as a tool for writing teachers to grade and generate feedback on students' essays. In this study, we evaluated the quality of the feedback generated by ChatGPT regarding the coherence and cohesion of the essays written by English Language Learners (ELLs) students. We selected 50 argumentative essays and generated feedback on coherence and cohesion using the ELLIPSE rubric. During the feedback evaluation, we used a two-step approach: first, each sentence in the feedback was classified into subtypes based on its function (e.g., positive reinforcement, problem statement). Next, we evaluated its accuracy and usability according to these types. Both the analysis of feedback types and the evaluation of accuracy and usability revealed that most feedback sentences were highly abstract and generic, failing to provide concrete suggestions for improvement. The accuracy in detecting major problems, such as repetitive ideas and the inaccurate use of cohesive devices, depended on superficial linguistic features and was often incorrect. In conclusion, ChatGPT, without specific training for the feedback generation task, does not offer effective feedback on ELL students' coherence and cohesion.
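Summary
Since its launch in November 2022, ChatGPT has had a transformative effect on education: students use it to help with homework, and teachers actively use it in their teaching, including to grade student essays and generate feedback. In this study, we evaluate the quality of ChatGPT's feedback on the coherence and cohesion of essays written by English Language Learner (ELL) students. We selected 50 argumentative essays and generated feedback using the ELLIPSE rubric. We evaluated the feedback in two steps: first, each feedback sentence was classified by its function (e.g., positive reinforcement, problem statement); next, we assessed its accuracy and usability. Most feedback sentences were highly abstract and generic and offered no concrete suggestions for improvement. The accuracy in detecting major problems, such as repetitive ideas and inaccurate use of cohesive devices, depended on superficial linguistic features and was often incorrect. In conclusion, ChatGPT, without specific training for feedback generation, does not offer effective feedback on ELL students' coherence and cohesion.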
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task
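results: Experiments show that current open-source LLMs achieve limited robustness on real-world noisy data. Based on these observations, the authors make forward-looking suggestions to advance research in this area.
Abstract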
With the increasing capabilities of large language models (LLMs), these high-performance models have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, the models' performance on commonly used benchmark datasets often fails to accurately reflect their reliability and robustness when applied to real-world noisy data. To address these challenges, we propose a unified robustness evaluation framework based on the slot-filling task to systematically evaluate the dialogue understanding capability of LLMs in diverse input perturbation scenarios. Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Furthermore, we utilize a multi-level data augmentation method (character, word, and sentence levels) to construct a candidate data pool, and carefully design two automatic task demonstration construction strategies (instance-level and entity-level) with various prompt templates. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios. The experiments demonstrate that current open-source LLMs generally achieve limited perturbation robustness. Based on these experimental observations, we make some forward-looking suggestions to fuel research in this direction.
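Summary
As the capabilities of large language models (LLMs) keep improving, these models have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, their performance on commonly used benchmark datasets often fails to reflect their reliability and robustness on real-world noisy data. To address this, we propose a unified robustness evaluation framework based on the slot-filling task that systematically evaluates the dialogue understanding capability of LLMs under diverse input perturbations. Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. We also use a multi-level data augmentation method (character, word, and sentence levels) to build a candidate data pool, and design two automatic task demonstration construction strategies (instance-level and entity-level) with various prompt templates. Our aim is to assess how different robustness methods perform in real-world noisy scenarios. Experiments show that current open-source LLMs generally achieve limited perturbation robustness. Based on these observations, we make some forward-looking suggestions to advance research in this direction.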
MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents
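results: The authors' evaluation shows that these collaborative generative agents perform promisingly in a simulated job fair environment, but also reveals limitations in more complex coordination tasks.
Abstract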
Significant advancements have occurred in the application of Large Language Models (LLMs) for various tasks and social simulations. Despite this, their capacities to coordinate within task-oriented social contexts are under-explored. Such capabilities are crucial if LLMs are to effectively mimic human-like social behavior and produce meaningful results. To bridge this gap, we introduce collaborative generative agents, endowing LLM-based Agents with consistent behavior patterns and task-solving abilities. We situate these agents in a simulated job fair environment as a case study to scrutinize their coordination skills. We propose a novel framework that equips collaborative generative agents with human-like reasoning abilities and specialized skills. Our evaluation demonstrates that these agents show promising performance. However, we also uncover limitations that hinder their effectiveness in more complex coordination tasks. Our work provides valuable insights into the role and evolution of LLMs in task-oriented social simulations.
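Summary
Significant progress has been made in applying Large Language Models (LLMs) to various tasks and social simulations. However, their ability to coordinate within task-oriented social contexts remains under-explored. Such abilities are crucial if LLMs are to mimic human-like social behavior and produce meaningful results. To bridge this gap, we introduce collaborative generative agents, which endow LLM-based agents with consistent behavior patterns and task-solving abilities. We place these agents in a simulated job fair environment as a case study to examine their coordination skills. We propose a new framework that equips collaborative generative agents with human-like reasoning abilities and specialized skills. Our evaluation shows that these agents perform promisingly. However, we also uncover limitations that hinder their effectiveness in more complex coordination tasks. Our work provides valuable insights into the role and evolution of LLMs in task-oriented social simulations.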
Topological RANSAC for instance verification and retrieval without fine-tuning
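results: Compared with the conventional SP method, our method significantly improves retrieval performance in the non-fine-tuning setting and can also enhance performance when used with fine-tuned features. It is also highly explainable and lightweight, making it suitable for a variety of real-world applications.
Abstract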
This paper presents an innovative approach to enhancing explainable image retrieval, particularly in situations where a fine-tuning set is unavailable. The widely-used SPatial verification (SP) method, despite its efficacy, relies on a spatial model and the hypothesis-testing strategy for instance recognition, leading to inherent limitations, including the assumption of planar structures and neglect of topological relations among features. To address these shortcomings, we introduce a pioneering technique that replaces the spatial model with a topological one within the RANSAC process. We propose bio-inspired saccade and fovea functions to verify the topological consistency among features, effectively circumventing the issues associated with SP's spatial model. Our experimental results demonstrate that our method significantly outperforms SP, achieving state-of-the-art performance in non-fine-tuning retrieval. Furthermore, our approach can enhance performance when used in conjunction with fine-tuned features. Importantly, our method retains high explainability and is lightweight, offering a practical and adaptable solution for a variety of real-world applications.
Memory efficient location recommendation through proximity-aware representation
results: Evaluations on three real-world Location-Based Social Networking (LBSN) datasets show that PASR outperforms state-of-the-art sequential location recommendation methods.
Abstract
Sequential location recommendation plays a huge role in modern life, which can enhance user experience, bring more profit to businesses and assist in government administration. Although methods for location recommendation have evolved significantly thanks to the development of recommendation systems, there is still limited utilization of geographic information, along with the ongoing challenge of addressing data sparsity. In response, we introduce a Proximity-aware based region representation for Sequential Recommendation (PASR for short), built upon the Self-Attention Network architecture. We tackle the sparsity issue through a novel loss function employing importance sampling, which emphasizes informative negative samples during optimization. Moreover, PASR enhances the integration of geographic information by employing a self-attention-based geography encoder to the hierarchical grid and proximity grid at each GPS point. To further leverage geographic information, we utilize the proximity-aware negative samplers to enhance the quality of negative samples. We conducted evaluations using three real-world Location-Based Social Networking (LBSN) datasets, demonstrating that PASR surpasses state-of-the-art sequential location recommendation methods
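Summary
Sequential location recommendation plays a huge role in modern life: it can enhance user experience, bring more profit to businesses, and assist government administration. Although location recommendation methods have advanced considerably, they still make limited use of geographic information and struggle with data sparsity. To address this, we introduce a proximity-aware region representation for sequential recommendation (PASR for short), built on the self-attention network architecture. We tackle sparsity with a new loss function that uses importance sampling, and we use a self-attention-based geography encoder to better integrate geographic information. We also use proximity-aware negative samplers to improve the quality of negative samples. Evaluations on three real-world Location-Based Social Networking (LBSN) datasets show that PASR surpasses state-of-the-art sequential location recommendation methods.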
Understanding the Effects of RLHF on LLM Generalisation and Diversity
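results: RLHF generalises better than SFT to new inputs, especially when the distribution shift between train and test is large. However, RLHF reduces output diversity across a variety of measures. These results can guide the choice of fine-tuning method and motivate further improvement of RLHF methods.
Abstract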
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT, Anthropic's Claude, or Meta's LLaMA-2. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the trade-off between generalisation and diversity.
Constructive Large Language Models Alignment with Diverse Feedback
paper_authors: Tianshu Yu, Ting-En Lin, Yuchuan Wu, Min Yang, Fei Huang, Yongbin Li
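for: Aligning large language models (LLMs) with human values to reduce the impact of harmful content.
methods: We introduce Constructive and Diverse Feedback (CDF), a new method inspired by constructivist learning theory that collects three types of feedback (critique, refinement, and preference) tailored to problems of different difficulty levels in the training dataset.
results: Evaluations on three downstream tasks (question answering, dialog generation, and text summarization) show that CDF achieves better alignment performance than previous methods, even with a smaller training dataset.
Abstract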
In recent research on large language models (LLMs), there has been a growing emphasis on aligning these models with human values to reduce the impact of harmful content. However, current alignment methods often rely solely on singular forms of human feedback, such as preferences, annotated labels, or natural language critiques, overlooking the potential advantages of combining these feedback types. This limitation leads to suboptimal performance, even when ample training data is available. In this paper, we introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance LLM alignment, inspired by constructivist learning theory. Our approach involves collecting three distinct types of feedback tailored to problems of varying difficulty levels within the training dataset. Specifically, we exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data. To assess the effectiveness of CDF, we evaluate it against previous methods in three downstream tasks: question answering, dialog generation, and text summarization. Experimental results demonstrate that CDF achieves superior performance even with a smaller training dataset.
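Summary
Recent research on large language models (LLMs) has placed growing emphasis on aligning these models with human values to reduce the impact of harmful content. However, current alignment methods usually rely on a single form of human feedback, such as preferences, annotated labels, or natural language critiques, and overlook the potential of combining these feedback types. This limitation leads to suboptimal performance even when ample training data is available. In this paper, we propose Constructive and Diverse Feedback (CDF), a new method inspired by constructivist learning theory. Our approach collects three types of feedback tailored to problems of different difficulty levels in the training dataset: critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. Training with this diversified feedback improves alignment performance while using less training data. To assess CDF, we compare it with previous methods on three downstream tasks: question answering, dialog generation, and text summarization. Experiments show that CDF achieves superior performance even with a smaller training dataset.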
Stepwise functional refoundation of relational concept analysis
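results: RCA returns a single family of concept lattices, although other solutions may be considered acceptable when the data contain circular dependencies.
Abstract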
Relational concept analysis (RCA) is an extension of formal concept analysis allowing to deal with several related contexts simultaneously. It has been designed for learning description logic theories from data and used within various applications. A puzzling observation about RCA is that it returns a single family of concept lattices although, when the data feature circular dependencies, other solutions may be considered acceptable. The semantics of RCA, provided in an operational way, does not shed light on this issue. In this report, we define these acceptable solutions as those families of concept lattices which belong to the space determined by the initial contexts (well-formed), cannot scale new attributes (saturated), and refer only to concepts of the family (self-supported). We adopt a functional view on the RCA process by defining the space of well-formed solutions and two functions on that space: one expansive and the other contractive. We show that the acceptable solutions are the common fixed points of both functions. This is achieved step-by-step by starting from a minimal version of RCA that considers only one single context defined on a space of contexts and a space of lattices. These spaces are then joined into a single space of context-lattice pairs, which is further extended to a space of indexed families of context-lattice pairs representing the objects manip
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
paper_authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner
for: This paper proposes a new cross-modal fusion technique for generative error correction in automatic speech recognition (ASR).
methods: The method uses both acoustic information and external linguistic representations to generate accurate speech transcription contexts, marking a step towards a new paradigm of generative error correction within the realm of n-best hypotheses.
results: Evaluations on diverse ASR datasets demonstrate the stability and reproducibility of the fusion technique, with a 37.66% relative improvement in word error rate (WERR) over the n-best hypotheses.
Abstract
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the existing ranking-based rescoring methods, our approach adeptly uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique, demonstrating a 37.66% relative improvement in word error rate (WERR) over the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
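Summary
We introduce a new cross-modal fusion technique for generative error correction in automatic speech recognition (ASR). Our method uses both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost the ASR performance obtained from pre-trained speech and text models. Through evaluation on multiple ASR datasets, we assess the stability and reproducibility of our fusion technique and show a 37.66% relative improvement in word error rate (WERR) over the n-best hypotheses. To encourage future research, we have released our code and pre-trained models at https://github.com/Srijith-rkr/Whispering-LLaMA.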
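The headline number above is a relative word error rate improvement (WERR). As a minimal sketch of how such a figure is usually computed, the code below derives word error rate from a word-level edit distance and then the relative improvement between a baseline and a corrected transcript. The function names and example sentences are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer, new_wer):
    """Fraction of the baseline word error rate that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one reference, a baseline hypothesis, and a corrected one.
reference = "the cat sat on the mat"
baseline_wer = word_error_rate(reference, "a cat sit on mat")
corrected_wer = word_error_rate(reference, "the cat sat on mat")
werr = relative_wer_improvement(baseline_wer, corrected_wer)
```

Here the baseline makes three word errors out of six reference words and the corrected transcript makes one, so two thirds of the baseline error is removed.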
Retromorphic Testing: A New Approach to the Test Oracle Problem
paper_authors: Boxi Yu, Qiuyang Mang, Qingshuo Guo, Pinjia He
for: This paper focuses on developing a novel black-box testing methodology called Retromorphic Testing, which is inspired by the mathematical concept of inverse functions. The purpose is to provide a non-intrusive and effective approach to testing software systems.
methods: The proposed method uses an auxiliary program in conjunction with the program under test, creating a dual-program structure. The input data is processed by the forward program, and then the output is reversed to its original input format using the backward program. The testing modes include using the auxiliary program as either the forward or backward program.
results: The paper presents three testing modes with illustrative use cases across diverse programs, including algorithms, traditional software, and AI applications. The method is demonstrated to be effective in revealing defects and bugs in the software systems under test.
Abstract
A test oracle serves as a criterion or mechanism to assess the correspondence between software output and the anticipated behavior for a given input set. In automated testing, black-box techniques, known for their non-intrusive nature in test oracle construction, are widely used, including notable methodologies like differential testing and metamorphic testing. Inspired by the mathematical concept of inverse function, we present Retromorphic Testing, a novel black-box testing methodology. It leverages an auxiliary program in conjunction with the program under test, which establishes a dual-program structure consisting of a forward program and a backward program. The input data is first processed by the forward program and then its program output is reversed to its original input format using the backward program. In particular, the auxiliary program can operate as either the forward or backward program, leading to different testing modes. The process concludes by examining the relationship between the initial input and the transformed output within the input domain. For example, to test the implementation of the sine function $\sin(x)$, we can employ its inverse function, $\arcsin(x)$, and validate the equation $x = \sin(\arcsin(x)+2k\pi), \forall k \in \mathbb{Z}$. In addition to the high-level concept of Retromorphic Testing, this paper presents its three testing modes with illustrative use cases across diverse programs, including algorithms, traditional software, and AI applications.
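The sine example from the abstract can be sketched directly. Below, a hypothetical hand-written sine (`my_sin`, not from the paper) plays the role of the program under test, and the standard library's `math.asin` serves as the auxiliary backward program. The check validates $x = \sin(\arcsin(x) + 2k\pi)$ on a grid of inputs; the tolerance and grid are assumptions chosen for illustration.

```python
import math


def my_sin(angle, terms=12):
    """Hypothetical program under test: sine via range reduction and a Taylor series."""
    # Reduce the angle to [-pi, pi] so the series converges quickly.
    angle = math.remainder(angle, 2 * math.pi)
    total, term = 0.0, angle
    for n in range(terms):
        total += term
        # Next Taylor term: multiply by -x^2 / ((2n+2)(2n+3)).
        term *= -angle * angle / ((2 * n + 2) * (2 * n + 3))
    return total


def retromorphic_check(sin_impl, xs, ks, tol=1e-9):
    """Return every (x, k, result) where sin_impl(asin(x) + 2*k*pi) differs from x."""
    failures = []
    for x in xs:
        for k in ks:
            roundtrip = sin_impl(math.asin(x) + 2 * k * math.pi)
            if abs(roundtrip - x) > tol:
                failures.append((x, k, roundtrip))
    return failures


def buggy_sin(angle):
    """Deliberately flawed implementation: two Taylor terms, no range reduction."""
    return angle - angle ** 3 / 6


xs = [i / 10 for i in range(-10, 11)]  # inputs in the domain of arcsin
ks = range(-3, 4)
good_failures = retromorphic_check(my_sin, xs, ks)
bad_failures = retromorphic_check(buggy_sin, xs, ks)
```

No reference output for sine is needed: the relation between the original input and the round-tripped value acts as the oracle, and the truncated implementation is flagged because it breaks that relation.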
Proceedings of The first international workshop on eXplainable AI for the Arts (XAIxArts)
results: The workshop proceedings report findings on the application of XAI in the arts.
Abstract
This first international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. The workshop was held at the 15th ACM Conference on Creativity and Cognition (C&C 2023).
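Summary
This is the first international workshop on explainable AI for the Arts (XAIxArts), which brought together a community of researchers to explore the role of XAI in the arts. The workshop was held at the ACM Conference on Creativity and Cognition (C&C 2023).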
TANGO: Time-Reversal Latent GraphODE for Multi-Agent Dynamical Systems
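results: Our method performs well in experiments on a variety of physical systems, in particular achieving an 11.5% MSE improvement on a challenging chaotic triple-pendulum system.
Abstract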
Learning complex multi-agent system dynamics from data is crucial across many domains, such as in physical simulations and material modeling. Extending purely data-driven approaches, existing physics-informed approaches such as Hamiltonian Neural Networks strictly follow the energy conservation law to introduce inductive bias, making their learning more sample-efficient. However, many real-world systems do not strictly conserve energy, such as spring systems with friction. Recognizing this, we turn our attention to a broader physical principle: Time-Reversal Symmetry, which states that the dynamics of a system shall remain invariant when traversed back over time. It still helps to preserve energy for conservative systems and, at the same time, serves as a strong inductive bias for non-conservative, reversible systems. To inject such inductive bias, in this paper, we propose a simple-yet-effective self-supervised regularization term as a soft constraint that aligns the forward and backward trajectories predicted by a continuous graph neural network-based ordinary differential equation (GraphODE). It effectively imposes time-reversal symmetry to enable more accurate model predictions across a wider range of dynamical systems under classical mechanics. In addition, we further provide theoretical analysis to show that our regularization essentially minimizes higher-order Taylor expansion terms during the ODE integration steps, which enables our model to be more noise-tolerant and even applicable to irreversible systems. Experimental results on a variety of physical systems demonstrate the effectiveness of our proposed method. In particular, it achieves an MSE improvement of 11.5% on a challenging chaotic triple-pendulum system.
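Summary
Learning the dynamics of complex multi-agent systems from data is crucial in many domains, such as physical simulation and material modeling. Existing physics-informed approaches such as Hamiltonian Neural Networks strictly follow the energy conservation law to introduce inductive bias, which makes learning more sample-efficient. However, many real-world systems, such as spring systems with friction, do not strictly conserve energy. Recognizing this, we turn to a broader physical principle: time-reversal symmetry, which states that the dynamics of a system remain invariant when traversed backward in time. It still helps preserve energy for conservative systems while providing a strong inductive bias for non-conservative, reversible systems. To inject this inductive bias, we propose a simple yet effective self-supervised regularization term that aligns the forward and backward trajectories predicted by a continuous graph neural network-based ordinary differential equation (GraphODE). It effectively imposes time-reversal symmetry, making model predictions more accurate across a wider range of physical systems. We also provide theoretical analysis showing that our regularization essentially minimizes higher-order Taylor expansion terms during the ODE integration steps, which makes the model more noise-tolerant and even applicable to irreversible systems. Experiments on a variety of physical systems demonstrate the effectiveness of the proposed method. In particular, it achieves an 11.5% MSE improvement on a challenging chaotic triple-pendulum system.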
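To make the forward/backward alignment idea concrete, here is a toy sketch, not the paper's GraphODE. It rolls a one-dimensional spring forward, then starts from the final state with the momentum flipped, rolls forward again, and measures the mean squared gap between the two trajectories after undoing the reversal. The step function, parameter values, and names are all hypothetical; a regularizer in the spirit of the abstract would penalize this gap for a learned model.

```python
def rollout(step, state, n_steps):
    """Apply `step` repeatedly and return the list of visited (position, momentum) states."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


def reversal_gap(step, state, n_steps):
    """Mean squared gap between the forward trajectory and the time-reversed one."""
    forward = rollout(step, state, n_steps)
    q_end, p_end = forward[-1]
    # Start from the final state with flipped momentum and run the same dynamics.
    backward = rollout(step, (q_end, -p_end), n_steps)
    # Undo the reversal: flip the time order and the momenta again.
    backward = [(q, -p) for q, p in reversed(backward)]
    gaps = [(qf - qb) ** 2 + (pf - pb) ** 2
            for (qf, pf), (qb, pb) in zip(forward, backward)]
    return sum(gaps) / len(gaps)


def make_spring_step(friction, dt=0.01):
    """Kick-drift-kick step for a unit-mass, unit-stiffness spring with optional friction."""
    def step(state):
        q, p = state
        p = p - 0.5 * dt * (q + friction * p)
        q = q + dt * p
        p = p - 0.5 * dt * (q + friction * p)
        return (q, p)
    return step


conservative_gap = reversal_gap(make_spring_step(friction=0.0), (1.0, 0.0), 200)
damped_gap = reversal_gap(make_spring_step(friction=0.5), (1.0, 0.0), 200)
```

For the frictionless spring the two trajectories coincide up to rounding, while with friction the reversed rollout keeps losing energy and the gap becomes clearly nonzero, which is why the abstract treats the term as a soft constraint rather than a hard one.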
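results: GPT-4 achieves results comparable to the current state of the art, and detects propaganda techniques with relatively high precision and accuracy.
Abstract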
The prevalence of propaganda in our digital society poses a challenge to societal harmony and the dissemination of truth. Detecting propaganda through NLP in text is challenging due to subtle manipulation techniques and contextual dependencies. To address this issue, we investigate the effectiveness of modern Large Language Models (LLMs) such as GPT-3 and GPT-4 for propaganda detection. We conduct experiments using the SemEval-2020 task 11 dataset, which features news articles labeled with 14 propaganda techniques as a multi-label classification problem. Five variations of GPT-3 and GPT-4 are employed, incorporating various prompt engineering and fine-tuning strategies across the different models. We evaluate the models' performance by assessing metrics such as $F1$ score, $Precision$, and $Recall$, comparing the results with the current state-of-the-art approach using RoBERTa. Our findings demonstrate that GPT-4 achieves comparable results to the current state-of-the-art. Further, this study analyzes the potential and challenges of LLMs in complex tasks like propaganda detection.
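Summary
The prevalence of propaganda in today's digital society poses a challenge to social harmony and the dissemination of truth. Detecting propaganda in text with natural language processing (NLP) is challenging because of subtle manipulation techniques and contextual dependencies. To address this, we investigate the effectiveness of modern Large Language Models (LLMs) such as GPT-3 and GPT-4 for propaganda detection. We run experiments on the SemEval-2020 task 11 dataset, which consists of news articles labeled with 14 propaganda techniques and is framed as a multi-label classification problem. We use five variations of GPT-3 and GPT-4 with different prompt engineering and fine-tuning strategies. We evaluate the models with $F1$ score, $Precision$, and $Recall$, and compare them with the current state-of-the-art approach based on RoBERTa. Our findings show that GPT-4 achieves results comparable to the current state of the art. The study also analyzes the potential and challenges of LLMs in complex tasks such as propaganda detection.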
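For a multi-label task such as the one described above, precision, recall, and F1 are computed over sets of predicted and gold labels. The sketch below uses micro-averaging, which is an assumption: the abstract does not state the averaging scheme. The label names and toy data are purely illustrative.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 for multi-label predictions.

    `gold` and `predicted` are equally long sequences of label collections,
    one collection per document.
    """
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        gold_set, predicted_set = set(gold_labels), set(predicted_labels)
        tp += len(gold_set & predicted_set)   # labels correctly predicted
        fp += len(predicted_set - gold_set)   # labels predicted but not in gold
        fn += len(gold_set - predicted_set)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative toy data: three documents with hypothetical technique labels.
gold = [{"loaded_language", "name_calling"}, {"doubt"}, set()]
predicted = [{"loaded_language"}, {"doubt", "slogans"}, {"flag_waving"}]
precision, recall, f1 = micro_prf(gold, predicted)
```

In this toy case two of the four predicted labels are correct and two of the three gold labels are recovered, so precision is 1/2, recall is 2/3, and F1 is 4/7.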
Advective Diffusion Transformers for Topological Generalization in Graph Learning
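results: Experiments show that the ADiT model, built on non-local diffusion, achieves superior performance across a range of graph learning tasks and generalizes well under shifts in graph topology.
Abstract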
Graph diffusion equations are intimately related to graph neural networks (GNNs) and have recently attracted attention as a principled framework for analyzing GNN dynamics, formalizing their expressive power, and justifying architectural choices. One key open question in graph learning is the generalization capability of GNNs. A major limitation of current approaches hinges on the assumption that the graph topologies in the training and test sets come from the same distribution. In this paper, we make steps towards understanding the generalization of GNNs by exploring how graph diffusion equations extrapolate and generalize in the presence of varying graph topologies. We first show deficiencies in the generalization capability of existing models built upon local diffusion on graphs, stemming from the exponential sensitivity to topology variation. Our subsequent analysis reveals the promise of non-local diffusion, which advocates for feature propagation over fully-connected latent graphs, under the assumption of a specific data-generating condition. In addition to these findings, we propose a novel graph encoder backbone, Advective Diffusion Transformer (ADiT), inspired by advective graph diffusion equations that have a closed-form solution backed up with theoretical guarantees of desired generalization under topological distribution shifts. The new model, functioning as a versatile graph Transformer, demonstrates superior performance across a wide range of graph learning tasks.
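Summary
Graph diffusion equations are closely related to graph neural networks (GNNs) and have recently attracted attention as a principled framework for analyzing GNN dynamics, formalizing their expressive power, and justifying architectural choices. A key open question in graph learning is the generalization capability of GNNs. A major limitation of current approaches is the assumption that the graph topologies in the training and test sets come from the same distribution. In this paper, we study how graph diffusion equations extrapolate and generalize under varying graph topologies. We first show that existing models built on local diffusion generalize poorly because of their extreme sensitivity to topology variation. Our subsequent analysis reveals the promise of non-local diffusion, which propagates features over fully-connected latent graphs, under a specific data-generating condition. We also propose a new graph encoder backbone, Advective Diffusion Transformer (ADiT), inspired by advective graph diffusion equations that have a closed-form solution with theoretical guarantees of generalization under topological distribution shifts. The new model, acting as a versatile graph Transformer, shows superior performance across a wide range of graph learning tasks.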
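As background for the local diffusion that the abstract contrasts with non-local diffusion, the sketch below integrates the plain graph heat equation, dx/dt = -Lx with L = D - A, using explicit Euler steps. It is not ADiT and does not include advection; the function name, step size, and toy graph are hypothetical choices for illustration.

```python
def diffuse(adjacency, features, step_size=0.1, n_steps=200):
    """Explicit Euler integration of dx/dt = -L x on a graph, with L = D - A.

    `adjacency` is a symmetric matrix given as a list of lists and
    `features` holds one scalar per node.
    """
    n = len(adjacency)
    x = list(features)
    for _ in range(n_steps):
        # Each node moves towards its neighbours: dx_i/dt = sum_j A_ij (x_j - x_i).
        dx = [sum(adjacency[i][j] * (x[j] - x[i]) for j in range(n))
              for i in range(n)]
        x = [xi + step_size * dxi for xi, dxi in zip(x, dx)]
    return x


# Hypothetical toy graph: a path 0 - 1 - 2 with all the signal on node 0.
path_graph = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
diffused = diffuse(path_graph, [1.0, 0.0, 0.0])
```

On this connected graph the signal spreads until every node holds the average value, and the total is conserved because the adjacency matrix is symmetric. Since each update only uses a node's direct neighbours, the trajectory depends directly on the edge set, which is the kind of topology dependence the abstract analyzes.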
Hexa: Self-Improving for Knowledge-Grounded Dialogue System
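results: Experiments on several benchmark datasets show that our method successfully uses a self-improving mechanism to generate intermediate and final responses and improves knowledge-grounded dialogue generation.
Abstract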
A common practice in knowledge-grounded dialogue generation is to explicitly utilize intermediate steps (e.g., web-search, memory retrieval) with modular approaches. However, data for such steps are often inaccessible compared to those of dialogue responses as they are unobservable in an ordinary dialogue. To fill in the absence of these data, we develop a self-improving method to improve the generative performances of intermediate steps without the ground truth data. In particular, we propose a novel bootstrapping scheme with a guided prompt and a modified loss function to enhance the diversity of appropriate self-generated responses. Through experiments on various benchmark datasets, we empirically demonstrate that our method successfully leverages a self-improving mechanism in generating intermediate and final responses and improves the performances on the task of knowledge-grounded dialogue generation.
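Summary
A common practice in knowledge-grounded dialogue generation is to explicitly use intermediate steps (e.g., web search, memory retrieval) with modular approaches. However, data for these intermediate steps are usually unobservable in an ordinary dialogue and therefore harder to obtain than dialogue responses. To compensate for this missing data, we propose a self-improving method that improves the generative performance of intermediate steps without ground-truth data. Specifically, we propose a new bootstrapping scheme with a guided prompt and a modified loss function to increase the diversity of appropriate self-generated responses. Through experiments on various benchmark datasets, we show that our method successfully uses a self-improving mechanism to generate intermediate and final responses and improves performance on knowledge-grounded dialogue generation.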
for: The paper aims to improve the practicality of molecular property prediction benchmarks for drug discovery by creating a new benchmark called Lo-Hi, which includes two tasks: Lead Optimization and Hit Identification.
methods: The paper uses a novel molecular splitting algorithm to solve the Balanced Vertex Minimum $k$-Cut problem for the Hi task, and tests state-of-the-art and classic machine learning models under practical settings.
results: The paper shows that modern benchmarks are unrealistic and overoptimistic, and that the Lo-Hi benchmark is more practical and accurate for drug discovery applications.
Abstract
Finding new drugs is getting harder and harder. One of the hopes of drug discovery is to use machine learning models to predict molecular properties. That is why models for molecular property prediction are being developed and tested on benchmarks such as MoleculeNet. However, existing benchmarks are unrealistic and are too different from applying the models in practice. We have created a new practical \emph{Lo-Hi} benchmark consisting of two tasks: Lead Optimization (Lo) and Hit Identification (Hi), corresponding to the real drug discovery process. For the Hi task, we designed a novel molecular splitting algorithm that solves the Balanced Vertex Minimum $k$-Cut problem. We tested state-of-the-art and classic ML models, revealing which works better under practical settings. We analyzed modern benchmarks and showed that they are unrealistic and overoptimistic. Review: https://openreview.net/forum?id=H2Yb28qGLV Lo-Hi benchmark: https://github.com/SteshinSS/lohi_neurips2023 Lo-Hi splitter library: https://github.com/SteshinSS/lohi_splitter
摘要
寻找新药正变得越来越困难。药物发现的希望之一,是使用机器学习模型来预测分子性质。因此,人们开发了分子性质预测模型,并在 MoleculeNet 等基准上进行测试。然而,现有的基准并不现实,与模型的实际应用场景差异过大。我们创建了一个新的实用基准 Lo-Hi,包括两个任务:Lead Optimization(Lo)和 Hit Identification(Hi),对应真实的药物发现流程。对于 Hi 任务,我们设计了一种新的分子划分算法,用于求解 Balanced Vertex Minimum $k$-Cut 问题。我们测试了最先进的和经典的机器学习模型,揭示了哪些模型在实际设置下表现更好。我们还分析了现有的基准,表明它们不现实且过于乐观。评审:https://openreview.net/forum?id=H2Yb28qGLV Lo-Hi benchmark:https://github.com/SteshinSS/lohi_neurips2023 Lo-Hi splitter library:https://github.com/SteshinSS/lohi_splitter
P5: Plug-and-Play Persona Prompting for Personalized Response Selection
paper_authors: Joosung Lee, Minsik Oh, Donghun Lee
for: This paper aims to address the challenges of using persona-grounded retrieval-based chatbots for personalized conversations, specifically the high cost of collecting persona-grounded corpora and the chatbot’s lack of consideration for persona in real-world applications.
methods: The proposed solution is a plug-and-play persona prompting method that allows the chatbot system to function as a standard open-domain chatbot when persona information is not available. The method uses a zero-shot setting to reduce the dependence on persona-grounded training data, and the model can be fine-tuned for even better performance.
results: The authors demonstrate that the zero-shot model improves the standard model by 7.71 and 1.04 points in the original persona and revised persona, respectively, and that the fine-tuned model improves the previous state-of-the-art system by 1.95 and 3.39 points in the original persona and revised persona, respectively. To the authors' knowledge, this is the first attempt to solve the problem of personalized response selection using prompt sequences.
Abstract
Persona-grounded retrieval-based chatbots are crucial for personalized conversations, but several challenges need to be addressed. 1) In general, collecting a persona-grounded corpus is very expensive. 2) In real applications, the chatbot system does not always take the persona into account when responding. To address these challenges, we propose a plug-and-play persona prompting method. Our system can function as a standard open-domain chatbot if persona information is not available. We demonstrate that this approach performs well in the zero-shot setting, which reduces the dependence on persona-grounded training data. This makes it easier to expand the system to other languages without the need to build a persona-grounded corpus. Additionally, our model can be fine-tuned for even better performance. In our experiments, the zero-shot model improved the standard model by 7.71 and 1.04 points in the original persona and revised persona, respectively. The fine-tuned model improved the previous state-of-the-art system by 1.95 and 3.39 points in the original persona and revised persona, respectively. To the best of our knowledge, this is the first attempt to solve the problem of personalized response selection using prompt sequences. Our code is available on GitHub at https://github.com/rungjoo/plug-and-play-prompt-persona.
摘要
基于检索、以人物设定(persona)为依据的聊天机器人对于个性化对话至关重要,但仍有若干挑战需要解决。1)一般而言,收集带有人物设定的语料非常昂贵。2)在实际应用中,聊天系统并不总是结合人物设定进行回复。为了解决这些挑战,我们提出了一种即插即用的人物设定提示方法。当人物设定信息不可用时,我们的系统可以作为标准的开放域聊天机器人运行。我们证明了该方法在零样本设定下表现良好,从而减少了对带人物设定的训练数据的依赖。这使得系统更容易扩展到其他语言,而无需构建带人物设定的语料库。此外,我们的模型还可以通过微调进一步提升性能。在我们的实验中,零样本模型在原始人物设定和修订人物设定上分别比标准模型提高了7.71和1.04分;微调后的模型在原始人物设定和修订人物设定上分别比此前最先进的系统提高了1.95和3.39分。据我们所知,这是首次尝试使用提示序列来解决个性化回复选择问题。我们的代码可在 GitHub 上获取(https://github.com/rungjoo/plug-and-play-prompt-persona)。
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
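results: Experiments show that ICA and ICD can, respectively, increase and reduce the success rate of adversarial jailbreaking attacks on language models.
Abstract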
Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating malicious content have emerged. In this paper, we explore the power of In-Context Learning (ICL) in manipulating the alignment ability of LLMs. We find that by providing just a few in-context demonstrations without fine-tuning, LLMs can be manipulated to increase or decrease the probability of jailbreaking, i.e., answering malicious prompts. Based on these observations, we propose In-Context Attack (ICA) and In-Context Defense (ICD) methods for jailbreaking and guarding aligned language models, respectively. ICA crafts malicious contexts to guide models in generating harmful outputs, while ICD enhances model robustness through demonstrations of refusing to answer harmful prompts. Our experiments show the effectiveness of ICA and ICD in increasing or reducing the success rate of adversarial jailbreaking attacks. Overall, we shed light on the potential of ICL to influence LLM behavior and provide a new perspective for enhancing the safety and alignment of LLMs.
What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
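methods: The paper models multi-modal models facing missing modalities from an information-theoretic perspective and proposes Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA uses uni-modal pre-trained weights to improve feature extraction and missing-modality data augmentation to adapt better to missing modalities.
results: The paper demonstrates UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
Abstract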
With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective and illustrate that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in non-missing modalities. In practice, there are two key aspects: (1) the encoder should be able to extract sufficiently good features from the non-missing modality; (2) the extracted features should be robust enough not to be influenced by noise during the fusion process across modalities. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities. In addition, UME-MMA, built on a late-fusion learning framework, allows for the plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
摘要
随着多模态学习的日益成功,多模态模型的鲁棒性研究,尤其是在面对模态缺失情形时的鲁棒性,正受到越来越多的关注。然而,该领域的已有研究存在一定局限:它们往往缺乏理论洞见,或者其方法受限于特定的网络架构或模态。我们从信息论的角度对多模态模型遭遇模态缺失的场景进行建模,并说明通过高效利用未缺失模态中固有的信息,可以逼近此类场景下的性能上限。在实践中,有两个关键方面:(1)编码器应当能够从未缺失的模态中提取足够好的特征;(2)提取出的特征应当足够鲁棒,在跨模态融合过程中不受噪声影响。为此,我们提出了 Uni-Modal Ensemble with Missing Modality Adaptation(UME-MMA)。UME-MMA 为多模态模型采用单模态预训练权重以增强特征提取,并利用缺失模态数据增强技术来更好地适应模态缺失的情形。此外,UME-MMA 构建于晚期融合学习框架之上,支持即插即用地使用各种编码器,因而适用于多种模态,并可无缝集成大规模预训练编码器以进一步提升性能。我们在音视频数据集(如 AV-MNIST、Kinetics-Sound、AVE)和视觉语言数据集(如 MM-IMDB、UPMC Food101)上验证了 UME-MMA 的有效性。
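The two ingredients named in the abstract, missing-modality data augmentation and late fusion, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `drop_modalities` and `late_fusion`, the drop probability `p_drop`, and the choice of zero vectors to stand in for a missing modality are all assumptions.

```python
import random


def drop_modalities(sample, p_drop=0.3, rng=random):
    """Randomly blank out modalities during training (hypothetical augmentation).

    `sample` maps a modality name to its feature vector. Each modality is
    replaced by zeros with probability `p_drop`, but at least one modality
    is always kept so the example never becomes empty.
    """
    names = list(sample)
    dropped = [name for name in names if rng.random() < p_drop]
    if len(dropped) == len(names):
        # Every modality was selected for dropping: restore one at random.
        dropped.remove(rng.choice(dropped))
    return {
        name: [0.0] * len(vec) if name in dropped else list(vec)
        for name, vec in sample.items()
    }


def late_fusion(features):
    """Late fusion by concatenating per-modality features in a fixed order."""
    fused = []
    for name in sorted(features):
        fused.extend(features[name])
    return fused
```

Because fusion happens only after each encoder has produced its own features, a per-modality encoder can be swapped out without touching the rest of the pipeline, which is the plug-and-play property the abstract describes.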
Advanced Efficient Strategy for Detection of Dark Objects Based on Spiking Network with Multi-Box Detection
methods: combining spiked and normal convolution layers, pre-trained VGG16 feature extractor
results: 66.01% and 41.25% mAP for detecting 20 different objects in the VOC-12 dataset and 12 objects in the Ex-Dark dataset, respectively; superior performance compared to other state-of-the-art object detection models
Abstract
Several deep learning algorithms have shown amazing performance for existing object detection tasks, but recognizing darker objects is the largest challenge. Moreover, those techniques struggled to detect or had a slow recognition rate, resulting in significant performance losses. As a result, an improved and accurate detection approach is required to address the above difficulty. The whole study proposes a combination of spiked and normal convolution layers as an energy-efficient and reliable object detector model. The proposed model is split into two sections. The first section is developed as a feature extractor, which utilizes pre-trained VGG16, and the second section of the proposal structure is the combination of spiked and normal Convolutional layers to detect the bounding boxes of images. We drew a pre-trained model for classifying detected objects. With state of the art Python libraries, spike layers can be trained efficiently. The proposed spike convolutional object detector (SCOD) has been evaluated on VOC and Ex-Dark datasets. SCOD reached 66.01% and 41.25% mAP for detecting 20 different objects in the VOC-12 and 12 objects in the Ex-Dark dataset. SCOD uses 14 Giga FLOPS for its forward path calculations. Experimental results indicated superior performance compared to Tiny YOLO, Spike YOLO, YOLO-LITE, Tinier YOLO and Center of loc+Xception based on mAP for the VOC dataset.
摘要
多种深度学习算法在现有的目标检测任务上表现出色,但识别较暗的目标仍是最大的挑战。此外,这些技术要么难以检测,要么识别速度较慢,导致性能显著下降。因此,需要一种改进且准确的检测方法来解决上述困难。本研究提出将脉冲(spiking)卷积层与常规卷积层相结合,构建一种能效高且可靠的目标检测模型。该模型分为两部分:第一部分是特征提取器,采用预训练的 VGG16;第二部分将脉冲卷积层与常规卷积层结合,用于检测图像中的边界框。我们使用一个预训练模型对检测到的目标进行分类。借助最新的 Python 库,脉冲层可以高效地训练。我们提出的脉冲卷积目标检测器(SCOD)在 VOC 和 Ex-Dark 数据集上进行了评估:在 VOC-12 数据集上检测 20 类目标、在 Ex-Dark 数据集上检测 12 类目标时,mAP 分别达到 66.01% 和 41.25%。SCOD 的前向计算需要 14 Giga FLOPS。实验结果表明,按 mAP 衡量,SCOD 在 VOC 数据集上的性能优于 Tiny YOLO、Spike YOLO、YOLO-LITE、Tinier YOLO 和 Center of loc+Xception。
Geometrically Aligned Transfer Encoder for Inductive Transfer in Regression Tasks
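results: On various molecular graph datasets, GATE outperforms conventional methods and exhibits stable behavior in both the latent space and extrapolation regions.
Abstract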
Transfer learning is a crucial technique for handling a small amount of data that is potentially related to other abundant data. However, most of the existing methods are focused on classification tasks using images and language datasets. Therefore, in order to expand the transfer learning scheme to regression tasks, we propose a novel transfer technique based on differential geometry, namely the Geometrically Aligned Transfer Encoder (GATE). In this method, we interpret the latent vectors from the model to exist on a Riemannian curved manifold. We find a proper diffeomorphism between pairs of tasks to ensure that every arbitrary point maps to a locally flat coordinate in the overlapping region, allowing the transfer of knowledge from the source to the target data. This also serves as an effective regularizer for the model to behave in extrapolation regions. In this article, we demonstrate that GATE outperforms conventional methods and exhibits stable behavior in both the latent space and extrapolation regions for various molecular graph datasets.
摘要
迁移学习是处理少量数据的一项关键技术,这些数据可能与其他大量数据相关。然而,现有方法大多集中于使用图像和语言数据集的分类任务。因此,为了将迁移学习方案扩展到回归任务,我们提出了一种基于微分几何的新型迁移技术,即几何对齐迁移编码器(Geometrically Aligned Transfer Encoder,GATE)。在该方法中,我们将模型的隐向量解释为位于一个黎曼弯曲流形上。我们在成对的任务之间寻找合适的微分同胚,以保证重叠区域中的任意一点都能映射到局部平坦的坐标,从而实现知识从源数据向目标数据的迁移。这同时也是一种有效的正则化项,使模型在外推区域中表现稳定。在本文中,我们证明了 GATE 在多种分子图数据集上优于传统方法,并且在隐空间和外推区域中均表现出稳定的行为。
Noisy-ArcMix: Additive Noisy Angular Margin Loss Combined With Mixup Anomalous Sound Detection
for: This paper targets unsupervised anomalous sound detection (ASD), which aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations.
methods: The study builds on the self-supervised task of classifying normal data and proposes a training technique that ensures intra-class compactness and increases the angle gap between normal and abnormal samples.
results: Experiments show that the proposed method achieves the best performance on the DCASE 2020 Challenge Task2 dataset, improving on the state-of-the-art method by 0.90%, 0.83%, and 2.16% in AUC, pAUC, and mAUC, respectively.
Abstract
Unsupervised anomalous sound detection (ASD) aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations. Recent approaches have focused on the self-supervised task utilizing the classification of normal data, and advanced models have shown that securing representation space for anomalous data is important, through representation learning that yields compact intra-class and well-separated inter-class distributions. However, we show that conventional approaches often fail to ensure sufficient intra-class compactness and exhibit angular disparity between samples and their corresponding centers. In this paper, we propose a training technique aimed at ensuring intra-class compactness and increasing the angle gap between normal and abnormal samples. Furthermore, we present an architecture that extracts features for important temporal regions, enabling the model to learn which time frames should be emphasized or suppressed. Experimental results demonstrate that the proposed method achieves the best performance, giving 0.90%, 0.83%, and 2.16% improvement in terms of AUC, pAUC, and mAUC, respectively, compared to the state-of-the-art method on the DCASE 2020 Challenge Task2 dataset.
摘要
无监督异常声音检测(ASD)旨在通过学习正常运行声音的特征并感知其偏差来识别异常声音。近期的方法侧重于利用正常数据分类的自监督任务;先进的模型表明,通过表征学习获得类内紧凑、类间分离良好的分布,从而为异常数据留出表征空间十分重要。然而,我们发现传统方法往往无法保证足够的类内紧凑性,并且样本与其对应类中心之间存在角度偏差。在本文中,我们提出了一种训练技术,旨在保证类内紧凑性,并增大正常样本与异常样本之间的角度间隔。此外,我们还提出了一种为重要时间区域提取特征的架构,使模型能够学习应当强调或抑制哪些时间帧。实验结果表明,所提方法取得了最佳性能:在 DCASE 2020 Challenge Task2 数据集上,与最先进的方法相比,AUC、pAUC 和 mAUC 分别提高了 0.90%、0.83% 和 2.16%。
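To make the idea of an angular margin concrete, here is a minimal sketch of a generic additive angular margin (in the style of ArcFace), not the Noisy-ArcMix loss itself, whose exact form the abstract does not give. The function names, the default `margin` and `scale`, and the omission of the usual guard for angles beyond pi are assumptions made for illustration.

```python
import math


def angular_margin_logits(cosines, target, margin=0.5, scale=30.0):
    """Add an angular margin to the target class only.

    `cosines[k]` is the cosine between the embedding and the center of
    class k. The target logit becomes scale * cos(theta + margin); all
    other logits stay at scale * cos(theta).
    """
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos well defined
        if k == target:
            logits.append(scale * math.cos(math.acos(c) + margin))
        else:
            logits.append(scale * c)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one example."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]
```

The margin makes the target class harder to satisfy, so the loss keeps pushing a sample toward its class center even after it is classified correctly. That pressure is what produces tighter intra-class clusters.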
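paper_authors: Arafat Islam, Md. Imtiaz Habib
for: Fire detection, in particular the detection of small flames in outdoor and forest fires.
methods: An improved YOLOv5 fire detection deep learning algorithm, including an expanded feature extraction network and promotion of the feature pyramid.
results: The algorithm detects small fire targets with high accuracy, reaching an mAP of 90.5% and an F1 score of 88%, and supports real-time forest fire detection with an average detection time of 0.12 s per frame.
Abstract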
For the detection of fire-like targets in indoor, outdoor and forest fire images, as well as fire detection under different natural lights, an improved YOLOv5 fire detection deep learning algorithm is proposed. The YOLOv5 detection model expands the feature extraction network from three dimensions, which enhances feature propagation of fire small targets identification, improves network performance, and reduces model parameters. Furthermore, through the promotion of the feature pyramid, the top-performing prediction box is obtained. Fire-YOLOv5 attains excellent results compared to state-of-the-art object detection networks, notably in the detection of small targets of fire and smoke with mAP 90.5% and f1 score 88%. Overall, the Fire-YOLOv5 detection model can effectively deal with the inspection of small fire targets, as well as fire-like and smoke-like objects with F1 score 0.88. When the input image size is 416 x 416 resolution, the average detection time is 0.12 s per frame, which can provide real-time forest fire detection. Moreover, the algorithm proposed in this paper can also be applied to small target detection under other complicated situations. The proposed system shows an improved approach in all fire detection metrics such as precision, recall, and mean average precision.
摘要
为了检测室内、户外和森林火灾图像中的类火目标,以及不同自然光照下的火情,我们提出了一种改进的 YOLOv5 火灾检测深度学习算法。YOLOv5 检测模型从三个维度扩展了特征提取网络,增强了小型火焰目标识别的特征传播,提高了网络性能,并减少了模型参数。此外,通过改进特征金字塔,得到了表现最佳的预测框。与最先进的目标检测网络相比,Fire-YOLOv5 取得了优异的结果,尤其是在火焰和烟雾小目标的检测上,mAP 达到 90.5%,F1 分数达到 88%。总体而言,Fire-YOLOv5 检测模型能够有效地检测小型火焰目标以及类火、类烟物体,F1 分数为 0.88。当输入图像分辨率为 416 x 416 时,平均检测时间为每帧 0.12 秒,可实现实时森林火灾检测。此外,本文提出的算法还可以应用于其他复杂情形下的小目标检测。所提出的系统在精确率、召回率和平均精度均值(mean average precision)等所有火灾检测指标上均有改进。
Filter Pruning For CNN With Enhanced Linear Representation Redundancy
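results: On the Cifar-10 dataset, the method achieves 93.64% accuracy for pruned VGG-16 with only 1.40M parameters and 49.60M FLOPs. For ResNet-50 on the ImageNet dataset, it reduces storage and computation by 42.8% and 47.3%, respectively, with an accuracy of 76.23%.
Abstract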
Structured network pruning outperforms non-structured methods because it can take advantage of thriving parallel computing techniques. In this paper, we propose a new structured pruning method. Firstly, to create more structured redundancy, we present a data-driven loss function term calculated from the correlation coefficient matrix of different feature maps in the same layer, named CCM-loss. This loss term can encourage the neural network to learn stronger linear representation relations between feature maps during training from scratch, so that more homogeneous parts can be removed later in pruning. CCM-loss provides us with another universal transcendental mathematical tool besides L*-norm regularization, which concentrates on generating zeros, to generate more redundancy, but of a different kind. Furthermore, we design a matching channel selection strategy based on principal components analysis to exploit the maximum potential of CCM-loss. In our new strategy, we mainly focus on the consistency and integrality of the information flow in the network. Instead of empirically hard-coding the retain ratio for each layer, our channel selection strategy can dynamically adjust each layer's retain ratio according to the specific circumstances of a pre-trained model to push the prune ratio to the limit. Notably, on the Cifar-10 dataset, our method brings 93.64% accuracy for pruned VGG-16 with only 1.40M parameters and 49.60M FLOPs; the pruned ratios for parameters and FLOPs are 90.6% and 84.2%, respectively. For ResNet-50 trained on the ImageNet dataset, our approach achieves 42.8% and 47.3% storage and computation reductions, respectively, with an accuracy of 76.23%. Our code is available at https://github.com/Bojue-Wang/CCM-LRR.
摘要
结构化网络剪枝优于非结构化方法,因为它可以利用蓬勃发展的并行计算技术。在本文中,我们提出了一种新的结构化剪枝方法。首先,为了产生更多结构化冗余,我们提出了一个数据驱动的损失函数项,它由同一层中不同特征图的相关系数矩阵计算得到,称为 CCM-loss。该损失项可以促使神经网络在从头训练的过程中学习特征图之间更强的线性表示关系,从而在之后的剪枝中去除更多同质的部分。除了专注于产生零值的 L*-norm 正则化之外,CCM-loss 为我们提供了另一种通用的超越性(transcendental)数学工具,用于产生另一种类型的冗余。此外,我们设计了一种与之匹配的、基于主成分分析的通道选择策略,以充分发挥 CCM-loss 的潜力。在新策略中,我们主要关注网络中信息流的一致性和完整性。我们的通道选择策略不是凭经验硬编码每层的保留比例,而是可以根据预训练模型的具体情况动态调整各层的保留比例,从而把剪枝比例推向极限。值得注意的是,在 Cifar-10 数据集上,我们的方法使剪枝后的 VGG-16 达到 93.64% 的准确率,仅有 1.40M 个参数和 49.60M FLOPs,参数和 FLOPs 的剪枝比例分别为 90.6% 和 84.2%。对于在 ImageNet 数据集上训练的 ResNet-50,我们的方法分别实现了 42.8% 和 47.3% 的存储和计算量缩减,准确率为 76.23%。我们的代码可以在 https://github.com/Bojue-Wang/CCM-LRR 上获取。
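A loss term built from the correlation coefficient matrix of feature maps might look roughly like the sketch below. The abstract does not give the exact CCM-loss formula, so the penalty used here (one minus the mean absolute off-diagonal correlation) is an assumption chosen only to show the mechanism; `pearson`, `correlation_matrix`, and `redundancy_penalty` are hypothetical names, and each feature map is assumed to be flattened into a plain list.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant map carries no linear relation
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(feature_maps):
    """Correlation coefficient matrix between flattened feature maps."""
    return [[pearson(a, b) for b in feature_maps] for a in feature_maps]


def redundancy_penalty(feature_maps):
    """Illustrative penalty: low when maps in a layer are linearly related."""
    ccm = correlation_matrix(feature_maps)
    c = len(feature_maps)
    off_diag = [abs(ccm[i][j]) for i in range(c) for j in range(c) if i != j]
    return 1.0 - sum(off_diag) / len(off_diag)
```

Minimizing a term of this kind rewards layers whose channels are close to linear combinations of each other, which is the redundancy that a later channel selection step can remove.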
Contrastive Prompt Learning-based Code Search based on Interaction Matrix
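results: Extensive experiments on a real-world dataset show that the method improves the quality of semantic representation and the mapping ability between natural language and programming language.
Abstract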
Code search aims to retrieve the code snippet that highly matches the given query described in natural language. Recently, many code pre-training approaches have demonstrated impressive performance on code search. However, existing code search methods still suffer from two performance constraints: inadequate semantic representation and the semantic gap between natural language (NL) and programming language (PL). In this paper, we propose CPLCS, a contrastive prompt learning-based code search method based on the cross-modal interaction mechanism. CPLCS comprises:(1) PL-NL contrastive learning, which learns the semantic matching relationship between PL and NL representations; (2) a prompt learning design for a dual-encoder structure that can alleviate the problem of inadequate semantic representation; (3) a cross-modal interaction mechanism to enhance the fine-grained mapping between NL and PL. We conduct extensive experiments to evaluate the effectiveness of our approach on a real-world dataset across six programming languages. The experiment results demonstrate the efficacy of our approach in improving semantic representation quality and mapping ability between PL and NL.
摘要
代码搜索旨在检索与给定自然语言查询高度匹配的代码片段。近来,许多代码预训练方法在代码搜索上展现了令人印象深刻的性能。然而,现有的代码搜索方法仍然受到两方面的性能制约:语义表示不足,以及自然语言(NL)与编程语言(PL)之间的语义鸿沟。本文提出了 CPLCS,一种基于跨模态交互机制的对比提示学习代码搜索方法。CPLCS 包括:(1)PL-NL 对比学习,学习 PL 与 NL 表示之间的语义匹配关系;(2)面向双编码器结构的提示学习设计,可缓解语义表示不足的问题;(3)跨模态交互机制,用于增强 NL 与 PL 之间的细粒度映射。我们在一个涵盖六种编程语言的真实世界数据集上进行了大量实验来评估方法的有效性。实验结果表明,我们的方法能够提升语义表示质量以及 PL 与 NL 之间的映射能力。
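The PL-NL contrastive learning component can be illustrated with a standard in-batch InfoNCE loss. This is a sketch under assumptions, not the CPLCS objective itself: the abstract does not specify the loss, and the function names, the cosine similarity, and the `temperature` value are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(code_vecs, query_vecs, temperature=0.07):
    """In-batch contrastive loss: query i should match code i.

    Every other code snippet in the batch acts as a negative example
    for query i.
    """
    n = len(code_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(query_vecs[i], code_vecs[j]) / temperature
                  for j in range(n)]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[i]
    return total / n
```

With a dual-encoder structure, `code_vecs` and `query_vecs` would come from the PL and NL encoders respectively, so retrieval at search time reduces to a nearest-neighbor lookup over precomputed code embeddings.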
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
results: On the multimodal named entity recognition datasets Twitter-2015 and Twitter-2017 and the multimodal relation extraction dataset MNRE, the proposed I2SRM method achieves competitive results: F1-scores of 77.12% on Twitter-2015, 88.40% on Twitter-2017, and 84.12% on MNRE.
Abstract
Multimodal information extraction is attracting research attention nowadays, which requires aggregating representations from different modalities. In this paper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM) method for this task, which contains two modules. Firstly, the intra-sample relationship modeling module operates on a single sample and aims to learn effective representations. Embeddings from textual and visual modalities are shifted to bridge the modality gap caused by distinct pre-trained language and image models. Secondly, the inter-sample relationship modeling module considers relationships among multiple samples and focuses on capturing the interactions. An AttnMixup strategy is proposed, which not only enables collaboration among samples but also augments data to improve generalization. We conduct extensive experiments on the multimodal named entity recognition datasets Twitter-2015 and Twitter-2017, and the multimodal relation extraction dataset MNRE. Our proposed method I2SRM achieves competitive results, 77.12% F1-score on Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
摘要
如今,多模态信息抽取正受到研究者的关注,它需要聚合来自不同模态的表示。在本文中,我们针对该任务提出了样本内与样本间关系建模(Intra- and Inter-Sample Relationship Modeling,I2SRM)方法,它包含两个模块。首先,样本内关系建模模块作用于单个样本,旨在学习有效的表示。文本模态和视觉模态的嵌入会被平移,以弥合由不同的预训练语言模型和图像模型造成的模态鸿沟。其次,样本间关系建模模块考虑多个样本之间的关系,重点在于捕捉其交互。我们提出了 AttnMixup 策略,它不仅使样本之间能够协作,还能对数据进行增强以提升泛化能力。我们在多模态命名实体识别数据集 Twitter-2015 和 Twitter-2017 以及多模态关系抽取数据集 MNRE 上进行了大量实验。我们提出的 I2SRM 方法取得了有竞争力的结果:在 Twitter-2015 上 F1 分数为 77.12%,在 Twitter-2017 上为 88.40%,在 MNRE 上为 84.12%。
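As background for the AttnMixup strategy, the sketch below shows plain mixup, the data augmentation that AttnMixup's name suggests it builds on. The attention-based part is not described in the abstract and is not reproduced here; the function name, the Beta parameter `alpha`, and the use of one-hot label vectors are assumptions.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Plain mixup: a convex combination of two samples and their labels.

    The mixing weight `lam` is drawn from Beta(alpha, alpha); features
    and label vectors are interpolated with the same weight.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Interpolating pairs of samples produces virtual training examples between the originals, which is one way data augmentation can improve generalization.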
Predicting Three Types of Freezing of Gait Events Using Deep Learning Models
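paper_authors: Wen Tao Mo, Jonathan H. Chan
for: This paper aims to predict freezing of gait, a Parkinson's Disease symptom in which a patient becomes unable to step or turn while walking, and to predict the different types of freezing of gait events.
methods: The study develops deep learning models that combine the transformer encoder architecture with Bidirectional LSTM layers and different feature sets to predict the three types of freezing of gait events.
results: The best performing model achieves a score of 0.427 on testing data, which would rank top 5 in Kaggle's Freezing of Gait prediction competition. The authors also recognize overfitting to the training data, which could potentially be improved through pseudo labelling on additional data and model architecture simplification.
Abstract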
Freezing of gait is a Parkinson's Disease symptom that episodically inflicts a patient with the inability to step or turn while walking. While medical experts have discovered various triggers and alleviating actions for freezing of gait, the underlying causes and prediction models are still being explored today. Current freezing of gait prediction models that utilize machine learning achieve high sensitivity and specificity in freezing of gait predictions based on time-series data; however, these models lack specifications on the type of freezing of gait events. We develop various deep learning models using the transformer encoder architecture plus Bidirectional LSTM layers and different feature sets to predict the three different types of freezing of gait events. The best performing model achieves a score of 0.427 on testing data, which would rank top 5 in Kaggle's Freezing of Gait prediction competition, hosted by THE MICHAEL J. FOX FOUNDATION. However, we also recognize overfitting in training data that could be potentially improved through pseudo labelling on additional data and model architecture simplification.
摘要
步态冻结是帕金森病的一种症状,它会间歇性地发作,使患者在行走时无法迈步或转身。尽管医学专家已经发现了步态冻结的多种诱因和缓解措施,其根本原因和预测模型至今仍在探索之中。现有的基于机器学习的步态冻结预测模型能够基于时间序列数据取得较高的敏感度和特异度;然而,这些模型并不区分步态冻结事件的类型。我们使用 transformer encoder 架构加 Bidirectional LSTM 层以及不同的特征集,开发了多种深度学习模型,用于预测三种不同类型的步态冻结事件。表现最佳的模型在测试数据上取得了 0.427 的分数,在由 THE MICHAEL J. FOX FOUNDATION 主办的 Kaggle 步态冻结预测竞赛中可排名前 5。然而,我们也发现模型对训练数据存在过拟合,这或许可以通过在额外数据上进行伪标注以及简化模型架构来改善。
Dobby: A Conversational Service Robot Driven by GPT-4
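results: The study reports that, in a free-form tour-guide scenario, the robot with conversational AI capabilities scored higher than the robot without them on overall effectiveness, exploration abilities, scrutinization abilities, receptiveness to personification, and adaptability.
Abstract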
This work introduces a robotics platform which embeds a conversational AI agent in an embodied system for natural language understanding and intelligent decision-making for service tasks; integrating task planning and human-like conversation. The agent is derived from a large language model, which has learned from a vast corpus of general knowledge. In addition to generating dialogue, this agent can interface with the physical world by invoking commands on the robot; seamlessly merging communication and behavior. This system is demonstrated in a free-form tour-guide scenario, in an HRI study combining robots with and without conversational AI capabilities. Performance is measured along five dimensions: overall effectiveness, exploration abilities, scrutinization abilities, receptiveness to personification, and adaptability.
摘要
本工作介绍了一个机器人平台,它将对话式 AI 智能体嵌入具身系统,用于服务任务中的自然语言理解和智能决策,并将任务规划与类人对话相结合。该智能体源自一个大语言模型,该模型从海量的通用知识语料中学习而来。除了生成对话之外,该智能体还可以通过调用机器人上的命令与物理世界交互,从而将交流与行为无缝融合。该系统在一个自由形式的导览场景中进行了演示,并在一项人机交互(HRI)研究中对具备和不具备对话式 AI 能力的机器人进行了比较。性能从五个维度进行衡量:总体有效性、探索能力、审视能力、对拟人化的接受度以及适应性。
Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition
paper_authors: Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, Daniel Murfet
for: The paper investigates phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT).
methods: The paper derives a closed formula for the theoretical loss and, in the case of two hidden dimensions, discovers that regular $k$-gons are critical points.
results: The paper presents supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. Empirical results show that the same $k$-gon critical points also determine the behavior of SGD training.
Abstract
We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, discover that regular $k$-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show empirically that the same $k$-gon critical points also determine the behavior of SGD training. The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism. Specifically, we find that the learning process in TMS, be it through SGD or Bayesian learning, can be characterized by a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.
摘要
我们使用奇异学习理论(Singular Learning Theory,SLT)研究叠加玩具模型(Toy Model of Superposition,TMS)中的相变。我们推导出理论损失的闭式公式,并在两个隐藏维度的情形下发现正 $k$ 边形是临界点。我们给出了支持性的理论,表明这些 $k$ 边形的局部学习系数(一种几何不变量)决定了贝叶斯后验随训练样本量变化而发生的相变。随后我们通过实验表明,同样的 $k$ 边形临界点也决定了 SGD 训练的行为。由此呈现的图景为如下猜想增添了证据:SGD 的学习轨迹受一种顺序学习机制支配。具体而言,我们发现 TMS 中的学习过程,无论是通过 SGD 还是贝叶斯学习,都可以刻画为参数空间中的一段旅程:从高损失、低复杂度的区域走向低损失、高复杂度的区域。
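To picture what a regular $k$-gon configuration means with two hidden dimensions, the sketch below assumes the commonly used form of the toy model, a reconstruction ReLU(W^T W x + b) with W of shape 2 x n, and places the columns of W on the vertices of a regular polygon. The model form, the one-hot evaluation inputs, and all function names are assumptions for illustration; the paper's closed-form loss and data distribution are not reproduced here.

```python
import math


def kgon_weights(k, radius=1.0):
    """Columns of W placed on the vertices of a regular k-gon in the plane."""
    return [(radius * math.cos(2 * math.pi * i / k),
             radius * math.sin(2 * math.pi * i / k)) for i in range(k)]


def reconstruct(weights, bias, x):
    """Assumed toy-model forward pass: ReLU(W^T W x + b), two hidden dims."""
    hidden = [sum(w[d] * xi for w, xi in zip(weights, x)) for d in range(2)]
    return [max(0.0, w[0] * hidden[0] + w[1] * hidden[1] + b)
            for w, b in zip(weights, bias)]


def one_hot_loss(weights, bias):
    """Mean squared reconstruction error over one-hot inputs (illustrative)."""
    n = len(weights)
    total = 0.0
    for i in range(n):
        x = [1.0 if j == i else 0.0 for j in range(n)]
        y = reconstruct(weights, bias, x)
        total += sum((a - c) ** 2 for a, c in zip(x, y))
    return total / n
```

In this arrangement, neighboring features interfere because their directions are not orthogonal, and a negative bias can suppress part of that interference at the cost of shrinking the correct output.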
Suppressing Overestimation in Q-Learning through Adversarial Behaviors
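results: Experiments in various benchmark environments show that the proposed DAQ effectively suppresses the overestimation bias and can be easily applied to off-the-shelf reinforcement learning algorithms to improve performance.
Abstract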
The goal of this paper is to propose a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, the learning can be formulated as a two-player zero-sum game. The proposed DAQ unifies several Q-learning variations that control overestimation biases, such as maxmin Q-learning and minmax Q-learning (proposed in this paper), in a single framework. The proposed DAQ is a simple but effective way to suppress the overestimation bias through dummy adversarial behaviors, and it can be easily applied to off-the-shelf reinforcement learning algorithms to improve their performance. The finite-time convergence of DAQ is analyzed from an integrated perspective by adapting an adversarial Q-learning analysis. The performance of the suggested DAQ is empirically demonstrated under various benchmark environments.
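For orientation, here is a minimal tabular sketch of maxmin Q-learning, one of the existing variations the abstract says DAQ unifies. It is not DAQ itself, whose update rule the abstract does not spell out. The nested-list table layout, the function names, and the default `alpha` and `gamma` are assumptions.

```python
import random


def maxmin_target(q_tables, next_state, reward, gamma=0.99, done=False):
    """Bootstrapped target using the minimum over several Q estimates.

    For each action, take the smallest value across the tables, then
    maximize over actions. The pessimistic minimum counteracts the
    overestimation caused by maximizing over noisy estimates.
    """
    if done:
        return reward
    n_actions = len(q_tables[0][next_state])
    q_min = [min(q[next_state][a] for q in q_tables) for a in range(n_actions)]
    return reward + gamma * max(q_min)


def maxmin_update(q_tables, state, action, reward, next_state,
                  alpha=0.1, gamma=0.99, done=False, rng=random):
    """Move one randomly chosen table toward the shared maxmin target."""
    target = maxmin_target(q_tables, next_state, reward, gamma, done)
    q = rng.choice(q_tables)
    q[state][action] += alpha * (target - q[state][action])
    return target
```

With the two small tables used in a quick check, standard Q-learning on the first table alone would bootstrap from its largest next-state value, while the maxmin target bootstraps from a smaller one, which is the bias-reducing effect in miniature.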
Summary
This paper proposes a new Q-learning algorithm, dummy adversarial Q-learning (DAQ), that effectively regulates the overestimation bias of standard Q-learning. Introducing a dummy adversarial player lets learning be formulated as a two-player zero-sum game. DAQ brings several Q-learning variants that control overestimation bias, such as maxmin Q-learning and the minmax Q-learning proposed in the paper, into a single framework. It is a simple but effective way to suppress overestimation bias through dummy adversarial behaviors, and it can be easily applied to off-the-shelf reinforcement learning algorithms to improve performance. The finite-time convergence of DAQ is analyzed from an integrated perspective, and its performance is demonstrated empirically in a variety of benchmark environments.
BC4LLM: Trusted Artificial Intelligence When Blockchain Meets Large Language Models
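methods: The paper uses blockchain technology to address the authenticity and reliability of AI learning data, covering a reliable learning corpus, a secure training process, and identifiable generated content.
results: The paper anticipates that a blockchain-based empowerment scheme can help make AI trustworthy and secure, and it reviews potential applications and challenges in frontier communication networks.
Abstract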
In recent years, artificial intelligence (AI) and machine learning (ML) have been reshaping society's production methods and productivity and changing the paradigm of scientific research. Among them, the AI language model represented by ChatGPT has made great progress. Such large language models (LLMs) serve people in the form of AI-generated content (AIGC) and are widely used in consulting, healthcare, and education. However, it is difficult to guarantee the authenticity and reliability of AIGC learning data. In addition, there are hidden dangers of privacy disclosure in distributed AI training. Moreover, the content generated by LLMs is difficult to identify and trace, and cross-platform mutual recognition is difficult. In the coming era of AI powered by LLMs, these information security issues will be infinitely amplified and affect everyone's life. Therefore, we consider empowering LLMs using blockchain technology with superior security features to propose a vision for trusted AI. This paper mainly introduces the motivation and technical route of blockchain for LLM (BC4LLM), including reliable learning corpus, secure training process, and identifiable generated content. Meanwhile, this paper also reviews the potential applications and future challenges, especially in the frontier communication networks field, including network resource allocation, dynamic spectrum sharing, and semantic communication. Combined with the prospects of blockchain and LLMs, this work is expected to help the early realization of trusted AI and provide guidance for the academic community.
Summary
In recent years, AI and ML have profoundly affected society's production methods and productivity and changed the paradigm of scientific research. Large language models (LLMs) such as ChatGPT have made great progress; they serve people in the form of AI-generated content (AIGC) and are widely used in consulting, healthcare, and education. However, the authenticity and reliability of AIGC learning data are hard to guarantee, distributed AI training carries hidden risks of privacy disclosure, and LLM-generated content is difficult to identify, trace, and recognize across platforms. The authors therefore propose empowering LLMs with blockchain technology as a vision for trusted AI. The paper introduces the motivation and technical route of blockchain for LLM (BC4LLM), covering a reliable learning corpus, a secure training process, and identifiable generated content. It also reviews potential applications and future challenges in frontier communication networks, including network resource allocation, dynamic spectrum sharing, and semantic communication.
Let Models Speak Ciphers: Multiagent Debate through Embeddings
paper_authors: Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang
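for: Improving the reasoning ability of large language models (LLMs).
methods: Remove the token sampling step from LLMs and let them communicate their beliefs through the expectation of the raw transformer output embeddings.
results: Across five reasoning tasks and multiple open-source LLMs, CIPHER debate extends the lead of natural-language debate over traditional inference by a further 1-3.5%, showing the advantage and robustness of embeddings as an alternative "language" for communication among LLMs.
Abstract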
Discussion and debate among Large Language Models (LLMs) have gained considerable attention due to their potential to enhance the reasoning ability of LLMs. Although natural language is an obvious choice for communication due to LLMs' language understanding capability, the token sampling step needed when generating natural language poses a potential risk of information loss, as it uses only one token to represent the model's belief across the entire vocabulary. In this paper, we introduce a communication regime named CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. Specifically, we remove the token sampling step from LLMs and let them communicate their beliefs across the vocabulary through the expectation of the raw transformer output embeddings. Remarkably, by deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights. While the state-of-the-art LLM debate methods using natural language outperform traditional inference by a margin of 1.5-8%, our experiment results show that CIPHER debate further extends this lead by 1-3.5% across five reasoning tasks and multiple open-source LLMs of varying sizes. This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
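Summary
Discussion and debate among large language models (LLMs) have attracted wide attention because they can improve LLMs' reasoning ability. Natural language is the obvious medium given LLMs' language understanding, but the token sampling step used to generate it risks losing information, since a single token stands in for the model's belief over the entire vocabulary. The paper introduces a communication protocol named CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this. Specifically, the token sampling step is removed and the models exchange beliefs through the expectation of the raw transformer output embeddings, which encodes a broader range of information without modifying the model weights. Experiments show that CIPHER debate extends the lead of state-of-the-art natural-language debate by a further 1-3.5% across five reasoning tasks and multiple open-source LLMs, indicating that embeddings are an effective and robust alternative "language" for communication among LLMs.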
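The idea of replacing a sampled token with "the expectation of the raw transformer output embeddings" can be illustrated with a small sketch: instead of picking one token, weight every vocabulary embedding by its probability and sum. The tiny embedding table, the `temperature` parameter, and the function names are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(logits, embedding_table, temperature=1.0):
    """Probability-weighted average of the vocabulary embeddings.

    embedding_table[i] is the embedding vector of token i; the result
    lies in the convex hull of those vectors.
    """
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]
```

When the distribution is sharply peaked the result is close to a single token's embedding, and when the model is uncertain it blends several candidates, which is the extra information a sampled token would discard.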
Towards Mitigating Hallucination in Large Language Models via Self-Reflection
paper_authors: Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung
for: This paper focuses on the issue of hallucination in medical generative question-answering systems, and proposes an interactive self-reflection methodology to improve the factuality and consistency of the generated answers.
methods: The paper uses widely adopted large language models (LLMs) and datasets, and employs an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation to tackle the challenge of hallucination.
results: The experimental results show that the proposed approach outperforms baselines in reducing hallucination, and produces more accurate and consistent answers.
Abstract
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks. However, the practical deployment still faces challenges, notably the issue of "hallucination", where models generate plausible-sounding but unfaithful or nonsensical information. This issue becomes particularly critical in the medical domain due to the uncommon professional concepts and potential social risks involved. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets. Our investigation centers on the identification and comprehension of common problematic answers, with a specific emphasis on hallucination. To tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. Through this feedback process, our approach steadily enhances the factuality, consistency, and entailment of the generated answers. Consequently, we harness the interactivity and multitasking ability of LLMs and produce progressively more precise and accurate answers. Experimental results on both automatic and human evaluation demonstrate the superiority of our approach in hallucination reduction compared to baselines.
The AI Incident Database as an Educational Tool to Raise Awareness of AI Harms: A Classroom Exploration of Efficacy, Limitations, & Future Improvements
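paper_authors: Michael Feffer, Nikolas Martelaro, Hoda Heidari
for: This paper aims to raise awareness of the harms that can arise from deploying AI technologies and of the need to design safe, dependable AI systems.
methods: The paper uses the AI Incident Database (AIID) as a teaching tool to help students better understand the harms AI technologies can cause in socially high-stakes domains.
results: The study finds that after interacting with AIID, students are more aware of the magnitude and severity of AI harms and feel a stronger urgency to design safe, dependable AI systems.
Abstract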
Prior work has established the importance of integrating AI ethics topics into computer and data sciences curricula. We provide evidence suggesting that one of the critical objectives of AI Ethics education must be to raise awareness of AI harms. While there are various sources to learn about such harms, The AI Incident Database (AIID) is one of the few attempts at offering a relatively comprehensive database indexing prior instances of harms or near harms stemming from the deployment of AI technologies in the real world. This study assesses the effectiveness of AIID as an educational tool to raise awareness regarding the prevalence and severity of AI harms in socially high-stakes domains. We present findings obtained through a classroom study conducted at an R1 institution as part of a course focused on the societal and ethical considerations around AI and ML. Our qualitative findings characterize students' initial perceptions of core topics in AI ethics and their desire to close the educational gap between their technical skills and their ability to think systematically about ethical and societal aspects of their work. We find that interacting with the database helps students better understand the magnitude and severity of AI harms and instills in them a sense of urgency around (a) designing functional and safe AI and (b) strengthening governance and accountability mechanisms. Finally, we compile students' feedback about the tool and our class activity into actionable recommendations for the database development team and the broader community to improve awareness of AI harms in AI ethics education.
Summary
We conducted a classroom study at an R1 institution as part of a course focused on the societal and ethical considerations around AI and ML. Our findings show that interacting with the database helps students better understand the magnitude and severity of AI harms and instills in them a sense of urgency around designing functional and safe AI and strengthening governance and accountability mechanisms. Our qualitative findings also reveal that students desire to close the educational gap between their technical skills and their ability to think systematically about ethical and societal aspects of their work. We compile students' feedback about the tool and our class activity into actionable recommendations for the database development team and the broader community to improve awareness of AI harms in AI ethics education. In conclusion, our study demonstrates that the AIID is an effective educational tool to raise awareness of AI harms in socially high-stakes domains, and highlights the importance of incorporating AI ethics education into computer and data sciences curricula to address the urgent need for ethical and responsible AI development.
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
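results: Experiments show that in practical scenarios such as code generation, code translation, code commenting, and test-case generation, CodeFuse-13B handles Chinese prompts better than other models, and it achieves a HumanEval pass@1 score of 37.10%, placing it among the top multilingual code LLMs of similar parameter size.
Abstract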
Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-x, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from the AntGroup's software development process where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs with similar parameter sizes. In practical scenarios, such as code generation, code translation, code comments, and testcase generation, CodeFuse performs better than other models when confronted with Chinese prompts.
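Summary
Code large language models (Code LLMs) have gained significant attention in industry because of their wide applications across the full lifecycle of software engineering. However, how well existing models understand non-English inputs in multilingual code-related tasks is still not well studied. The paper introduces CodeFuse-13B, an open-source pre-trained code LLM. It handles both English and Chinese prompts, supports over 40 programming languages, and owes its effectiveness to a high-quality pre-training dataset and optimization during training. Extensive experiments use real-world usage scenarios, the industry-standard benchmark HumanEval-x, and the specially designed CodeFuseEval. To assess effectiveness, the authors actively collected human feedback from AntGroup's software development process. CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, placing it among the top multilingual code LLMs of similar parameter size. In practical scenarios such as code generation, code translation, code commenting, and test-case generation, CodeFuse performs better than other models on Chinese prompts.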
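Since the headline number is a pass@1 score, a short sketch of how pass@k is commonly estimated may help. The function below uses the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), for n generated samples of which c pass the unit tests. How many samples per problem the CodeFuse evaluation drew is not stated here, so the numbers in the test are purely illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems.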
Self-Discriminative Modeling for Anomalous Graph Detection
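results: The paper proposes three algorithms with different computational efficiency and stability; compared with several state-of-the-art graph-level anomaly detection baselines, they significantly improve AUC.
Abstract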
This paper studies the problem of detecting anomalous graphs using a machine learning model trained on only normal graphs, which has many applications in molecule, biology, and social network data analysis. We present a self-discriminative modeling framework for anomalous graph detection. The key idea, mathematically and numerically illustrated, is to learn a discriminator (classifier) from the given normal graphs together with pseudo-anomalous graphs generated by a model jointly trained, where we never use any true anomalous graphs and we hope that the generated pseudo-anomalous graphs interpolate between normal ones and (real) anomalous ones. Under the framework, we provide three algorithms with different computational efficiencies and stabilities for anomalous graph detection. The three algorithms are compared with several state-of-the-art graph-level anomaly detection baselines on nine popular graph datasets (four with small size and five with moderate size) and show significant improvement in terms of AUC. The success of our algorithms stems from the integration of the discriminative classifier and the well-posed pseudo-anomalous graphs, which provide new insights for anomaly detection. Moreover, we investigate our algorithms for large-scale imbalanced graph datasets. Surprisingly, our algorithms, though fully unsupervised, are able to significantly outperform supervised learning algorithms of anomalous graph detection. The corresponding reason is also analyzed.
Summary
Under this framework, we present three algorithms with different computational efficiencies and stabilities for anomalous graph detection. These algorithms are compared with several state-of-the-art graph-level anomaly detection baselines on nine popular graph datasets, and they show significant improvement in terms of AUC. The success of our algorithms is due to the integration of the discriminative classifier and the well-posed pseudo-anomalous graphs, which provide new insights for anomaly detection. Moreover, we investigate our algorithms for large-scale imbalanced graph datasets and find that they can significantly outperform supervised learning algorithms for anomalous graph detection. We also analyze the reason for this surprising result.
Get the gist? Using large language models for few-shot decontextualization
results: A few-shot method achieves viable decontextualization performance across multiple domains using only a small set of examples.
Abstract
In many NLP applications that involve interpreting sentences within a rich context (for instance, information retrieval systems or dialogue systems), it is desirable to be able to preserve the sentence in a form that can be readily understood without context, for later reuse, a process known as "decontextualization". While previous work demonstrated that generative Seq2Seq models could effectively perform decontextualization after being fine-tuned on a specific dataset, this approach requires expensive human annotations and may not transfer to other domains. We propose a few-shot method of decontextualization using a large language model, and present preliminary results showing that this method achieves viable performance on multiple domains using only a small set of examples.
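Summary
In many natural language processing (NLP) applications, such as information retrieval or dialogue systems, it is desirable to preserve a sentence in a form that can be understood without its context so that it can be reused later, a process known as "decontextualization". Previous work showed that generative Seq2Seq models can perform decontextualization after fine-tuning, but that approach requires expensive human annotations and may not transfer to other domains. The authors propose a few-shot method of decontextualization using a large language model and present preliminary results showing that it achieves viable performance on multiple domains using only a small set of examples.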
We are what we repeatedly do: Inducing and deploying habitual schemas in persona-based responses
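results: The authors propose generating generic passages from simple facts, inducing schemas from those passages, and then retrieving relevant schemas to condition a language model's responses. The approach is intended to help a dialogue system express a persona more naturally.
Abstract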
Many practical applications of dialogue technology require the generation of responses according to a particular developer-specified persona. While a variety of personas can be elicited from recent large language models, the opaqueness and unpredictability of these models make it desirable to be able to specify personas in an explicit form. In previous work, personas have typically been represented as sets of one-off pieces of self-knowledge that are retrieved by the dialogue system for use in generation. However, in realistic human conversations, personas are often revealed through story-like narratives that involve rich habitual knowledge -- knowledge about kinds of events that an agent often participates in (e.g., work activities, hobbies, sporting activities, favorite entertainments, etc.), including typical goals, sub-events, preconditions, and postconditions of those events. We capture such habitual knowledge using an explicit schema representation, and propose an approach to dialogue generation that retrieves relevant schemas to condition a large language model to generate persona-based responses. Furthermore, we demonstrate a method for bootstrapping the creation of such schemas by first generating generic passages from a set of simple facts, and then inducing schemas from the generated passages.
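Summary
Many practical applications of dialogue technology need to generate responses according to a developer-specified persona. Recent large language models can produce a variety of personas, but their opaqueness and unpredictability make it desirable to specify personas in an explicit form. In previous work, personas were usually represented as sets of one-off pieces of self-knowledge retrieved by the dialogue system. In real human conversations, however, personas are often revealed through rich habitual knowledge about the kinds of events an agent regularly takes part in (work, hobbies, sports, favorite entertainments, and so on), including the typical goals, sub-events, preconditions, and postconditions of those events. The authors capture this habitual knowledge with an explicit schema representation and propose a dialogue generation approach that retrieves relevant schemas to condition a large language model to generate persona-based responses. They also present a method for bootstrapping schema creation: first generate generic passages from a set of simple facts, then induce schemas from the generated passages.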
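The schema components listed above (typical goals, sub-events, preconditions, postconditions) suggest a simple record type plus a retrieval step. The sketch below is one possible shape under stated assumptions: the class name, field names, word-overlap scoring, and prompt layout are all hypothetical and are not taken from the paper, which does not describe its representation or retriever at this level of detail.

```python
from dataclasses import dataclass, field

@dataclass
class HabitualSchema:
    """Hypothetical record for one kind of event a persona often takes part in."""
    event: str
    goals: list = field(default_factory=list)
    sub_events: list = field(default_factory=list)
    preconditions: list = field(default_factory=list)
    postconditions: list = field(default_factory=list)

    def words(self):
        """All lowercase words appearing anywhere in the schema."""
        parts = [self.event, *self.goals, *self.sub_events,
                 *self.preconditions, *self.postconditions]
        return set(" ".join(parts).lower().split())

def retrieve(schemas, utterance, top_k=1):
    """Rank schemas by word overlap with the utterance (a stand-in retriever)."""
    query = set(utterance.lower().split())
    ranked = sorted(schemas, key=lambda s: len(query & s.words()), reverse=True)
    return ranked[:top_k]

def build_prompt(schema, utterance):
    """Assemble a conditioning prompt for a language model from one schema."""
    lines = [f"Persona habit: {schema.event}",
             f"Goals: {'; '.join(schema.goals)}",
             f"Steps: {'; '.join(schema.sub_events)}",
             f"User: {utterance}",
             "Persona:"]
    return "\n".join(lines)
```

A real system would likely use embedding-based retrieval instead of word overlap; the overlap score is used here only to keep the sketch dependency-free.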
Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction
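paper_authors: Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, Yonghui Wu
for: This study aims to develop soft prompt-based learning algorithms for large language models (LLMs) and to examine the shape of prompts, prompt tuning with frozen and unfrozen LLMs, transfer learning, and few-shot learning abilities.
methods: The authors develop a soft prompt-based LLM model and compare four training strategies: (1) fine-tuning without prompts; (2) hard prompts with unfrozen LLMs; (3) soft prompts with unfrozen LLMs; and (4) soft prompts with frozen LLMs. They evaluate seven pretrained LLMs on two benchmark datasets.
results: With unfrozen LLMs, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. With frozen LLMs, small (345 million parameter) models trail unfrozen models by a wide margin, and scaling to billions of parameters is needed to close the gap. In cross-institution evaluation, soft prompting with a frozen GatorTron-8.9B model performs best.
Abstract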
Objective: To develop soft prompt-based learning algorithms for large language models (LLMs), examine the shape of prompts, prompt-tuning using frozen/unfrozen LLMs, transfer learning, and few-shot learning abilities. Methods: We developed a soft prompt-based LLM model and compared 4 training strategies including (1) fine-tuning without prompts; (2) hard-prompt with unfrozen LLMs; (3) soft-prompt with unfrozen LLMs; and (4) soft-prompt with frozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for clinical concept and relation extraction on two benchmark datasets. We evaluated the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting. We also assessed the few-shot learning ability. Results and Conclusion: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. When LLMs are frozen, small (i.e., 345 million parameters) LLMs have a big gap to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen LLMs. For cross-institute evaluation, soft prompting with a frozen GatorTron-8.9B model achieved the best performance. This study demonstrates that (1) machines can learn soft prompts better than humans, (2) frozen LLMs have better few-shot learning ability and transfer learning ability to facilitate multi-institution applications, and (3) frozen LLMs require large models.
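Summary
Methods: The authors develop a soft prompt-based LLM model and compare four training strategies: (1) fine-tuning without prompts; (2) hard prompts with unfrozen LLMs; (3) soft prompts with unfrozen LLMs; and (4) soft prompts with frozen LLMs. Seven pretrained LLMs are evaluated on clinical concept and relation extraction using two benchmark datasets, and the transfer learning (cross-institution) and few-shot learning abilities of the prompt-based algorithms are also assessed. Results and conclusion: With unfrozen LLMs, GatorTron-3.9B with soft prompting achieves the best strict F1-scores for concept extraction (0.9118 and 0.8604), outperforming traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively. GatorTron-345M with soft prompting achieves the best F1-scores for end-to-end relation extraction (0.8332 and 0.7488), outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. With frozen LLMs, small (345 million parameter) models trail by a wide margin and must be scaled up to compete with unfrozen models. In cross-institution evaluation, soft prompting with a frozen GatorTron-8.9B model performs best. The study concludes that (1) machines can learn soft prompts better than humans; (2) frozen LLMs have better few-shot and transfer learning ability, which facilitates multi-institution applications; and (3) frozen LLMs require large models.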
Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering
results: On MUSIC-AVQA v2.0, the new baseline model surpasses existing benchmarks, improving accuracy by 2% and setting a new state of the art.
Abstract
In recent years, there has been a growing emphasis on the intersection of audio, vision, and text modalities, driving forward the advancements in multimodal research. However, strong bias that exists in any modality can lead to the model neglecting the others. Consequently, the model's ability to effectively reason across these diverse modalities is compromised, impeding further advancement. In this paper, we meticulously review each question type from the original dataset, selecting those with pronounced answer biases. To counter these biases, we gather complementary videos and questions, ensuring that no answer has a markedly skewed distribution. In particular, for binary questions, we strive to ensure that both answers are almost uniformly spread within each question category. As a result, we construct a new dataset, named MUSIC-AVQA v2.0, which is more challenging and which we believe could better foster the progress of the AVQA task. Furthermore, we present a novel baseline model that delves deeper into the audio-visual-text interrelation. On MUSIC-AVQA v2.0, this model surpasses all the existing benchmarks, improving accuracy by 2% and setting a new state-of-the-art performance.
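Summary
In recent years, research at the intersection of audio, vision, and text modalities has received growing attention and has driven progress in multimodal research. However, a strong bias in any one modality can lead a model to neglect the others, which limits its ability to reason across modalities and impedes further progress. The authors carefully review each question type in the original dataset and select those with pronounced answer biases. To correct these biases, they collect complementary videos and questions so that answers within each question type are close to uniformly distributed. The result is a new dataset, MUSIC-AVQA v2.0, which is more challenging than the original and which the authors believe will better foster progress on the AVQA task. They also propose a new baseline model that explores the audio-visual-text interrelation more deeply. On MUSIC-AVQA v2.0, this model surpasses all existing benchmarks, improving accuracy by 2% and setting a new state of the art.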
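The balancing goal described above (no answer dominating within a question category) can be checked with a few lines of counting. This is a minimal sketch assuming the data is available as (category, answer) pairs; the category names in the test and the 0.6 threshold are illustrative assumptions, not values from the paper.

```python
from collections import Counter, defaultdict

def answer_skew(samples):
    """Map each question category to the share of its most frequent answer.

    samples is an iterable of (category, answer) pairs. For a binary
    question type, 0.5 means perfectly balanced and 1.0 fully skewed.
    """
    counts = defaultdict(Counter)
    for category, answer in samples:
        counts[category][answer] += 1
    return {category: max(c.values()) / sum(c.values())
            for category, c in counts.items()}

def biased_categories(samples, threshold=0.6):
    """Return the categories whose dominant answer exceeds the threshold."""
    skew = answer_skew(samples)
    return sorted(cat for cat, share in skew.items() if share > threshold)
```

Categories flagged this way are the ones for which complementary questions with the under-represented answer would need to be collected.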
Evolution of Natural Language Processing Technology: Not Just Language Processing Towards General Purpose AI
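results: The report explains how deep-learning-based NLP realizes the "practice makes perfect" principle: models trained on vast amounts of text can perform the four arithmetic operations without explicit training, explain complex images, and generate images from explanatory text. It also gives examples of business applications.
Abstract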
Since the invention of computers, communication through natural language (actual human language) has been a dream technology. However, natural language is extremely difficult to mathematically formulate, making it difficult to realize as an algorithm without considering programming. While there have been numerous technological developments, one cannot say that any results allowing free utilization have been achieved thus far. In the case of language learning in humans, for instance when learning one's mother tongue or foreign language, one must admit that this process is similar to the adage "practice makes perfect" in principle, even though the learning method is significant up to a point. Deep learning has played a central role in contemporary AI technology in recent years. When applied to natural language processing (NLP), this produced unprecedented results. Achievements exceeding the initial predictions have been reported from the results of learning vast amounts of textual data using deep learning. For instance, four arithmetic operations could be performed without explicit learning, thereby enabling the explanation of complex images and the generation of images from corresponding explanatory texts. It is an accurate example of the learner embodying the concept of "practice makes perfect" by using vast amounts of textual data. This report provides a technological explanation of how cutting-edge NLP has made it possible to realize the "practice makes perfect" principle. Additionally, examples of how this can be applied to business are provided. We reported in June 2022 in Japanese on the NLP movement from late 2021 to early 2022. We would like to summarize this as a memorandum since this is just the initial movement leading to the current large language models (LLMs).
GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models
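results: GPT-4 answers 93% of agriculture exam questions correctly, outperforming earlier general-purpose models (88%), and in one experiment it obtained the highest performance when compared with human subjects. The results suggest that GPT-4 can offer valuable insights for agricultural education, assessment, and crop management.
Abstract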
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding across various domains, including healthcare and finance. For some tasks, LLMs achieve similar or better performance than trained human beings; it is therefore reasonable to employ human exams (e.g., certification tests) to assess the performance of LLMs. We present a comprehensive evaluation of popular LLMs, such as Llama 2 and GPT, on their ability to answer agriculture-related questions. In our evaluation, we also employ RAG (Retrieval-Augmented Generation) and ER (Ensemble Refinement) techniques, which combine information retrieval, generation capabilities, and prompting strategies to improve the LLMs' performance. To demonstrate the capabilities of LLMs, we selected agriculture exams and benchmark datasets from three of the largest agriculture producer countries: Brazil, India, and the USA. Our analysis highlights GPT-4's ability to achieve a passing score on exams to earn credits for renewing agronomist certifications, answering 93% of the questions correctly and outperforming earlier general-purpose models, which achieved 88% accuracy. In one of our experiments, GPT-4 obtained the highest performance when compared to human subjects. This performance suggests that GPT-4 could potentially pass major graduate education admission tests or even earn credits for renewing agronomy certificates. We also explore the models' capacity to address general agriculture-related questions and generate crop management guidelines for Brazilian and Indian farmers, utilizing robust datasets from the Brazilian Agency of Agriculture (Embrapa) and graduate program exams from India. The results suggest that GPT-4, ER, and RAG can contribute meaningfully to agricultural education, assessment, and crop management practice, offering valuable insights to farmers and agricultural professionals.
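Summary
Large language models (LLMs) have shown strong natural language understanding across many domains, including healthcare and finance. For some tasks they match or exceed trained humans, so human exams (for example, certification tests) are a reasonable way to assess them. The authors comprehensively evaluate popular LLMs, such as Llama 2 and GPT, on agriculture-related questions. The evaluation also uses RAG (Retrieval-Augmented Generation) and ER (Ensemble Refinement), which combine retrieval, generation, and prompting strategies to improve performance. Agriculture exams and benchmark datasets were selected from three of the largest agriculture producer countries: Brazil, India, and the USA. GPT-4 answers 93% of the exam questions correctly, compared with 88% for earlier general-purpose models, and in one experiment it outperformed human subjects, which suggests it could pass major graduate admission tests or earn credits for renewing agronomy certificates. The authors also explore the models' capacity to answer general agriculture questions and generate crop management guidelines, using robust datasets from the Brazilian Agency of Agriculture (Embrapa) and graduate program exams from India. The results suggest that GPT-4, ER, and RAG can contribute meaningfully to agricultural education, assessment, and crop management practice, offering valuable insights to farmers and agricultural professionals.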
paper_authors: Jingyang Xiang, Siqi Li, Jun Chen, Shipeng Bai, Yukai Ma, Guang Dai, Yong Liu
for: This paper aims to train a uniform 1$\times$N sparse structured network from scratch, which can overcome the problems of expensive training cost, memory access, sub-optimal model quality, and unbalanced workload across threads in existing sparse weight selection and fine-tuning methods.
methods: The proposed method, called Soft Uniform Block Pruning (SUBP), repeatedly allows pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process, making the model less dependent on pre-training and achieving balanced workload.
results: The paper shows that the proposed SUBP method consistently outperforms existing 1$\times$N and structured sparsity methods based on pre-trained models or training from scratch, as demonstrated by comprehensive experiments across various CNN architectures on ImageNet.
Abstract
The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1$\times$N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a \emph{Block Sparse Row} matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1$\times$N sparse weights based on dense pre-trained weights, leading to the problems such as expensive training cost and memory access, sub-optimal model quality, as well as unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel \emph{\textbf{S}oft \textbf{U}niform \textbf{B}lock \textbf{P}runing} (SUBP) approach to train a uniform 1$\times$N sparse structured network from scratch. Specifically, our approach tends to repeatedly allow pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training, reduces the model redundancy and the risk of pruning the important blocks permanently but also achieves balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1$\times$N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at \url{https://github.com/JingyangXiang/SUBP}.
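To make the 1$\times$N pattern and the "uniform" constraint concrete, here is a minimal sketch that builds a pruning mask in which every group of N consecutive output channels keeps the same number of 1$\times$N blocks. It ranks blocks by L1 norm, which is a simple stand-in criterion: SUBP itself uses block angular redundancy and importance sampling with regrowth during training, none of which is reproduced here. The function name and the flattened `weight[c_out][c_in]` layout are assumptions.

```python
def uniform_1xn_mask(weight, n, sparsity):
    """Build a 0/1 mask with 1xN block structure and uniform sparsity.

    weight: list of C_out rows, each a list of C_in floats (a conv kernel
    flattened over its input dimensions). A block is the n weights that
    n consecutive output channels hold at one input index. Every group of
    n output channels keeps the same number of blocks.
    """
    c_out, c_in = len(weight), len(weight[0])
    if c_out % n != 0:
        raise ValueError("C_out must be divisible by n")
    keep = max(1, round(c_in * (1.0 - sparsity)))
    mask = [[0] * c_in for _ in range(c_out)]
    for start in range(0, c_out, n):
        # L1 norm of each block in this group of n output channels.
        norms = [sum(abs(weight[start + r][j]) for r in range(n))
                 for j in range(c_in)]
        kept = sorted(range(c_in), key=lambda j: norms[j], reverse=True)[:keep]
        for j in kept:
            for r in range(n):
                mask[start + r][j] = 1
    return mask
```

Because each group retains the same number of blocks, threads that process different output-channel groups do the same amount of work, which is the balanced-workload property the abstract highlights.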