results: Compared with existing music generation models, JEN-1 Composer achieves higher music quality and stronger controllability, and supports advanced music creation built on user-provided styles and musical elements.
Abstract
With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation from scratch. However, finer-grained control over multi-track generation remains an open challenge. Existing models exhibit strong raw generation capability but lack the flexibility to compose separate tracks and combine them in a controllable manner, differing from typical workflows of human composers. To address this issue, we propose JEN-1 Composer, a unified framework to efficiently model marginal, conditional, and joint distributions over multi-track music via a single model. The JEN-1 Composer framework can seamlessly incorporate any diffusion-based music generation system, e.g., Jen-1, enhancing its capacity for versatile multi-track music generation. We introduce a curriculum training strategy aimed at incrementally instructing the model in the transition from single-track generation to the flexible generation of multi-track combinations. During inference, users can iteratively generate and select music tracks that meet their preferences, incrementally creating a complete musical composition following the proposed Human-AI co-composition workflow. Quantitative and qualitative assessments demonstrate state-of-the-art performance in controllable and high-fidelity multi-track music synthesis. The proposed JEN-1 Composer represents a significant advance toward interactive AI-facilitated music creation and composition. Demos will be available at https://jenmusic.ai/audio-demos.
Predicting recovery following stroke: deep learning, multimodal data and feature selection using explainable AI
paper_authors: Adam White, Margarita Saranti, Artur d’Avila Garcez, Thomas M. H. Hope, Cathy J. Price, Howard Bowman
for: This paper aims to use machine learning to automatically predict post-stroke symptoms and their response to rehabilitation.
methods: The paper uses two strategies: first, 2D images that summarise MRI scans; second, selecting key features to improve classification accuracy. It also introduces a novel method that fuses learning across MRI images and tabular data.
results: The results show that combining MRI images and tabular data achieves high post-stroke classification accuracy. Across different CNN architectures and data representations, the highest classification accuracy reached 0.854.
Abstract
Machine learning offers great potential for automated prediction of post-stroke symptoms and their response to rehabilitation. Major challenges for this endeavour include the very high dimensionality of neuroimaging data, the relatively small size of the datasets available for learning, and how to effectively combine neuroimaging and tabular data (e.g. demographic information and clinical characteristics). This paper evaluates several solutions based on two strategies. The first is to use 2D images that summarise MRI scans. The second is to select key features that improve classification accuracy. Additionally, we introduce the novel approach of training a convolutional neural network (CNN) on images that combine regions-of-interest extracted from MRIs with symbolic representations of tabular data. We evaluate a series of CNN architectures (both 2D and 3D) that are trained on different representations of MRI and tabular data, to predict whether a composite measure of post-stroke spoken picture description ability is in the aphasic or non-aphasic range. MRI and tabular data were acquired from 758 English-speaking stroke survivors who participated in the PLORAS study. The classification accuracy for a baseline logistic regression was 0.678 for lesion size alone, rising to 0.757 and 0.813 when initial symptom severity and recovery time were successively added. The highest classification accuracy, 0.854, was observed when 8 regions-of-interest were extracted from each MRI scan and combined with lesion size, initial severity and recovery time in a 2D Residual Neural Network. Our findings demonstrate how imaging and tabular data can be combined for high post-stroke classification accuracy, even when the dataset is small in machine learning terms. We conclude by proposing how the current models could be improved to achieve even higher levels of accuracy using images from hospital scanners.
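As a rough illustration of combining imaging and tabular inputs, the sketch below shows a generic late-fusion network: a small CNN branch for a 2D image and an MLP branch for tabular features (e.g., lesion size, initial severity, recovery time), concatenated before the classifier head. This is a common baseline pattern, not the paper's approach of rendering tabular data symbolically into the image; all shapes and names are illustrative.

```python
# Generic late-fusion of a 2D image branch and a tabular branch (illustrative).
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, n_tabular=3, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(  # tiny stand-in for a 2D ResNet on MRI summary images
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(nn.Linear(n_tabular, 16), nn.ReLU())
        self.head = nn.Linear(32 + 16, n_classes)  # aphasic vs. non-aphasic

    def forward(self, image, tabular):
        return self.head(torch.cat([self.cnn(image), self.mlp(tabular)], dim=1))

model = FusionNet()
image = torch.randn(8, 1, 64, 64)   # 2D images summarising MRI scans (fake data)
tabular = torch.randn(8, 3)         # lesion size, initial severity, recovery time
print(model(image, tabular).shape)  # torch.Size([8, 2])
```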
Rare Event Probability Learning by Normalizing Flows
paper_authors: Zhenggqi Gao, Dinghuai Zhang, Luca Daniel, Duane S. Boning
for: NOFIS is a method for estimating the probability of rare events, providing accurate estimates across diverse domains.
methods: NOFIS exploits the exact sampling capability of normalizing flows, learning a sequence of proposal distributions to achieve efficient estimation.
results: NOFIS performs strongly across multiple test cases, outperforming baseline methods and providing high-quality estimates.
Abstract
A rare event is defined by a low probability of occurrence. Accurate estimation of such small probabilities is of utmost importance across diverse domains. Conventional Monte Carlo methods are inefficient, demanding an exorbitant number of samples to achieve reliable estimates. Inspired by the exact sampling capabilities of normalizing flows, we revisit this challenge and propose normalizing flow assisted importance sampling, termed NOFIS. NOFIS first learns a sequence of proposal distributions associated with predefined nested subset events by minimizing KL divergence losses. Next, it estimates the rare event probability by utilizing importance sampling in conjunction with the last proposal. The efficacy of our NOFIS method is substantiated through comprehensive qualitative visualizations, affirming the optimality of the learned proposal distribution, as well as a series of quantitative experiments encompassing $10$ distinct test cases, which highlight NOFIS's superiority over baseline approaches.
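The estimator NOFIS builds on is classical importance sampling with a better proposal. The toy sketch below (plain NumPy/SciPy, not the paper's code) estimates P(X > 4) under a standard normal, with a shifted Gaussian standing in for the learned normalizing-flow proposal; at this sample size, naive Monte Carlo typically returns an estimate of zero.

```python
# Importance sampling for a rare event, with a hand-picked proposal standing in
# for the learned flow. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
threshold, n = 4.0, 100_000

# Naive Monte Carlo: the event {X > 4} is almost never hit.
naive = (rng.standard_normal(n) > threshold).mean()

# Importance sampling: draw from q = N(4, 1) and reweight by p(x)/q(x).
x = rng.normal(loc=threshold, scale=1.0, size=n)
weights = stats.norm.pdf(x) / stats.norm.pdf(x, loc=threshold, scale=1.0)
is_est = (weights * (x > threshold)).mean()

print(f"true value          : {stats.norm.sf(threshold):.3e}")
print(f"naive Monte Carlo   : {naive:.3e}")
print(f"importance sampling : {is_est:.3e}")
```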
Automaton Distillation: Neuro-Symbolic Transfer Learning for Deep Reinforcement Learning
results: Our experimental results show that both static and dynamic transfer reduce the time required to find an optimal policy, and both perform well across a range of decision tasks.
Abstract
Reinforcement learning (RL) is a powerful tool for finding optimal policies in sequential decision processes. However, deep RL methods suffer from two weaknesses: collecting the amount of agent experience required for practical RL problems is prohibitively expensive, and the learned policies exhibit poor generalization on tasks outside of the training distribution. To mitigate these issues, we introduce automaton distillation, a form of neuro-symbolic transfer learning in which Q-value estimates from a teacher are distilled into a low-dimensional representation in the form of an automaton. We then propose two methods for generating Q-value estimates: static transfer, which reasons over an abstract Markov Decision Process constructed based on prior knowledge, and dynamic transfer, where symbolic information is extracted from a teacher Deep Q-Network (DQN). The resulting Q-value estimates from either method are used to bootstrap learning in the target environment via a modified DQN loss function. We list several failure modes of existing automaton-based transfer methods and demonstrate that both static and dynamic automaton distillation decrease the time required to find optimal policies for various decision tasks.
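As a rough illustration of the bootstrapping step, the sketch below blends the usual TD target with a teacher's Q-value estimate inside a DQN-style loss. The shapes, the fixed blending weight w, and the toy network are assumptions for illustration; the paper's actual modified loss and automaton machinery are not reproduced here.

```python
# Sketch: a DQN loss whose target blends the TD target with teacher Q-values
# (automaton distillation in spirit; shapes, network, and the weight w are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def distilled_dqn_loss(s, a, r, s_next, done, teacher_q, gamma=0.99, w=0.5):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    # Bootstrap from the teacher: blend environment TD target with teacher estimate.
    target = (1.0 - w) * td_target + w * teacher_q
    return F.smooth_l1_loss(q_sa, target)

s, s_next = torch.randn(32, 4), torch.randn(32, 4)
a = torch.randint(0, 2, (32,))
r, done = torch.randn(32), torch.zeros(32)
teacher_q = torch.randn(32)  # e.g. Q-values read off the distilled automaton abstraction
loss = distilled_dqn_loss(s, a, r, s_next, done, teacher_q)
opt.zero_grad(); loss.backward(); opt.step()
```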
paper_authors: Elnaserledinellah Mahmood Abdelwahab
for: The paper challenges the assumptions of Modern Logics, particularly those of Frege, Russell, and Tarski, and their applications in formal languages.
methods: The paper uses undisputed principles of Arabic to falsify the Logicians' ideas and demonstrate the limitations of their approaches. It also utilizes the existence of "meaning-particles" in Arabic syntax to efficiently recognize words, phrases, and sentences.
results: The paper shows that the assumptions of Modern Logics contradict basic principles of Arabic, and that approaches based on these assumptions are not applicable to Arabic. It also presents a new way to approach the computational problem of Satisfiability (SAT), based on the realization that parsing Arabic utilizes the existence of "meaning-particles" within syntax. The paper provides practical evidence, obtained for multiplication circuits, supporting its claims.
Abstract
Modern Logics, as formulated notably by Frege, Russell and Tarski, involved basic assumptions about Natural Languages in general and Indo-European Languages in particular, which are contested by Linguists. Based upon those assumptions, formal Languages were designed to overcome what Logicians claimed to be 'defects' of Natural Language. In this paper we show that those assumptions contradict basic principles of Arabic. More specifically: the Logicians' ideas that within Natural Language words refer to objects, that 'ToBe'-constructions represent identity statements, that Indefinite Descriptions must be replaced by existential quantifiers to form meaningful Sentences, and that Symbols can have no interpretation-independent meanings, are all falsified using undisputed principles of Arabic. The falsification presented here serves two purposes. First, it is used as a factual basis for the rejection of approaches adopting Semantic axioms of Mathematical Logics as models for the meaning of Arabic Syntax. Second, it shows a way to approach the important computational problem: Satisfiability (SAT). The described way is based upon the realization that parsing Arabic utilizes the existence of 'meaning-particles' within Syntax to efficiently recognize words, phrases and Sentences. Similar meaning-particles are shown to exist in 3CNF formulas, which, when properly handled within the machinery of 3SAT-Solvers, enable structural conditions to be imposed on formulas, sufficient alone to guarantee the efficient production of non-exponentially sized Free Binary Decision Diagrams (FBDDs). We show why known exponential Lower Bounds on sizes of FBDDs do not contradict our results and reveal practical evidence, obtained for multiplication circuits, supporting our claims.
Dynamic V2X Autonomous Perception from Road-to-Vehicle Vision
paper_authors: Jiayao Tan, Fan Lyu, Linyan Li, Fuyuan Hu, Tingliang Feng, Fenglei Xu, Rui Yao
for: To improve the safety and reliability of autonomous driving systems and adapt to dynamic scenes.
methods: Builds road-to-vehicle vision from roadside perception and proposes the Adaptive Road-to-Vehicle Perception (AR2VP) method.
results: On 3D object detection and segmentation tasks, AR2VP strikes an excellent balance between performance and bandwidth while keeping the model adaptable in dynamic environments.
Abstract
Vehicle-to-everything (V2X) perception is an innovative technology that enhances vehicle perception accuracy, thereby elevating the security and reliability of autonomous systems. However, existing V2X perception methods focus on static scenes from mainly vehicle-based vision, which is constrained by sensor capabilities and communication loads. To adapt V2X perception models to dynamic scenes, we propose to build V2X perception from road-to-vehicle vision and present the Adaptive Road-to-Vehicle Perception (AR2VP) method. In AR2VP, we leverage roadside units to offer stable, wide-range sensing capabilities and serve as communication hubs. AR2VP is devised to tackle both intra-scene and inter-scene changes. For the former, we construct a dynamic perception representing module, which efficiently integrates vehicle perceptions, enabling vehicles to capture a more comprehensive range of dynamic factors within the scene. Moreover, we introduce a road-to-vehicle perception compensating module, aimed at preserving the maximized roadside unit perception information in the presence of intra-scene changes. For inter-scene changes, we implement an experience replay mechanism leveraging the roadside unit's storage capacity to retain a subset of historical scene data, maintaining model robustness in response to inter-scene shifts. We conduct perception experiments on 3D object detection and segmentation, and the results show that AR2VP excels in both performance-bandwidth trade-offs and adaptability within dynamic environments.
results: CACTUS achieves significant improvements in accuracy, latency, and compute budget across a variety of datasets and IoT platforms.
Abstract
While existing strategies for optimizing deep learning-based classification models on low-power platforms assume the models are trained on all classes of interest, this paper posits that adopting context-awareness, i.e. focusing solely on the likely classes in the current context, can substantially enhance performance in resource-constrained environments. We propose a new paradigm, CACTUS, for scalable and efficient context-aware classification where a micro-classifier recognizes a small set of classes relevant to the current context and, when context change happens, rapidly switches to another suitable micro-classifier. CACTUS has several innovations including optimizing the training cost of context-aware classifiers, enabling on-the-fly context-aware switching between classifiers, and selecting the best context-aware classifiers given limited resources. We show that CACTUS achieves significant benefits in accuracy, latency, and compute budget across a range of datasets and IoT platforms.
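In spirit, the core mechanism is a lookup from the current context to a small classifier trained only on that context's likely classes. The toy sketch below (scikit-learn on synthetic data) illustrates that switching logic only; it is not the CACTUS system, and the contexts and classes are invented.

```python
# Toy context-aware micro-classifier switching (invented contexts/classes).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each context only ever sees a small subset of the full label space.
contexts = {"indoor": [0, 1], "outdoor": [2, 3]}
micro_classifiers = {}
for ctx, classes in contexts.items():
    X = rng.random((200, 8))                       # fake features per context
    y = rng.choice(classes, size=200)
    micro_classifiers[ctx] = LogisticRegression(max_iter=1000).fit(X, y)

def classify(x, current_ctx):
    # On a context change, swap to the matching micro-classifier.
    return int(micro_classifiers[current_ctx].predict(x.reshape(1, -1))[0])

print(classify(rng.random(8), "indoor"))   # predicts only among classes 0/1
print(classify(rng.random(8), "outdoor"))  # predicts only among classes 2/3
```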
Dynamic Task and Weight Prioritization Curriculum Learning for Multimodal Imagery
paper_authors: Huseyin Fuat Alsan, Taner Arsan
for: This paper explores post-disaster analytics with multimodal deep learning models, using curriculum learning to improve model performance.
methods: It proposes a curriculum learning strategy that trains deep learning models on data of increasing complexity. A U-Net is used for semantic segmentation and image encoding, and a custom text classifier for visual question answering.
results: The results show that the DATWEP method helps improve the visual question answering performance of multimodal deep learning models. Source code is available at https://github.com/fualsan/DATWEP.
Abstract
This paper explores post-disaster analytics using multimodal deep learning models trained with curriculum learning method. Studying post-disaster analytics is important as it plays a crucial role in mitigating the impact of disasters by providing timely and accurate insights into the extent of damage and the allocation of resources. We propose a curriculum learning strategy to enhance the performance of multimodal deep learning models. Curriculum learning emulates the progressive learning sequence in human education by training deep learning models on increasingly complex data. Our primary objective is to develop a curriculum-trained multimodal deep learning model, with a particular focus on visual question answering (VQA) capable of jointly processing image and text data, in conjunction with semantic segmentation for disaster analytics using the FloodNet dataset (https://github.com/BinaLab/FloodNet-Challenge-EARTHVISION2021). To achieve this, U-Net model is used for semantic segmentation and image encoding. A custom built text classifier is used for visual question answering. Existing curriculum learning methods rely on manually defined difficulty functions. We introduce a novel curriculum learning approach termed Dynamic Task and Weight Prioritization (DATWEP), which leverages a gradient-based method to automatically decide task difficulty during curriculum learning training, thereby eliminating the need for explicit difficulty computation. The integration of DATWEP into our multimodal model shows improvement on VQA performance. Source code is available at https://github.com/fualsan/DATWEP.
Web3 Meets AI Marketplace: Exploring Opportunities, Analyzing Challenges, and Suggesting Solutions
results: This paper proposes a solution for the rapid growth of the AI marketplace in the Web3 space and opens up new business opportunities in the field.
Abstract
Web3 and AI have been among the most discussed fields over the recent years, with substantial hype surrounding each field's potential to transform the world as we know it. However, as the hype settles, it's evident that neither AI nor Web3 can address all challenges independently. Consequently, the intersection of AI and Web3 is gaining increased attention, emerging as a new field with the potential to address the limitations of each. In this article, we will focus on the integration of Web3 and the AI marketplace, where AI services and products can be provided in a decentralized manner (DeAI). A comprehensive review is provided by summarizing the opportunities and challenges on this topic. Additionally, we offer analyses and solutions to address these challenges. We've developed a framework that lets users pay with any kind of cryptocurrency to get AI services. Additionally, they can also enjoy AI services for free on our platform by simply locking up their assets temporarily in the protocol. This unique approach is a first in the industry. Before this, offering free AI services in the Web3 community wasn't possible. Our solution opens up exciting opportunities for the AI marketplace in the Web3 space to grow and be widely adopted.
Roles of Scaling and Instruction Tuning in Language Perception: Model vs. Human Attention
results: The results show that scaling enhances resemblance to human reading attention and improves effective attention by reducing reliance on trivial patterns, while instruction tuning does not affect language perception in this way. In addition, current LLMs show a consistent shortfall in attention: they are closer to non-native than native speakers.
Abstract
Recent large language models (LLMs) have revealed strong abilities to understand natural language. Since most of them share the same basic structure, i.e. the transformer block, possible contributors to their success in the training process are scaling and instruction tuning. However, how these factors affect the models' language perception is unclear. This work compares the self-attention of several existing LLMs (LLaMA, Alpaca and Vicuna) in different sizes (7B, 13B, 30B, 65B), together with eye saccade, an aspect of human reading attention, to assess the effect of scaling and instruction tuning on language perception. Results show that scaling enhances the human resemblance and improves the effective attention by reducing the trivial pattern reliance, while instruction tuning does not. However, instruction tuning significantly enhances the models' sensitivity to instructions. We also find that current LLMs are consistently closer to non-native than native speakers in attention, suggesting a sub-optimal language perception of all models. Our code and data used in the analysis is available on GitHub.
results: The results show that "Bespoke solvers" substantially improve generation quality while requiring only about 1% of the GPU time needed to train the original model and just 80 learnable parameters.
Abstract
Diffusion or flow-based models are powerful generative paradigms that are notoriously hard to sample as samples are defined as solutions to high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs) which require a large Number of Function Evaluations (NFE) to approximate well. Existing methods to alleviate the costly sampling process include model distillation and designing dedicated ODE solvers. However, distillation is costly to train and sometimes can deteriorate quality, while dedicated solvers still require relatively large NFE to produce high quality samples. In this paper we introduce "Bespoke solvers", a novel framework for constructing custom ODE solvers tailored to the ODE of a given pre-trained flow model. Our approach optimizes an order consistent and parameter-efficient solver (e.g., with 80 learnable parameters), is trained for roughly 1% of the GPU time required for training the pre-trained model, and significantly improves approximation and generation quality compared to dedicated solvers. For example, a Bespoke solver for a CIFAR10 model produces samples with Fréchet Inception Distance (FID) of 2.73 with 10 NFE, and gets to 1% of the Ground Truth (GT) FID (2.59) for this model with only 20 NFE. On the more challenging ImageNet-64×64, Bespoke samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 NFE.
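To make the "custom solver with a handful of learnable parameters" idea concrete, the sketch below distills a 5-step solver with learnable step sizes and per-step scales against a fine-grained Euler reference on a toy 1D velocity field. The parameterization, loss, and field are illustrative assumptions, not the paper's construction.

```python
# Sketch: distilling a few-step "bespoke" solver (learnable step sizes and scales)
# against a fine-grained Euler reference on a toy 1D flow. Illustrative only.
import math
import torch

def velocity(x, t):  # toy velocity field standing in for a pre-trained flow model
    return -x + torch.sin(5.0 * t)

def solve_reference(x0, n_steps=100):  # fine-grained Euler as the "ground truth" path
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, torch.tensor(i * dt))
    return x

n_bespoke = 5
log_dt = torch.full((n_bespoke,), math.log(1.0 / n_bespoke), requires_grad=True)
scale = torch.ones(n_bespoke, requires_grad=True)  # plain-Euler initialization
opt = torch.optim.Adam([log_dt, scale], lr=1e-2)

for step in range(500):
    x0 = torch.randn(256)
    target = solve_reference(x0)
    x, t = x0, torch.tensor(0.0)
    for i in range(n_bespoke):  # the 5-step bespoke solver
        dt = log_dt[i].exp()
        x = x + dt * scale[i] * velocity(x, t)
        t = t + dt
    loss = ((x - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"distillation loss after training: {loss.item():.4f}")
```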
Gauge-optimal approximate learning for small data classification problems
for: The paper aims to address small data learning problems, where there is a significant discrepancy between the limited number of response variable observations and the large feature space dimension.
methods: The paper proposes a new method, the Gauge-Optimal Approximate Learning (GOAL) algorithm, which reduces and rotates the feature space and provides an analytically tractable joint solution to the dimension reduction, feature segmentation, and classification problems for small data learning.
results: Experimental results show that the proposed GOAL algorithm outperforms the reported best competitors on these problems in both learning performance and computational cost, classifying the data accurately and more efficiently than other methods.
Abstract
Small data learning problems are characterized by a significant discrepancy between the limited amount of response variable observations and the large feature space dimension. In this setting, the common learning tools struggle to identify the features important for the classification task from those that bear no relevant information, and cannot derive an appropriate learning rule which allows discrimination between different classes. As a potential solution to this problem, here we exploit the idea of reducing and rotating the feature space in a lower-dimensional gauge and propose the Gauge-Optimal Approximate Learning (GOAL) algorithm, which provides an analytically tractable joint solution to the dimension reduction, feature segmentation and classification problems for small data learning problems. We prove that the optimal solution of the GOAL algorithm consists of piecewise-linear functions in the Euclidean space, and that it can be approximated through a monotonically convergent algorithm which, under the assumption of a discrete segmentation of the feature space, presents a closed-form solution for each optimization substep and an overall linear iteration cost scaling. The GOAL algorithm has been compared to other state-of-the-art machine learning (ML) tools on both synthetic data and challenging real-world applications from climate science and bioinformatics (i.e., prediction of the El Nino Southern Oscillation and inference of epigenetically-induced gene-activity networks from limited experimental data). The experimental results show that the proposed algorithm outperforms the reported best competitors for these problems both in learning performance and computational cost.
results: By calibrating the multimodal perception system, the researchers improved ball detection accuracy and robot reaction speed, and introduced a novel spin estimation approach to improve spin accuracy. Finally, they demonstrated accurate ball detection by combining an event-based camera with a Spiking Neural Network.
Abstract
In recent years, robotic table tennis has become a popular research challenge for perception and robot control. Here, we present an improved table tennis robot system with high accuracy vision detection and fast robot reaction. Based on previous work, our system contains a KUKA robot arm with 6 DOF, with four frame-based cameras and two additional event-based cameras. We developed a novel calibration approach to calibrate this multimodal perception system. For table tennis, spin estimation is crucial. Therefore, we introduced a novel, and more accurate spin estimation approach. Finally, we show how combining the output of an event-based camera and a Spiking Neural Network (SNN) can be used for accurate ball detection.
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
results: TESTA reduces the number of video tokens by aggregating them, making video encoding more efficient. In experiments on paragraph-to-video retrieval and long-form VideoQA tasks across five datasets, TESTA improves computing efficiency by 1.7 times and benefits from processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
Abstract
Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
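One minimal way to picture token aggregation is to greedily merge the most cosine-similar pairs of tokens into weighted averages until a target count is reached, as in the sketch below. This generic merge rule is an assumption for illustration; TESTA's temporal-spatial aggregation modules inside each encoder block are more structured.

```python
# Greedy similarity-based token merging (a generic stand-in for TESTA's aggregation).
import torch
import torch.nn.functional as F

def aggregate_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D) frame or patch tokens; returns (keep, D) merged tokens."""
    tokens = tokens.clone()
    weights = torch.ones(tokens.size(0))  # how many originals each token represents
    while tokens.size(0) > keep:
        x = F.normalize(tokens, dim=-1)
        sim = x @ x.T
        sim.fill_diagonal_(-1.0)          # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1))
        w_i, w_j = weights[i], weights[j]
        tokens[i] = (w_i * tokens[i] + w_j * tokens[j]) / (w_i + w_j)
        weights[i] = w_i + w_j
        mask = torch.ones(tokens.size(0), dtype=torch.bool)
        mask[j] = False                   # drop the merged-away token
        tokens, weights = tokens[mask], weights[mask]
    return tokens

video_tokens = torch.randn(64, 768)       # e.g. 64 frame tokens, 768-dim
print(aggregate_tokens(video_tokens, keep=16).shape)  # torch.Size([16, 768])
```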
A Unique Training Strategy to Enhance Language Models Capabilities for Health Mention Detection from Social Media Content
paper_authors: Pervaiz Iqbal Khan, Muhammad Nabeel Asim, Andreas Dengel, Sheraz Ahmed
for: To extract health-related content from social media, for applications such as modeling disease spread and assessing the impact of different drugs on different diseases.
methods: Trains language models with random weighted perturbation and contrastive learning strategies so that they learn generalized patterns from social media text.
results: Proposes a meta predictor based on multiple language models that classifies social media text into non-health and health-related classes, achieving an F1-score improvement of up to 3.87% on three public benchmark datasets and outperforming existing health mention classification predictors.
Abstract
An ever-increasing amount of social media content requires advanced AI-based computer programs capable of extracting useful information. Specifically, the extraction of health-related content from social media is useful for the development of diverse types of applications including disease spread, mortality rate prediction, and finding the impact of diverse types of drugs on diverse types of diseases. Language models are competent in extracting the syntax and semantics of text. However, they have a hard time extracting similar patterns from social media texts. The primary reason for this shortfall lies in the non-standardized writing style commonly employed by social media users. Given the need for an optimal language model competent in extracting useful patterns from social media text, the key goal of this paper is to train language models in such a way that they learn to derive generalized patterns. The key goal is achieved through the incorporation of random weighted perturbation and contrastive learning strategies. On top of a unique training strategy, a meta predictor is proposed that reaps the benefits of 5 different language models for discriminating posts of social media text into non-health and health-related classes. Comprehensive experimentation across 3 public benchmark datasets reveals that the proposed training strategy improves the performance of the language models by up to 3.87%, in terms of F1-score, as compared to their performance with traditional training. Furthermore, the proposed meta predictor outperforms existing health mention classification predictors across all 3 benchmark datasets.
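The random weighted perturbation component can be sketched as adding Gaussian noise to the weights for the forward and backward pass and restoring them before the optimizer step, so training optimizes a smoothed loss surface. The noise magnitude and the toy model below are assumptions, not the paper's settings.

```python
# Random weighted perturbation as a training-time regularizer (illustrative).
import torch
import torch.nn.functional as F

def perturbed_step(model, loss_fn, batch, optimizer, sigma=0.01):
    noises = []
    for p in model.parameters():                  # perturb weights in place
        n = sigma * torch.randn_like(p)
        p.data.add_(n)
        noises.append(n)
    loss = loss_fn(model, batch)                  # loss at the perturbed weights
    optimizer.zero_grad()
    loss.backward()
    for p, n in zip(model.parameters(), noises):  # restore original weights
        p.data.sub_(n)
    optimizer.step()                              # step with the perturbed gradient
    return loss.item()

model = torch.nn.Linear(10, 2)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
batch = (torch.randn(16, 10), torch.randint(0, 2, (16,)))
loss_fn = lambda m, b: F.cross_entropy(m(b[0]), b[1])
print(perturbed_step(model, loss_fn, batch, optim))
```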
MILL: Mutual Verification with Large Language Models for Zero-Shot Query Expansion
results: Extensive experiments on three information retrieval datasets show improved query expansion performance compared with other baselines.
Abstract
Query expansion is a commonly-used technique in many search systems to better represent users' information needs with additional query terms. Existing studies for this task usually propose to expand a query with retrieved or generated contextual documents. However, both types of methods have clear limitations. For retrieval-based methods, the documents retrieved with the original query might not be accurate enough to reveal the search intent, especially when the query is brief or ambiguous. For generation-based methods, existing models can hardly be trained or aligned on a particular corpus, due to the lack of corpus-specific labeled data. In this paper, we propose a novel Large Language Model (LLM) based mutual verification framework for query expansion, which alleviates the aforementioned limitations. Specifically, we first design a query-query-document generation pipeline, which can effectively leverage the contextual knowledge encoded in LLMs to generate sub-queries and corresponding documents from multiple perspectives. Next, we employ a mutual verification method for both generated and retrieved contextual documents, where 1) retrieved documents are filtered with the external contextual knowledge in generated documents, and 2) generated documents are filtered with the corpus-specific knowledge in retrieved documents. Overall, the proposed method allows retrieved and generated documents to complement each other to finalize a better query expansion. We conduct extensive experiments on three information retrieval datasets, i.e., TREC-DL-2020, TREC-COVID, and MSMARCO. The results demonstrate that our method outperforms other baselines significantly.
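The mutual verification step can be illustrated with a toy scorer: retrieved documents are ranked by their best agreement with the generated documents, and generated documents by their best agreement with the retrieved ones, keeping the top-k of each. The Jaccard word overlap below is a stand-in for real relevance scoring, and all documents are fabricated examples.

```python
# Toy mutual verification between generated and retrieved documents.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def mutual_verify(generated, retrieved, k=2):
    # Keep retrieved docs best supported by generated context, and vice versa.
    top_r = sorted(retrieved, key=lambda r: max(jaccard(r, g) for g in generated), reverse=True)
    top_g = sorted(generated, key=lambda g: max(jaccard(g, r) for r in retrieved), reverse=True)
    return top_g[:k], top_r[:k]

generated = ["covid vaccine efficacy trials", "mrna vaccine side effects"]
retrieved = ["report on vaccine efficacy", "unrelated sports news", "mrna booster study"]
print(mutual_verify(generated, retrieved))
```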
Exploring the Emotional Landscape of Music: An Analysis of Valence Trends and Genre Variations in Spotify Music Data
results: The study uncovers patterns in music-emotion relationships, including changes over time and mood transitions. These findings deepen the understanding of the relationship between music and emotion and support long-term exploration of musical affect.
Abstract
This paper conducts an intricate analysis of musical emotions and trends using Spotify music data, encompassing audio features and valence scores extracted through the Spotipy API. Employing regression modeling, temporal analysis, mood transitions, and genre investigation, the study uncovers patterns within music-emotion relationships. Linear, support vector, random forest, and ridge regression models are employed to predict valence scores. Temporal analysis reveals shifts in valence distribution over time, while mood transition exploration illuminates emotional dynamics within playlists. The research contributes nuanced insights into music's emotional fabric, enhancing comprehension of the interplay between music and emotions over the years.
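As a small illustration of the regression setup, the sketch below fits the four named regressor families to synthetic audio-feature data to predict valence; the data and feature names are fabricated stand-ins for the Spotify features.

```python
# Fitting the four regressor families named above on synthetic valence data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 4))  # pretend columns: danceability, energy, tempo, acousticness
valence = 0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.2 * X[:, 3] + 0.05 * rng.standard_normal(500)

models = {
    "linear": LinearRegression(),
    "support vector": SVR(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "ridge": Ridge(alpha=1.0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, valence, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```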
TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
results: According to the experimental results, the TeacherLM-7.1B model achieves a zero-shot score of 52.3 on MMLU, surpassing most models with over 100B parameters. Moreover, based on TeacherLM-7.1B, we augmented 58 NLP datasets and taught OPT and BLOOM series student models of various sizes in a multi-task setting. The experiments show that the data augmentation provided by TeacherLM brings significant improvements to the student models.
Abstract
Large Language Models (LLMs) exhibit impressive reasoning and data augmentation capabilities in various NLP tasks. However, what about small models? In this work, we propose TeacherLM-7.1B, capable of annotating relevant fundamentals, chain of thought, and common mistakes for most NLP samples, which makes annotation more than just an answer, thus allowing other models to learn "why" instead of just "what". The TeacherLM-7.1B model achieved a zero-shot score of 52.3 on MMLU, surpassing most models with over 100B parameters. Even more remarkable is its data augmentation ability. Based on TeacherLM-7.1B, we augmented 58 NLP datasets and taught various student models with different parameters from OPT and BLOOM series in a multi-task setting. The experimental results indicate that the data augmentation provided by TeacherLM has brought significant benefits. We will release the TeacherLM series of models and augmented datasets as open-source.
Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation
paper_authors: Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
for: This paper studies weakly open-vocabulary semantic segmentation (WOVSS), i.e., learning to segment objects of arbitrary classes using only image-text pairs.
methods: Existing works enhance the vanilla vision transformer by introducing explicit grouping recognition, e.g., employing several group tokens/centroids to cluster image tokens and perform group-level text alignment.
results: The proposed method reduces the granularity inconsistency of the group tokens and applies multi-modal regularization at different levels, improving segmentation capability and accuracy. Experimental results show state-of-the-art performance on several benchmark datasets.
Abstract
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS), which learns to segment objects of arbitrary classes using mere image-text pairs. Existing works turn to enhance the vanilla vision transformer by introducing explicit grouping recognition, i.e., employing several group tokens/centroids to cluster the image tokens and perform the group-text alignment. Nevertheless, these methods suffer from a granularity inconsistency regarding the usage of group tokens, which are aligned in the all-to-one vs. one-to-one manners during the training and inference phases, respectively. We argue that this discrepancy arises from the lack of elaborate supervision for each group token. To bridge this granularity gap, this paper explores explicit supervision for the group tokens from the prototypical knowledge. To this end, this paper proposes the non-learnable prototypical regularization (NPR) where non-learnable prototypes are estimated from source features to serve as supervision and enable contrastive matching of the group tokens. This regularization encourages the group tokens to segment objects with less redundancy and capture more comprehensive semantic regions, leading to increased compactness and richness. Based on NPR, we propose the prototypical guidance segmentation network (PGSeg) that incorporates multi-modal regularization by leveraging prototypical sources from both images and texts at different levels, progressively enhancing the segmentation capability with diverse prototypical patterns. Experimental results show that our proposed method achieves state-of-the-art performance on several benchmark datasets. The source code is available at https://github.com/Ferenas/PGSeg.
AMIR: Automated MisInformation Rebuttal – A COVID-19 Vaccination Datasets based Recommendation System
results: The study shows that this approach enables fast and efficient rebuttal of misinformation, and that it can be extended to other social media platforms and other kinds of misinformation.
Abstract
Misinformation has emerged as a major societal threat in recent years; specifically, in the context of the COVID-19 pandemic, it has wreaked havoc, for instance, by fuelling vaccine hesitancy. Cost-effective, scalable solutions for combating misinformation are the need of the hour. This work explored how existing information obtained from social media and augmented with more curated fact-checked data repositories can be harnessed to facilitate automated rebuttal of misinformation at scale. While the ideas herein can be generalized and reapplied in the broader context of misinformation mitigation using a multitude of information sources and catering to the spectrum of social media platforms, this work serves as a proof of concept, and as such, it is confined in its scope to only rebuttal of tweets, and in the specific context of misinformation regarding COVID-19. It leverages two publicly available datasets, viz. FaCov (fact-checked articles) and misleading (social media Twitter) data on COVID-19 Vaccination.
Bipartite Graph Pre-training for Unsupervised Extractive Summarization with Graph Convolutional Auto-Encoders
results: Our method performs strongly on downstream tasks, surpassing sentence representations based on BERT or RoBERTa.
Abstract
Pre-trained sentence representations are crucial for identifying significant sentences in unsupervised document extractive summarization. However, the traditional two-step paradigm of pre-training and sentence-ranking creates a gap due to differing optimization objectives. To address this issue, we argue that utilizing pre-trained embeddings derived from a process specifically designed to optimize cohesive and distinctive sentence representations helps rank significant sentences. To do so, we propose a novel graph pre-training auto-encoder to obtain sentence embeddings by explicitly modelling intra-sentential distinctive features and inter-sentential cohesive features through sentence-word bipartite graphs. These pre-trained sentence representations are then utilized in a graph-based ranking algorithm for unsupervised summarization. Our method delivers leading performance for unsupervised summarization frameworks by providing summary-worthy sentence representations. It surpasses heavy BERT- or RoBERTa-based sentence representations in downstream tasks.
NP-SBFL: Bridging the Gap Between Spectrum-Based Fault Localization and Faulty Neural Pathways Diagnosis
results: On two widely used datasets, MNIST and CIFAR-10, and with three suspicious-neuron measures, Tarantula, Ochiai, and Barinel, our method identifies faulty pathways and synthesizes adversarial inputs more effectively than the baselines. In particular, with Tarantula, NP-SBFL-MGA achieves a fault detection rate of 96.75%, surpassing DeepFault on Ochiai (89.90%) and NP-SBFL-GA on Ochiai (60.61%).
Abstract
Deep learning has revolutionized various real-world applications, but the quality of Deep Neural Networks (DNNs) remains a concern. DNNs are complex and have millions of parameters, making it difficult to determine their contributions to fulfilling a task. Moreover, the behavior of a DNN is highly influenced by the data used during training, making it challenging to collect enough data to exercise all potential DNN behavior under all possible scenarios. This paper proposes a novel NP-SBFL method that adapts spectrum-based fault localization (SBFL) to locate faulty neural pathways. Our method identifies critical neurons using the layer-wise relevance propagation (LRP) technique and determines which critical neurons are faulty. We propose a multi-stage gradient ascent (MGA), an extension of gradient ascent, to effectively activate a sequence of neurons one at a time while maintaining the activation of previous neurons. We evaluated the effectiveness of our method on two commonly used datasets, MNIST and CIFAR-10, two baselines DeepFault and NP-SBFL-GA, and three suspicious neuron measures, Tarantula, Ochiai, and Barinel. The empirical results showed that NP-SBFL-MGA is statistically more effective than the baselines at identifying suspicious paths and synthesizing adversarial inputs. Particularly, Tarantula on NP-SBFL-MGA had the highest fault detection rate at 96.75%, surpassing DeepFault on Ochiai (89.90%) and NP-SBFL-GA on Ochiai (60.61%). Our approach also yielded comparable results to the baselines in synthesizing naturalness inputs, and we found a positive correlation between the coverage of critical paths and the number of failed tests in DNN fault localization.
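For reference, two of the suspiciousness measures named above have standard closed forms when applied to activation spectra: Tarantula compares a neuron's failing-coverage ratio against its passing-coverage ratio, while Ochiai is a geometric-mean variant. The sketch below computes both on toy per-neuron counts; the full NP-SBFL pipeline (LRP-based critical-neuron selection and multi-stage gradient ascent) is not shown.

```python
# Standard SBFL suspiciousness measures on toy neuron activation spectra.
import math

def tarantula(ef, ep, total_f, total_p):
    # ef / ep: failing / passing executions in which the neuron was active.
    fail_ratio = ef / total_f if total_f else 0.0
    pass_ratio = ep / total_p if total_p else 0.0
    return fail_ratio / (fail_ratio + pass_ratio) if (fail_ratio + pass_ratio) else 0.0

def ochiai(ef, ep, total_f, total_p):
    denom = math.sqrt(total_f * (ef + ep))
    return ef / denom if denom else 0.0

spectra = {"neuron_1": (9, 1), "neuron_2": (5, 5), "neuron_3": (1, 9)}
total_f, total_p = 10, 10  # total failing / passing executions
for name, (ef, ep) in spectra.items():
    print(name,
          f"tarantula={tarantula(ef, ep, total_f, total_p):.3f}",
          f"ochiai={ochiai(ef, ep, total_f, total_p):.3f}")
```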
DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding
results: The paper introduces the new DCQA dataset, covering 6 chart styles and containing 699,051 questions that demand strong reasoning ability and common-sense understanding. It also presents a question-generation engine that uses table data, a rich color set, and basic question templates to produce large numbers of reasoning question-answer pairs.
Abstract
Visually-situated languages such as charts and plots are omnipresent in real-world documents. These graphical depictions are human-readable and are often analyzed in visually-rich documents to address a variety of questions that necessitate complex reasoning and common-sense responses. Despite the growing number of datasets that aim to answer questions over charts, most only address this task in isolation, without considering the broader context of document-level question answering. Moreover, such datasets lack adequate common-sense reasoning information in their questions. In this work, we introduce a novel task named document-level chart question answering (DCQA). The goal of this task is to conduct document-level question answering, extracting charts or plots in the document via document layout analysis (DLA) first and subsequently performing chart question answering (CQA). The newly developed benchmark dataset comprises 50,010 synthetic documents integrating charts in a wide range of styles (6 styles in contrast to 3 for PlotQA and ChartQA) and includes 699,051 questions that demand a high degree of reasoning ability and common-sense understanding. In addition, we present the development of a potent question-answer generation engine that employs table data, a rich color set, and basic question templates to produce a vast array of reasoning question-answer pairs automatically. Based on DCQA, we devise an OCR-free transformer for document-level chart-oriented understanding, capable of DLA and answering complex reasoning and common-sense questions over charts in an OCR-free manner. Our DCQA dataset is expected to foster research on understanding visualizations in documents, especially for scenarios that require complex reasoning for charts in the visually-rich document. We implement and evaluate a set of baselines, and our proposed method achieves comparable results.
results: Initial results indicate that LLMs often fail to understand and follow local customs when it comes to social norms from the non-Western world.
Abstract
Etiquettes are an essential ingredient of day-to-day interactions among people. Moreover, etiquettes are region-specific, and etiquettes in one region might contradict those in other regions. In this paper, we propose EtiCor, an Etiquettes Corpus, having texts about social norms from five different regions across the globe. The corpus provides a test bed for evaluating LLMs for knowledge and understanding of region-specific etiquettes. Additionally, we propose the task of Etiquette Sensitivity. We experiment with state-of-the-art LLMs (Delphi, Falcon40B, and GPT-3.5). Initial results indicate that LLMs mostly fail to understand etiquettes from regions of the non-Western world.
Analyzing Vision Transformers for Image Classification in Class Embedding Space
results: The study finds that image tokens develop class-specific representations across the layer hierarchy, shaped by attention mechanisms and contextual information. The method can also identify the parts of an image that are most important for detecting the class of interest, and it shows significant advantages over traditional linear probing approaches.
Abstract
Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class embedding space to uncover how these networks build categorical representations for their predictions. We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information, and give insights on how self-attention and MLP layers differentially contribute to this categorical composition. We additionally demonstrate that this method (1) can be used to determine the parts of an image that would be important for detecting the class of interest, and (2) exhibits significant advantages over traditional linear probing approaches. Taken together, our results position our proposed framework as a powerful tool for mechanistic interpretability and explainability research.
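The projection idea is similar in spirit to a logit-lens probe: intermediate token representations are pushed through the model's final norm and classification head so they can be read in class-embedding space. The sketch below does this for a timm ViT using forward hooks; the specific model and the untrained weights are illustrative assumptions, not the paper's exact procedure.

```python
# Logit-lens-style probe: project intermediate ViT tokens through the final
# norm + classification head (model choice and random weights are illustrative).
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=1000).eval()
feats = {}
for i, blk in enumerate(model.blocks):
    blk.register_forward_hook(lambda mod, inp, out, i=i: feats.__setitem__(i, out))

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)
    for layer in sorted(feats):
        tokens = feats[layer]                     # (1, 1 + 196, 768): CLS + patch tokens
        logits = model.head(model.norm(tokens))   # read tokens in class-embedding space
        cls_pred = logits[0, 0].argmax().item()
        agree = (logits[0, 1:].argmax(-1) == cls_pred).float().mean().item()
        print(f"layer {layer:2d}: CLS top class {cls_pred}, {agree:.0%} of patches agree")
```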
Spacecraft Autonomous Decision-Planning for Collision Avoidance: a Reinforcement Learning Approach
paper_authors: Nicolas Bourriez, Adrien Loizeau, Adam F. Abdin
for: The paper proposes implementing autonomous collision avoidance decision-making capabilities on board spacecraft using reinforcement learning techniques.
methods: The proposed methodology is based on a partially observable Markov decision process (POMDP) framework, which accounts for epistemic and aleatory uncertainties and allows the AI system on board the spacecraft to learn stochastic policies for accurate collision avoidance maneuvers.
results: The objective is to successfully delegate the decision-making process for autonomously implementing a collision avoidance maneuver to the spacecraft without human intervention, allowing a faster decision-making response and highly decentralized operations.
Abstract
The space environment around the Earth is becoming increasingly populated by both active spacecraft and space debris. To avoid potential collision events, significant improvements in Space Situational Awareness (SSA) activities and Collision Avoidance (CA) technologies are allowing the tracking and maneuvering of spacecraft with increasing accuracy and reliability. However, these procedures still largely involve a high level of human intervention to make the necessary decisions. For an increasingly complex space environment, this decision-making strategy is not likely to be sustainable. Therefore, it is important to successfully introduce higher levels of automation for key Space Traffic Management (STM) processes to ensure the level of reliability needed for navigating a large number of spacecraft. These processes range from collision risk detection to the identification of the appropriate action to take and the execution of avoidance maneuvers. This work proposes an implementation of autonomous CA decision-making capabilities on spacecraft based on Reinforcement Learning (RL) techniques. A novel methodology based on a Partially Observable Markov Decision Process (POMDP) framework is developed to train the Artificial Intelligence (AI) system on board the spacecraft, considering epistemic and aleatory uncertainties. The proposed framework considers imperfect monitoring information about the status of the debris in orbit and allows the AI system to effectively learn stochastic policies to perform accurate Collision Avoidance Maneuvers (CAMs). The objective is to successfully delegate the decision-making process for autonomously implementing a CAM to the spacecraft without human intervention. This approach would allow for a faster response in the decision-making process and for highly decentralized operations.
End-to-End Autoregressive Retrieval via Bootstrapping for Smart Reply Systems
results: Experiments show that the method consistently outperforms a range of state-of-the-art baselines across three datasets, corresponding to a 5.1%-17.9% improvement in relevance and a 0.5%-63.1% improvement in diversity.
Abstract
Reply suggestion systems represent a staple component of many instant messaging and email systems. However, the requirement to produce sets of replies, rather than individual replies, makes the task poorly suited for out-of-the-box retrieval architectures, which only consider individual message-reply similarity. As a result, these systems often rely on additional post-processing modules to diversify the outputs. However, these approaches are ultimately bottlenecked by the performance of the initial retriever, which in practice struggles to present a sufficiently diverse range of options to the downstream diversification module, leading to the suggestions being less relevant to the user. In this paper, we consider a novel approach that radically simplifies this pipeline through an autoregressive text-to-text retrieval model that learns the smart reply task end-to-end from a dataset of (message, reply set) pairs obtained via bootstrapping. Empirical results show this method consistently outperforms a range of state-of-the-art baselines across three datasets, corresponding to a 5.1%-17.9% improvement in relevance, and a 0.5%-63.1% improvement in diversity compared to the best baseline approach. We make our code publicly available.
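The end-to-end formulation can be sketched as follows, assuming a T5-style seq2seq model and a "<sep>"-delimited serialization of the reply set (the paper's exact model and delimiter may differ): one autoregressive decode produces the entire reply set, so relevance and diversity are learned jointly rather than bolted on by a post-processing module.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Training pairs obtained via bootstrapping: one message -> one reply *set*,
# serialized as a single target string so the whole set is learned end-to-end.
message = "suggest replies: Are you free for lunch tomorrow?"
target = "Sure, what time? <sep> Sorry, I'm booked. <sep> Can we do Friday?"

batch = tok(message, return_tensors="pt")
labels = tok(target, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss      # standard seq2seq fine-tuning loss
loss.backward()

# At inference, a single autoregressive decode yields the full reply set.
out = model.generate(**batch, max_new_tokens=48)
replies = [r.strip()
           for r in tok.decode(out[0], skip_special_tokens=True).split("<sep>")]
print(replies)
```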
Mask Propagation for Efficient Video Semantic Segmentation
results: Our mask propagation framework achieves state-of-the-art accuracy-efficiency trade-offs on the VSPW and Cityscapes datasets. For example, our best model (Swin-L backbone) outperforms the SOTA MRCFA (using MiT-B5) by 4.0% mIoU on the VSPW dataset while requiring only 26% of the FLOPs. Moreover, our framework reduces FLOPs by up to 4x on the Cityscapes validation set with only a 2% mIoU drop. Code is available at https://github.com/ziplab/MPVSS.
Abstract
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, the mask-flow pairs are warped to serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we circumvent the need to process a large volume of video frames individually with resource-intensive segmentors, alleviating temporal redundancy and significantly reducing computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves SOTA accuracy and efficiency trade-offs. For instance, our best model with Swin-L backbone outperforms the SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW dataset. Moreover, our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS.
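A schematic sketch of the warping step is given below: a key-frame mask is propagated to a non-key frame by sampling it along a dense flow field with torch's grid_sample. The mask and flow tensors are stand-ins; in MPVSS they come from the query-based segmentor and the segment-aware flow module.

```python
import torch
import torch.nn.functional as F

def warp_mask(mask, flow):
    """Warp a key-frame mask to a non-key frame along a dense flow field.
    mask: (N, 1, H, W) mask scores; flow: (N, 2, H, W) displacements in pixels."""
    n, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()      # (H, W, 2) in xy order
    coords = grid + flow.permute(0, 2, 3, 1)          # displaced sample points
    # normalize to [-1, 1] as required by grid_sample
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    return F.grid_sample(mask, coords, align_corners=True)

key_mask = torch.rand(1, 1, 64, 64)   # stand-in for a key-frame mask prediction
flow = torch.zeros(1, 2, 64, 64)      # zero flow: output equals the input mask
propagated = warp_mask(key_mask, flow)
```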
Building a Safer Maritime Environment Through Multi-Path Long-Term Vessel Trajectory Forecasting
paper_authors: Gabriel Spadon, Jay Kumar, Matthew Smith, Sarah Vela, Romina Gehrmann, Derek Eden, Joshua van Berkel, Amilcar Soares, Ronan Fablet, Ronald Pelot, Stan Matwin
results: The model achieves an R2 score exceeding 98% in the Gulf of St. Lawrence, Canada, with varying techniques and features. It also demonstrates superior complex decision-making during path selection and enhanced accuracy, with average and median forecasting errors of 11 km and 6 km, respectively.
Abstract
Maritime transport is paramount to global economic growth and environmental sustainability. In this regard, the Automatic Identification System (AIS) data plays a significant role by offering real-time streaming data on vessel movement, which allows for enhanced traffic surveillance, assisting in vessel safety by avoiding vessel-to-vessel collisions and proactively preventing vessel-to-whale ones. This paper tackles an intrinsic problem to trajectory forecasting: the effective multi-path long-term vessel trajectory forecasting on engineered sequences of AIS data. We utilize an encoder-decoder model with Bidirectional Long Short-Term Memory Networks (Bi-LSTM) to predict the next 12 hours of vessel trajectories using 1 to 3 hours of AIS data. We feed the model with probabilistic features engineered from the AIS data that refer to the potential route and destination of each trajectory so that the model, leveraging convolutional layers for spatial feature learning and a position-aware attention mechanism that increases the importance of recent timesteps of a sequence during temporal feature learning, forecasts the vessel trajectory taking the potential route and destination into account. The F1 Score of these features is approximately 85% and 75%, indicating their efficiency in supplementing the neural network. We trialed our model in the Gulf of St. Lawrence, one of the North Atlantic Right Whales (NARW) habitats, achieving an R2 score exceeding 98% with varying techniques and features. Despite the high R2 score being attributed to well-defined shipping lanes, our model demonstrates superior complex decision-making during path selection. In addition, our model shows enhanced accuracy, with average and median forecasting errors of 11km and 6km, respectively. Our study confirms the potential of geographical data engineering and trajectory forecasting models for preserving marine life species.
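A simplified sketch of the encoder-decoder idea follows: a bidirectional LSTM encodes 1-3 hours of engineered AIS features, and an autoregressive decoder rolls the trajectory forward for 12 hours. Feature sizes, the time bin, and the omitted convolutional and position-aware attention components are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TrajectoryForecaster(nn.Module):
    def __init__(self, n_feat=6, hidden=64, horizon=72):  # 72 steps ~ 12 h at 10-min bins
        super().__init__()
        self.encoder = nn.LSTM(n_feat, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTMCell(2, 2 * hidden)
        self.head = nn.Linear(2 * hidden, 2)               # next (lat, lon) position
        self.horizon = horizon

    def forward(self, x, last_pos):
        """x: (B, obs_len, n_feat) engineered AIS features; last_pos: (B, 2)."""
        _, (h, c) = self.encoder(x)
        h = h.permute(1, 0, 2).reshape(x.size(0), -1)      # join fwd/bwd directions
        c = c.permute(1, 0, 2).reshape(x.size(0), -1)
        preds, pos = [], last_pos
        for _ in range(self.horizon):                      # autoregressive roll-out
            h, c = self.decoder(pos, (h, c))
            pos = self.head(h)
            preds.append(pos)
        return torch.stack(preds, dim=1)                   # (B, horizon, 2)

model = TrajectoryForecaster()
out = model(torch.randn(4, 18, 6), torch.randn(4, 2))      # 3 h of 10-min AIS fixes
```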
Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game
results: The agent achieves the highest win rate against other LLM-based agents and remains robust against adversarial human players.
Abstract
Agents built with large language models (LLMs) have recently achieved great advancements. However, most of the efforts focus on single-agent or cooperative settings, leaving more general multi-agent environments underexplored. We propose a new framework powered by reinforcement learning (RL) to develop strategic language agents, i.e., LLM-based agents with strategic thinking ability, for a popular language game, Werewolf. Werewolf is a social deduction game with hidden roles that involves both cooperation and competition and emphasizes deceptive communication and diverse gameplay. Our agent tackles this game by first using LLMs to reason about potential deceptions and generate a set of strategically diverse actions. Then an RL policy, which selects an action from the candidates, is learned by population-based training to enhance the agents' decision-making ability. By combining LLMs with the RL policy, our agent produces a variety of emergent strategies, achieves the highest win rate against other LLM-based agents, and stays robust against adversarial human players in the Werewolf game.
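The two-stage decision loop can be sketched as below, with the LLM call stubbed out: the language model proposes strategically diverse candidate actions, and a lightweight RL policy scores and samples among them. Prompts, features, and the population-based training loop are placeholders rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def llm_propose_actions(game_state):
    """Stub for the LLM step: reason about deception, return diverse candidates."""
    return ["vote villager_3", "claim seer", "stay silent", "defend werewolf_1"]

def featurize(game_state, action):
    """Stub: embed the (state, action) pair for the RL policy."""
    return rng.standard_normal(8)

w = np.zeros(8)                             # linear RL policy over candidates

def select_action(game_state):
    candidates = llm_propose_actions(game_state)
    feats = np.stack([featurize(game_state, a) for a in candidates])
    logits = feats @ w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = rng.choice(len(candidates), p=probs)
    return candidates[i], feats[i], probs   # action + data for the policy update

print(select_action({"round": 1})[0])
```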
Machine Learning Algorithms to Predict Chess960 Result and Develop Opening Themes
results: The study uses three machine learning algorithms (KNN Clustering, Random Forest, and Gradient Boosted Trees) to predict game outcomes, and predicts the region towards which a game is developing by analysing the movement of pieces in each board region during the opening.
Abstract
This work focuses on the analysis of Chess 960, also known as Fischer Random Chess, a variant of traditional chess where the starting positions of the pieces are randomized. The study aims to predict the game outcome using machine learning techniques and develop an opening theme for each starting position. The first part of the analysis utilizes machine learning models to predict the game result based on certain moves in each position. The methodology involves segregating raw data from .pgn files into usable formats and creating datasets comprising approximately 500 games for each starting position. Three machine learning algorithms -- KNN Clustering, Random Forest, and Gradient Boosted Trees -- have been used to predict the game outcome. To establish an opening theme, the board is divided into five regions: center, white kingside, white queenside, black kingside, and black queenside. The data from games played by top engines in all 960 positions is used to track the movement of pieces in the opening. By analysing the change in the number of pieces in each region at specific moves, the report predicts the region towards which the game is developing. These models provide valuable insights into predicting game outcomes and understanding the opening theme in Chess 960.
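A minimal sketch of the outcome-prediction step follows, with random stand-ins for the move-based features extracted from the .pgn files (labels 0/1/2 encode white win / draw / black win); the three classifiers mirror those named in the report.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))           # ~500 games per starting position
y = rng.integers(0, 3, size=500)             # 0/1/2 = white win / draw / black win

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for clf in (KNeighborsClassifier(n_neighbors=7),
            RandomForestClassifier(n_estimators=200, random_state=0),
            GradientBoostingClassifier(random_state=0)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, clf.score(X_te, y_te))
```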
The Utility of “Even if…” Semifactual Explanation to Optimise Positive Outcomes
results: Compared with prior work, the paper's algorithms are better at maximising the user's Gain, and causality proves important in the process. Most importantly, a user study shows that people find semifactual explanations more useful than counterfactuals when they receive the positive outcome of a loan acceptance.
Abstract
When users receive either a positive or negative outcome from an automated system, Explainable AI (XAI) has almost exclusively focused on how to mutate negative outcomes into positive ones by crossing a decision boundary using counterfactuals (e.g., \textit{"If you earn 2k more, we will accept your loan application"}). Here, we instead focus on \textit{positive} outcomes, and take the novel step of using XAI to optimise them (e.g., \textit{"Even if you wish to half your down-payment, we will still accept your loan application"}). Explanations such as these that employ "even if..." reasoning, and do not cross a decision boundary, are known as semifactuals. To instantiate semifactuals in this context, we introduce the concept of \textit{Gain} (i.e., how much a user stands to benefit from the explanation), and consider the first causal formalisation of semifactuals. Tests on benchmark datasets show our algorithms are better at maximising gain compared to prior work, and that causality is important in the process. Most importantly however, a user study supports our main hypothesis by showing people find semifactual explanations more useful than counterfactuals when they receive the positive outcome of a loan acceptance.
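A toy sketch of semifactual search is shown below: starting from an accepted loan application, the down-payment is lowered as far as possible while the classifier's positive prediction holds, and the Gain is the amount saved. The model, features, and Gain definition are invented stand-ins; the paper's causal formalisation is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2))                 # features: [income, down_payment]
y = (X @ np.array([1.0, 1.5]) > 0).astype(int)    # toy acceptance rule
clf = LogisticRegression().fit(X, y)

x = np.array([1.0, 1.0])                          # an accepted applicant
assert clf.predict([x])[0] == 1

def gain(x_new, x_old):
    return x_old[1] - x_new[1]                    # benefit = down-payment saved

# "Even if you lower your down-payment to d, we still accept": scan downwards
# and keep the largest-gain point that stays on the positive side.
best = x
for d in np.linspace(x[1], x[1] - 3.0, 200):
    cand = np.array([x[0], d])
    if clf.predict([cand])[0] == 1 and gain(cand, x) > gain(best, x):
        best = cand
print("semifactual:", best, "gain:", gain(best, x))
```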
Self Attention with Temporal Prior: Can We Learn More from Arrow of Time?
results: Experiments show that the method achieves excellent prediction results on Electronic Health Record (EHR) datasets, outperforming the best-performing models on most tasks and datasets.
Abstract
Many diverse phenomena in nature inherently encode both short- and long-term temporal dependencies, with short-term dependencies in particular arising from the direction of the flow of time. In this respect, we discovered experimental evidence suggesting that the {\it interrelations} of these events are higher for closer time stamps. However, for attention-based models to learn these regularities in short-term dependencies, large amounts of data are required, which is often infeasible. This is because, while attention-based models are good at learning piecewise temporal dependencies, they lack structures that encode biases in time series. As a resolution, we propose a simple and efficient method that enables attention layers to better encode the short-term temporal bias of these datasets by applying learnable, adaptive kernels directly to the attention matrices. For the experiments, we chose various prediction tasks on Electronic Health Record (EHR) datasets, since they are great examples with underlying long- and short-term temporal dependencies. Our experimental results show exceptional classification performance compared to best-performing models on most of the tasks and datasets.
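A minimal sketch of the mechanism follows: attention weights are modulated by a learnable kernel of temporal distance, so nearby timestamps receive systematically more weight. An exponential-decay kernel is assumed here for illustration; the paper's adaptive kernel family may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPriorAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.decay = nn.Parameter(torch.tensor(0.0))  # learnable kernel parameter

    def forward(self, x, timestamps):
        """x: (B, T, dim); timestamps: (B, T), e.g. days since admission."""
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        dt = (timestamps[:, :, None] - timestamps[:, None, :]).abs()
        kernel = torch.exp(-F.softplus(self.decay) * dt)  # closer in time -> larger
        attn = F.softmax(scores, dim=-1) * kernel
        attn = attn / attn.sum(-1, keepdim=True)          # renormalise after weighting
        return attn @ v

layer = TemporalPriorAttention(32)
out = layer(torch.randn(2, 10, 32), torch.cumsum(torch.rand(2, 10), dim=1))
```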
CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing
for: The paper is written for improving the efficiency of video retrieval by compressing videos into binary codes and learning accurate hash codes for video retrieval.
methods: The paper uses contrastive learning with augmentation strategies to capture global spatio-temporal information and local spatio-temporal details within video frames, and incorporates two collaborative learning tasks to enhance the perception of temporal structure and the modeling of spatio-temporal relationships.
results: The proposed method outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets.
Abstract
Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully utilized the temporal dynamics and spatial appearance of videos due to less challenging and unreliable learning tasks. To address these challenges, we begin by utilizing the contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our codes will be released.
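The global contrastive objective over relaxed hash codes can be sketched as follows (encoder, augmentations, and the two collaborative tasks omitted; the tanh relaxation is a common choice assumed here rather than taken from the paper): codes from two augmented views of the same video are treated as positives in an NT-Xent-style loss.

```python
import torch
import torch.nn.functional as F

def contrastive_hash_loss(z1, z2, tau=0.2):
    """z1, z2: (B, bits) encoder outputs for two augmented views of the same videos."""
    h1, h2 = torch.tanh(z1), torch.tanh(z2)          # relaxed binary codes
    h = F.normalize(torch.cat([h1, h2]), dim=1)
    sim = h @ h.t() / tau                            # (2B, 2B) similarities
    b = z1.size(0)
    sim = sim.masked_fill(torch.eye(2 * b, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)             # positive = the other view

loss = contrastive_hash_loss(torch.randn(8, 64, requires_grad=True),
                             torch.randn(8, 64))
loss.backward()
```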
QWID: Quantized Weed Identification Deep neural network
results: On the ResNet-50 and InceptionV3 architectures, the method balances accuracy against model size and inference time, achieving significant reductions in model size and inference time in real-world production settings such as Desktop, Mobile, and Raspberry Pi while maintaining accuracy.
Abstract
In this paper, we present an efficient solution for weed classification in agriculture. We focus on optimizing model performance at inference while respecting the constraints of the agricultural domain. We propose a Quantized Deep Neural Network model that classifies a dataset of 9 weed classes using 8-bit integer (int8) quantization, a departure from standard 32-bit floating point (fp32) models. Recognizing the hardware resource limitations in agriculture, our model balances model size, inference time, and accuracy, aligning with practical requirements. We evaluate the approach on ResNet-50 and InceptionV3 architectures, comparing their performance against their int8 quantized versions. Transfer learning and fine-tuning are applied using the DeepWeeds dataset. The results show staggering model size and inference time reductions while maintaining accuracy in real-world production scenarios like Desktop, Mobile and Raspberry Pi. Our work sheds light on a promising direction for efficient AI in agriculture, holding potential for broader applications. Code: https://github.com/parikshit14/QNN-for-weed
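One plausible route to the int8 conversion is post-training static quantization with PyTorch's FX graph mode, sketched below; the paper does not specify this exact toolchain, and calibration would use real DeepWeeds batches rather than the random tensor shown.

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
from torchvision.models import resnet50

model = resnet50(num_classes=9).eval()                    # 9 weed classes
qconfig_mapping = get_default_qconfig_mapping("fbgemm")   # x86 int8 backend
example = torch.randn(1, 3, 224, 224)

prepared = prepare_fx(model, qconfig_mapping, example_inputs=(example,))
with torch.no_grad():
    prepared(example)                                     # calibration pass(es)
int8_model = convert_fx(prepared)

torch.save(int8_model.state_dict(), "qwid_int8.pt")
print(int8_model(example).shape)                          # (1, 9) logits
```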
Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation
results: The paper proposes Delayed-PSVI, an optimistic value-based algorithm, and provides the first analysis of posterior sampling with delayed feedback in RL. The algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays.
Abstract
Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.
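The core update can be stylised as below: a ridge-regression estimate of the linear value parameter is perturbed with Gaussian noise shaped by the inverse design matrix, and only episodes whose delayed feedback has arrived contribute to the fit. Dimensions and the noise scale are illustrative, not the algorithm's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, sigma = 4, 1.0, 0.5

# Only (feature, target) pairs whose delayed feedback has already arrived.
arrived = [(rng.standard_normal(d), rng.normal()) for _ in range(50)]
Phi = np.stack([p for p, _ in arrived])
y = np.array([t for _, t in arrived])

# Ridge-regression estimate of the linear value parameter w.
A = Phi.T @ Phi + lam * np.eye(d)
w_hat = np.linalg.solve(A, Phi.T @ y)

# Posterior sampling: perturb the estimate with noise shaped by A^{-1},
# which drives optimistic exploration of the value-function space.
w_sample = w_hat + sigma * np.linalg.cholesky(np.linalg.inv(A)) @ rng.standard_normal(d)

def q_value(phi, w):
    return phi @ w          # greedy action = argmax over candidate features

print(q_value(rng.standard_normal(d), w_sample))
```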
Debiasing Algorithm through Model Adaptation
paper_authors: Tomasz Limisiewicz, David Mareček, Tomáš Musil
for: This work aims to detect and mitigate gender bias in language models.
methods: The study uses causal analysis to identify problematic model components and finds that mid-upper feed-forward layers are most prone to conveying bias. Based on the analysis results, the model is adapted by multiplying these layers by a linear projection.
results: The DAMA method significantly reduces bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. Code and models are released, retaining LLaMA's state-of-the-art performance while being significantly less biased.
Abstract
Large language models are becoming the go-to solution for various language tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey biases. Based on the analysis results, we adapt the model by multiplying these layers by a linear projection. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retain LLaMA's state-of-the-art performance while being significantly less biased.
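The adaptation step can be sketched as follows: a feed-forward output matrix is composed with a linear projection that nulls an identified bias direction. The direction below is random for illustration; DAMA derives it from the causal analysis.

```python
import torch

d = 64
W_out = torch.randn(d, d)                 # output matrix of a mid-upper FF layer

v = torch.randn(d)
v = v / v.norm()                          # stand-in for an identified gender direction
P = torch.eye(d) - torch.outer(v, v)      # projection onto v's orthogonal complement

W_debiased = P @ W_out                    # fold the projection into the layer
# The edited layer can no longer write along v, so downstream components stop
# receiving the biased component while all other directions are untouched.
assert torch.allclose(v @ (W_debiased @ torch.randn(d)),
                      torch.tensor(0.0), atol=1e-3)
```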
InstanT: Semi-supervised Learning with Instance-dependent Thresholds
results: Experiments show that instance-dependent threshold functions improve SSL performance and adapt better to the varied data distributions encountered in real-world applications.
Abstract
Semi-supervised learning (SSL) has been a fundamental challenge in machine learning for decades. The primary family of SSL algorithms, known as pseudo-labeling, involves assigning pseudo-labels to confident unlabeled instances and incorporating them into the training set. Therefore, the selection criteria of confident instances are crucial to the success of SSL. Recently, there has been growing interest in the development of SSL methods that use dynamic or adaptive thresholds. Yet, these methods typically apply the same threshold to all samples, or use class-dependent thresholds for instances belonging to a certain class, while neglecting instance-level information. In this paper, we propose the study of instance-dependent thresholds, which has the highest degree of freedom compared with existing methods. Specifically, we devise a novel instance-dependent threshold function for all unlabeled instances by utilizing their instance-level ambiguity and the instance-dependent error rates of pseudo-labels, so instances that are more likely to have incorrect pseudo-labels will have higher thresholds. Furthermore, we demonstrate that our instance-dependent threshold function provides a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns.
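A toy sketch of the idea follows: each unlabeled instance gets its own confidence threshold, here a simple increasing function of its normalized predictive entropy, so more ambiguous instances must clear a higher bar before their pseudo-labels are accepted. The functional form is invented; the paper's threshold combines instance-level ambiguity with instance-dependent pseudo-label error rates.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=np.ones(10) * 0.3, size=2000)   # unlabeled predictions

conf = probs.max(axis=1)
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
ambiguity = entropy / np.log(10)           # normalise to [0, 1]

base, slope = 0.80, 0.15
tau = base + slope * ambiguity             # one threshold per instance
selected = conf >= tau                     # accept a pseudo-label only above its own bar

print(f"selected {selected.mean():.1%} of the unlabeled pool")
```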
Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text Detection
results: The method achieves an accuracy score of 0.9555 on the official test data.
Abstract
This paper reports our submission under the team name `SynthDetectives' to the ALTA 2023 Shared Task. We use a stacking ensemble of Transformers for the task of AI-generated text detection. Our approach is novel in terms of its choice of models in that we use accessible and lightweight models in the ensemble. We show that ensembling the models results in an improved accuracy in comparison with using them individually. Our approach achieves an accuracy score of 0.9555 on the official test data provided by the shared task organisers.
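The stacking step can be sketched as below: per-model probabilities become features for a logistic-regression meta-learner, evaluated with out-of-fold predictions. The base-model outputs are random stand-ins here, and the meta-learner choice is an assumption; the paper only specifies a stacking ensemble of Transformers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)          # 1 = AI-generated, 0 = human-written

# Columns = P(AI) from each base transformer, ideally produced out-of-fold so
# the meta-learner never sees probabilities from a model trained on that row.
base_probs = np.clip(y[:, None] * 0.6 + rng.random((1000, 3)) * 0.5, 0, 1)

meta = LogisticRegression()
oof = cross_val_predict(meta, base_probs, y, cv=5, method="predict_proba")[:, 1]
meta.fit(base_probs, y)
print("stacked accuracy:", ((oof > 0.5) == y).mean())
```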
Ever Evolving Evaluator (EV3): Towards Flexible and Reliable Meta-Optimization for Knowledge Distillation
results: Experiments show that EV3 can safely explore model spaces and, in settings with multiple objectives, dynamically prioritize tasks. Its inherent flexibility and adaptability suggest potential applicability across numerous domains.
Abstract
We introduce EV3, a novel meta-optimization framework designed to efficiently train scalable machine learning models through an intuitive explore-assess-adapt protocol. In each iteration of EV3, we explore various model parameter updates, assess them using pertinent evaluation methods, and adapt the model based on the optimal updates and previous progress history. EV3 offers substantial flexibility without imposing stringent constraints like differentiability on the key objectives relevant to the tasks of interest. Moreover, this protocol welcomes updates with biased gradients and allows for the use of a diversity of losses and optimizers. Additionally, in scenarios with multiple objectives, it can be used to dynamically prioritize tasks. With inspiration drawn from evolutionary algorithms, meta-learning, and neural architecture search, we investigate an application of EV3 to knowledge distillation. Our experimental results illustrate EV3's capability to safely explore model spaces, while hinting at its potential applicability across numerous domains due to its inherent flexibility and adaptability.
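One explore-assess-adapt iteration can be sketched as follows, with placeholder proposals and a placeholder evaluation function: candidates need not come from differentiable objectives, and any metric can score them before the best update is adopted.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(params):
    """Stub for any task-relevant metric; it need not be differentiable."""
    return -np.sum((params - 3.0) ** 2)

params = np.zeros(5)
for step in range(100):
    # Explore: propose diverse candidate updates (different optimizers, losses,
    # or noise scales could each contribute candidates here).
    candidates = [params + rng.normal(0.0, s, size=5) for s in (0.01, 0.1, 1.0)]
    # Assess: score every candidate with the pertinent evaluation method.
    scores = [evaluate(c) for c in candidates]
    # Adapt: adopt the best candidate only if it improves on the current model.
    best = int(np.argmax(scores))
    if scores[best] > evaluate(params):
        params = candidates[best]
print(evaluate(params))
```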
Towards Generalized Multi-stage Clustering: Multi-view Self-distillation
results: Experiments on real-world multi-view datasets show that the method achieves better clustering performance than existing state-of-the-art methods.
Abstract
Existing multi-stage clustering methods independently learn the salient features from multiple views and then perform the clustering task. Particularly, multi-view clustering (MVC) has attracted a lot of attention in multi-view or multi-modal scenarios. MVC aims at exploring common semantics and pseudo-labels from multiple views and clustering in a self-supervised manner. However, limited by noisy data and inadequate feature learning, such a clustering paradigm generates overconfident pseudo-labels that mis-guide the model to produce inaccurate predictions. Therefore, it is desirable to have a method that can correct this pseudo-label mistraction in multi-stage clustering to avoid the bias accumulation. To alleviate the effect of overconfident pseudo-labels and improve the generalization ability of the model, this paper proposes a novel multi-stage deep MVC framework where multi-view self-distillation (DistilMVC) is introduced to distill dark knowledge of label distribution. Specifically, in the feature subspace at different hierarchies, we explore the common semantics of multiple views through contrastive learning and obtain pseudo-labels by maximizing the mutual information between views. Additionally, a teacher network is responsible for distilling pseudo-labels into dark knowledge, supervising the student network and improving its predictive capabilities to enhance the robustness. Extensive experiments on real-world multi-view datasets show that our method has better clustering performance than state-of-the-art methods.
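The self-distillation step can be sketched as below: a momentum-updated teacher turns pseudo-labels into a softened ("dark knowledge") distribution that supervises the student's cluster predictions via a KL term. Temperatures, the EMA rate, and the linear cluster heads are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, t_student=1.0, t_teacher=4.0):
    p_teacher = F.softmax(teacher_logits / t_teacher, dim=1)     # dark knowledge
    log_p_student = F.log_softmax(student_logits / t_student, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

@torch.no_grad()
def ema_update(teacher, student, m=0.99):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

student = torch.nn.Linear(32, 10)     # stand-in cluster head over fused views
teacher = torch.nn.Linear(32, 10)
teacher.load_state_dict(student.state_dict())

x = torch.randn(16, 32)
loss = distill_loss(student(x), teacher(x))
loss.backward()
ema_update(teacher, student)
```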
Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks
results: The method learns DNNs with structured matrices that achieve lower complexity and/or higher performance than prior approaches employing low-rank, block-sparse, or block-low-rank matrices.
Abstract
This paper investigates efficient deep neural networks (DNNs) to replace dense unstructured weight matrices with structured ones that possess desired properties. The challenge arises because the optimal weight matrix structure in popular neural network models is obscure in most cases and may vary from layer to layer even in the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted without a generalized framework to systematically learn them. To address this issue, we propose a generalized and differentiable framework to learn efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting the structural parameters. Then, the frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to learn the structural parameters by proximal gradient descent. Finally, we introduce an effective initialization method for the proposed scheme. Our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices.
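The optimisation pattern, though not the paper's parameterization, can be sketched as follows: structural parameters gating the weights are updated by gradient descent plus a proximal (soft-thresholding) step, so a sparse structure is learned jointly with the weights. The Gaussian-Dirichlet frequency-domain kernel itself is not reproduced here.

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 32, requires_grad=True)   # dense weights
s = torch.ones(32, 32, requires_grad=True)    # structural parameters gating W
x, y = torch.randn(64, 32), torch.randn(64, 32)
lr, lam = 0.05, 0.5

for step in range(200):
    loss = ((x @ (W * s).t() - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        s -= lr * s.grad
        # Proximal step (L1 shrinkage): can drive entries of s exactly to zero,
        # so the surviving pattern defines the learned matrix structure.
        s.copy_(torch.sign(s) * torch.clamp(s.abs() - lr * lam, min=0.0))
        W.grad.zero_()
        s.grad.zero_()

print("zeroed structural entries:", int((s == 0).sum()))
```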
HDMNet: A Hierarchical Matching Network with Double Attention for Large-scale Outdoor LiDAR Point Cloud Registration
results: Extensive experiments on two large-scale outdoor LiDAR point cloud datasets demonstrate the high accuracy and efficiency of the proposed HDMNet.
Abstract
Outdoor LiDAR point clouds are typically large-scale and complexly distributed. To achieve efficient and accurate registration, it is of utmost importance to emphasize the similarity among local regions and to prioritize global local-to-local matching, after which accuracy can be enhanced through cost-effective fine registration. In this paper, a novel hierarchical neural network with double attention, named HDMNet, is proposed for large-scale outdoor LiDAR point cloud registration. Specifically, a novel feature-consistency-enhanced double-soft matching network is introduced to achieve two-stage matching with high flexibility while enlarging the receptive field with high efficiency in a patch-to-patch manner, which significantly improves registration performance. Moreover, to further utilize the sparse matching information from the deeper layers, we develop a novel trainable embedding mask that incorporates the confidence scores of correspondences obtained from the pose estimation of the deeper layers, eliminating additional computations. The high-confidence keypoints in the sparser point cloud of the deeper layers correspond to high-confidence spatial neighborhood regions in the shallower layers, which receive more attention, while the features of non-key regions are masked. Extensive experiments are conducted on two large-scale outdoor LiDAR point cloud datasets to demonstrate the high accuracy and efficiency of the proposed HDMNet.
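One patch-to-patch soft-matching stage can be sketched as below with a dual-softmax over a feature-similarity matrix; HDMNet's feature-consistency enhancement, hierarchy, and trainable embedding mask are omitted, and the confidence gating shown is a simplification.

```python
import torch
import torch.nn.functional as F

def soft_match(feat_src, feat_tgt, temperature=0.1):
    """feat_*: (N, D) L2-normalised patch descriptors from two point clouds."""
    sim = feat_src @ feat_tgt.t() / temperature
    scores = F.softmax(sim, dim=1) * F.softmax(sim, dim=0)   # dual-softmax confidence
    conf, idx = scores.max(dim=1)
    return idx, conf          # tentative correspondences + confidence for masking

src = F.normalize(torch.randn(128, 32), dim=1)
tgt = F.normalize(torch.randn(140, 32), dim=1)
idx, conf = soft_match(src, tgt)
keep = conf > conf.mean()     # high-confidence pairs get more attention downstream
```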
Prompt-Engineering and Transformer-based Question Generation and Evaluation
results: The study finds that the approach generates questions with high similarity to the baseline; all four prompts averaged over 60% similarity, and 30% of the prompt-generated questions achieved similarity scores above 70%.
Abstract
Question generation has numerous applications in the educational context. Question generation can prove helpful for students when reviewing content and testing themselves. Furthermore, a question generation model can aid teachers by lessening the burden of creating assessments and other practice material. This paper aims to find the best method to generate questions from textual data through a transformer model and prompt engineering. In this research, we finetuned a pretrained distilBERT model on the SQuAD question answering dataset to generate questions. In addition to training a transformer model, prompt engineering was applied to generate questions effectively using the LLaMA model. The generated questions were compared against the baseline questions in the SQuAD dataset to evaluate the effectiveness of four different prompts. All four prompts demonstrated over 60% similarity on average. Of the prompt-generated questions, 30% achieved a high similarity score greater than 70%.
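The evaluation idea can be sketched as follows, using a simple TF-IDF cosine similarity between generated and baseline SQuAD questions; the paper's exact similarity metric is not specified in this summary, so the metric here is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

generated = ["What year did the Normans invade England?",
             "Who led the Norman conquest?"]
baseline = ["In what year did the Normans conquer England?",
            "Who was the leader of the Norman conquest?"]

vec = TfidfVectorizer().fit(generated + baseline)
sims = cosine_similarity(vec.transform(generated),
                         vec.transform(baseline)).diagonal()

for q, s in zip(generated, sims):
    print(f"{s:.0%}  {q}")
print("share above 70%:", (sims > 0.70).mean())
```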