paper_authors: Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg
results: The paper presents a multi-channel, multi-speaker speech recognition system built largely on the NeMo toolkit and evaluated on the 7th CHiME Challenge DASR task. Through comprehensive optimization, the system's performance improved significantly, indicating strong reliability and accuracy in practical use.
Abstract
We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays. The system predominantly comprises the following integral modules: the Speaker Diarization Module, Multi-channel Audio Front-End Processing Module, and the ASR Module. These components collectively establish a cascading system, meticulously processing multi-channel and multi-speaker audio input. Moreover, this paper highlights the comprehensive optimization process that significantly enhanced our system's performance. Our team's submission is largely based on NeMo toolkits and will be publicly available.
Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
results: The paper demonstrates that the simulator can generate large-scale speech mixtures whose statistical properties closely match real-world statistics, and that the simulated data can be used to train effective speaker diarization and voice activity detection models.
Abstract
We introduce a sophisticated multi-speaker speech data simulator, specifically engineered to generate multi-speaker speech recordings. A notable feature of this simulator is its capacity to modulate the distribution of silence and overlap via the adjustment of statistical parameters. This capability offers a tailored training environment for developing neural models suited for speaker diarization and voice activity detection. The acquisition of substantial datasets for speaker diarization often presents a significant challenge, particularly in multi-speaker scenarios. Furthermore, the precise time stamp annotation of speech data is a critical factor for training both speaker diarization and voice activity detection. Our proposed multi-speaker simulator tackles these problems by generating large-scale audio mixtures that maintain statistical properties closely aligned with the input parameters. We demonstrate that the proposed multi-speaker simulator generates audio mixtures with statistical properties that closely align with the input parameters derived from real-world statistics. Additionally, we present the effectiveness of speaker diarization and voice activity detection models, which have been trained exclusively on the generated simulated datasets.
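To make the abstract's central mechanism concrete, below is a minimal sketch of how silence and overlap statistics can steer the stitching of utterances into a session. The parameter names (mean_silence, overlap_prob, mean_overlap) and the exponential/gamma sampling choices are illustrative assumptions, not the simulator's actual API or distributions.

```python
# Minimal sketch: stitch per-speaker utterances into a multi-speaker session
# while controlling the silence and overlap distributions through a few
# statistical parameters. Parameter names and distributions are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def simulate_session(num_utts=20, num_speakers=4,
                     mean_silence=0.8, overlap_prob=0.3, mean_overlap=0.5,
                     mean_utt_len=3.0):
    """Return a list of (speaker_id, onset, offset) annotations in seconds."""
    annotations = []
    cursor = 0.0  # end time of the previous utterance
    for _ in range(num_utts):
        spk = int(rng.integers(num_speakers))
        dur = rng.gamma(shape=2.0, scale=mean_utt_len / 2.0)
        if annotations and rng.random() < overlap_prob:
            # start before the previous utterance ends -> overlapping speech
            onset = max(0.0, cursor - rng.exponential(mean_overlap))
        else:
            # insert a silence gap drawn from an exponential distribution
            onset = cursor + rng.exponential(mean_silence)
        offset = onset + dur
        annotations.append((spk, round(onset, 2), round(offset, 2)))
        cursor = max(cursor, offset)
    return annotations

session = simulate_session()
total = session[-1][2]
speech = sum(off - on for _, on, off in session)
print(f"{len(session)} utterances, ~{speech / total:.0%} speech over {total:.1f}s")
```

The onset/offset tuples double as the precise time-stamp annotations that the abstract notes are needed for diarization and voice activity detection training.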
methods: The study trains and evaluates various end-to-end models with several toolkits, relying heavily on Guided Source Separation (GSS) to convert multi-channel audio to a single channel. The ASR leverages speech representations from self-supervised pre-trained models, and several ASR systems are fused.
results: The systems use oracle segmentation and achieve strong results on the far-field acoustic robustness sub-track of the Distant Automatic Speech Recognition (DASR) task.
Abstract
This paper describes the joint effort of Brno University of Technology (BUT), AGH University of Krakow and University of Buenos Aires on the development of Automatic Speech Recognition systems for the CHiME-7 Challenge. We train and evaluate various end-to-end models with several toolkits. We heavily relied on Guided Source Separation (GSS) to convert multi-channel audio to single channel. The ASR is leveraging speech representations from models pre-trained by self-supervised learning, and we do a fusion of several ASR systems. In addition, we modified external data from the LibriSpeech corpus to become a close domain and added it to the training. Our efforts were focused on the far-field acoustic robustness sub-track of Task 1 - Distant Automatic Speech Recognition (DASR), our systems use oracle segmentation.
Physics-informed Neural Network for Acoustic Resonance Analysis
results: Resonance analysis was performed on a one-dimensional acoustic tube, and the effectiveness of the proposed method was validated through forward and inverse analyses of the wave equation.
Abstract
This study proposes the physics-informed neural network (PINN) framework to solve the wave equation for acoustic resonance analysis. ResoNet, the analytical model proposed in this study, minimizes the loss function for periodic solutions, in addition to conventional PINN loss functions, thereby effectively using the function approximation capability of neural networks, while performing resonance analysis. Additionally, it can be easily applied to inverse problems. Herein, the resonance in a one-dimensional acoustic tube was analyzed. The effectiveness of the proposed method was validated through the forward and inverse analyses of the wave equation with energy-loss terms. In the forward analysis, the applicability of PINN to the resonance problem was evaluated by comparison with the finite-difference method. The inverse analysis, which included the identification of the energy loss term in the wave equation and design optimization of the acoustic tube, was performed with good accuracy.
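For reference, an illustrative form of the governing equation and composite objective described in the abstract is given below: a 1-D wave equation with a damping (energy-loss) term, and a PINN loss that adds a periodicity term to the usual PDE-residual and boundary terms. The exact operators and weights used by ResoNet may differ.

```latex
% Illustrative only: a lossy 1-D wave equation and a composite PINN objective
% with an added periodicity term; the exact terms used by ResoNet may differ.
\[
  \frac{\partial^2 p}{\partial t^2}
  + \alpha \frac{\partial p}{\partial t}
  = c^2 \frac{\partial^2 p}{\partial x^2}
\]
\[
  \mathcal{L}(\theta) =
  \frac{1}{N_r}\sum_{i=1}^{N_r}
    \Big( \partial_{tt} p_\theta(x_i,t_i) + \alpha\,\partial_t p_\theta(x_i,t_i)
          - c^2\,\partial_{xx} p_\theta(x_i,t_i) \Big)^2
  \;+\; \lambda_{b}\,\mathcal{L}_{\mathrm{BC}}
  \;+\; \frac{\lambda_{p}}{N_p}\sum_{j=1}^{N_p}
    \Big( p_\theta(x_j,t_j) - p_\theta(x_j,t_j+T) \Big)^2
\]
```

Here \(\mathcal{L}_{\mathrm{BC}}\) collects the boundary and initial-condition penalties and \(T\) is the excitation period; in the inverse setting, \(\alpha\) (or geometric parameters of the tube) can be optimized jointly with the network weights.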
Blind estimation of audio effects using an auto-encoder approach and differentiable signal processing
results: The auto-encoder approach yields better estimates of the audio quality produced by a chain of AFXs than the traditional parameter-based approach, without requiring knowledge of the exact AFX implementations.
Abstract
Blind Estimation of Audio Effects (BE-AFX) aims at estimating the Audio Effects (AFXs) applied to an original, unprocessed audio sample solely based on the processed audio sample. To train such a system traditional approaches optimize a loss between ground truth and estimated AFX parameters. This involves knowing the exact implementation of the AFXs used for the process. In this work, we propose an alternative solution that eliminates the requirement for knowing this implementation. Instead, we introduce an auto-encoder approach, which optimizes an audio quality metric. We explore, suggest, and compare various implementations of commonly used mastering AFXs, using differential signal processing or neural approximations. Our findings demonstrate that our auto-encoder approach yields superior estimates of the audio quality produced by a chain of AFXs, compared to the traditional parameter-based approach, even if the latter provides a more accurate parameter estimation.
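The following is a toy sketch of the auto-encoder idea under simplifying assumptions: a single differentiable effect (gain plus tanh soft clipping) stands in for a mastering chain, an encoder predicts its parameters from the processed audio, and training minimizes an audio-domain discrepancy (plain L1 here, where the paper would use an audio quality metric). Module names and the effect are hypothetical.

```python
# Toy sketch of the auto-encoder formulation: estimate effect parameters from
# the processed audio, re-apply a differentiable effect to the clean audio,
# and train with an audio-domain loss rather than a parameter loss.
import torch
import torch.nn as nn

class DifferentiableAFX(nn.Module):
    def forward(self, audio, params):
        gain = torch.exp(params[:, 0:1])                        # positive gain
        drive = torch.nn.functional.softplus(params[:, 1:2]) + 1e-3
        return torch.tanh(drive * gain * audio) / drive         # soft clipping

class ParamEncoder(nn.Module):
    def __init__(self, n_params=2):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 16, 64, stride=8), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(16, n_params))
    def forward(self, audio):
        return self.net(audio.unsqueeze(1))

encoder, afx = ParamEncoder(), DifferentiableAFX()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

clean = torch.randn(8, 16000)                       # stand-in for clean audio
true_params = torch.tensor([[0.5, 1.0]]).repeat(8, 1)
processed = afx(clean, true_params)                 # "unknown" processed audio

for step in range(200):
    est_params = encoder(processed)                 # estimate AFX parameters
    resynth = afx(clean, est_params)                # re-apply to clean audio
    loss = (resynth - processed).abs().mean()       # audio-domain objective
    opt.zero_grad(); loss.backward(); opt.step()
print("final audio-domain L1:", loss.item())
```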
EchoScan: Scanning Complex Indoor Geometries via Acoustic Echoes
results: Compared with vision-based methods, EchoScan demonstrates outstanding geometry estimation performance in rooms with various shapes, accurately inferring indoor floorplans and heights.
Abstract
Accurate estimation of indoor space geometries is vital for constructing precise digital twins, whose broad industrial applications include navigation in unfamiliar environments and efficient evacuation planning, particularly in low-light conditions. This study introduces EchoScan, a deep neural network model that utilizes acoustic echoes to perform room geometry inference. Conventional sound-based techniques rely on estimating geometry-related room parameters such as wall position and room size, thereby limiting the diversity of inferable room geometries. Contrarily, EchoScan overcomes this limitation by directly inferring room floorplans and heights, thereby enabling it to handle rooms with arbitrary shapes, including curved walls. The key innovation of EchoScan is its ability to analyze the complex relationship between low- and high-order reflections in room impulse responses (RIRs) using a multi-aggregation module. The analysis of high-order reflections also enables it to infer complex room shapes when echoes are unobservable from the position of an audio device. Herein, EchoScan was trained and evaluated using RIRs synthesized from complex environments, including the Manhattan and Atlanta layouts, employing a practical audio device configuration compatible with commercial, off-the-shelf devices. Compared with vision-based methods, EchoScan demonstrated outstanding geometry estimation performance in rooms with various shapes.
Experimental Results of Underwater Sound Speed Profile Inversion by Few-shot Multi-task Learning
methods: State-of-the-art SSP inversion methods include matched field processing (MFP), compressive sensing (CS), and feedforward neural networks (FNN), among which the FNN shows better real-time performance while maintaining the same level of accuracy.
results: MTL outperforms the state-of-the-art methods in terms of accuracy for SSP inversion, while inheriting the real-time advantage of FNN during the inversion stage.
Abstract
Underwater Sound Speed Profile (SSP) distribution has great influence on the propagation mode of acoustic signals, thus fast and accurate estimation of SSP is of great importance in building underwater observation systems. The state-of-the-art SSP inversion methods include frameworks of matched field processing (MFP), compressive sensing (CS), and feedforward neural networks (FNN), among which the FNN shows better real-time performance while maintaining the same level of accuracy. However, the training of an FNN needs quite a lot of historical SSP samples, which is difficult to satisfy in many ocean areas. This situation is called few-shot learning. To tackle this issue, we propose a multi-task learning (MTL) model with partial parameter sharing among different training tasks. By MTL, common features can be extracted, thus accelerating the learning process on given tasks and reducing the demand for reference samples, so as to enhance the generalization ability in few-shot learning. To verify the feasibility and effectiveness of MTL, a deep-ocean experiment was held in April 2023 in the South China Sea. Results show that MTL outperforms the state-of-the-art methods in terms of accuracy for SSP inversion, while inheriting the real-time advantage of FNN during the inversion stage.
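A minimal sketch of the partial-parameter-sharing idea is shown below: a shared trunk learns features common to all tasks, while small task-specific heads regress the sound speed profiles. Layer sizes, the number of tasks, and the toy data are assumptions for illustration only.

```python
# Sketch of multi-task learning with partial parameter sharing: a shared trunk
# plus lightweight task-specific heads. Dimensions and data are illustrative.
import torch
import torch.nn as nn

class SharedTrunkMTL(nn.Module):
    def __init__(self, in_dim=20, hidden=64, out_dim=50, num_tasks=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, out_dim)
                                    for _ in range(num_tasks)])
    def forward(self, x, task_id):
        return self.heads[task_id](self.shared(x))

model = SharedTrunkMTL()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# toy few-shot data per task: inputs could be surface measurements,
# outputs are discretized sound speed profiles
tasks = [(torch.randn(8, 20), torch.randn(8, 50)) for _ in range(3)]

for epoch in range(100):
    total = 0.0
    for task_id, (x, y) in enumerate(tasks):
        loss = loss_fn(model(x, task_id), y)
        opt.zero_grad(); loss.backward(); opt.step()
        total += loss.item()
print("summed task loss:", total)
```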
results: The proposed rhythm perturbation augmentation improves the ability of anti-spoofing countermeasures to detect TTS-generated speech.
Abstract
Spoofing speech detection is a hot and in-demand research field. However, current spoofing speech detection systems lack convincing evidence. In this paper, to increase the reliability of detection systems, the flaws in the rhythm information inherent in TTS-generated speech are analyzed. TTS models take text as input and utilize acoustic models to predict rhythm information, which introduces artifacts in the rhythm information. By filtering out the vocal tract response, the remaining glottal flow with rhythm information retains detection ability for TTS-generated speech. Based on these analyses, a rhythm perturbation module is proposed to enhance the copy-synthesis data augmentation method. Fake utterances generated by the proposed method force the detection model to pay attention to the artifacts in rhythm information and effectively improve the ability of anti-spoofing countermeasures to detect TTS-generated speech.
results: On downstream classification and clustering tasks over Camelyon16 and a pancreatic cancer dataset, the proposed method outperforms common SSL approaches.
Abstract
Recent advances in whole-slide image (WSI) scanners and computational capabilities have significantly propelled the application of artificial intelligence in histopathology slide analysis. While these strides are promising, current supervised learning approaches for WSI analysis come with the challenge of exhaustively labeling high-resolution slides - a process that is both labor-intensive and time-consuming. In contrast, self-supervised learning (SSL) pretraining strategies are emerging as a viable alternative, given that they don't rely on explicit data annotations. These SSL strategies are quickly bridging the performance disparity with their supervised counterparts. In this context, we introduce an SSL framework. This framework aims for transferable representation learning and semantically meaningful clustering by synergizing invariance loss and clustering loss in WSI analysis. Notably, our approach outperforms common SSL methods in downstream classification and clustering tasks, as evidenced by tests on the Camelyon16 and a pancreatic cancer dataset. The code and additional details are accessible at: https://github.com/wwyi1828/CluSiam.
Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability
results: The survey provides a broad review of video segmentation models, including a component-wise discussion of state-of-the-art transformer-based models and their performance across different video segmentation tasks, together with a review of interpretability methods and of how video models handle temporal dynamics.
Abstract
Video segmentation encompasses a wide range of categories of problem formulation, e.g., object, scene, actor-action and multimodal video segmentation, for delineating task-specific scene components with pixel-level masks. Recently, approaches in this research area shifted from concentrating on ConvNet-based to transformer-based models. In addition, various interpretability approaches have appeared for transformer models and video temporal dynamics, motivated by the growing interest in basic scientific understanding, model diagnostics and societal implications of real-world deployment. Previous surveys mainly focused on ConvNet models on a subset of video segmentation tasks or transformers for classification tasks. Moreover, component-wise discussion of transformer-based video segmentation models has not yet received due focus. In addition, previous reviews of interpretability methods focused on transformers for classification, while analysis of video temporal dynamics modelling capabilities of video models received less attention. In this survey, we address the above with a thorough discussion of various categories of video segmentation, a component-wise discussion of the state-of-the-art transformer-based models, and a review of related interpretability methods. We first present an introduction to the different video segmentation task categories, their objectives, specific challenges and benchmark datasets. Next, we provide a component-wise review of recent transformer-based models and document the state of the art on different video segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc interpretability methods for transformer models and interpretability methods for understanding the role of the temporal dimension in video models. Finally, we conclude our discussion with future research directions.
Improving SCGAN’s Similarity Constraint and Learning a Better Disentangled Representation
For: The paper builds on SCGAN (Similarity-constrained GAN), which adds a similarity constraint between generated images and conditions; the constraint works as a tutor that instructs the generator network to comprehend the difference of representations based on conditions.
Methods: SSIM (Structural Similarity Index Measure) is used to measure the similarity between generated images, and contrastive-loss principles are applied to the similarity constraint.
Results: The modified model performs better on FID (Fréchet Inception Distance) and FactorVAE metrics and generalizes better than comparable models.
Abstract
SCGAN adds a similarity constraint between generated images and conditions as a regularization term on generative adversarial networks. Similarity constraint works as a tutor to instruct the generator network to comprehend the difference of representations based on conditions. We understand how SCGAN works on a deeper level. This understanding makes us realize that the similarity constraint functions like the contrastive loss function. We believe that a model with high understanding and intelligence measures the similarity between images based on their structure and high level features, just like humans do. Two major changes we applied to SCGAN in order to make a modified model are using SSIM to measure similarity between images and applying contrastive loss principles to the similarity constraint. The modified model performs better using FID and FactorVAE metrics. The modified model also has better generalisability compared to other models.
Keywords: Generative Adversarial Nets, Unsupervised Learning, Disentangled Representation Learning, Contrastive Disentanglement, SSIM
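Below is a hedged sketch of the two modifications described above: SSIM as the similarity measure and a contrastive, margin-based formulation of the constraint. The margin value and function shape are illustrative; in actual training a differentiable SSIM implementation would be needed, whereas this sketch uses scikit-image for clarity.

```python
# Sketch of a contrastive, SSIM-based similarity constraint: pull together
# images generated from the same condition, push apart images from different
# conditions once their similarity exceeds a margin. Illustrative only.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def contrastive_ssim_constraint(img_a, img_b, same_condition, margin=0.4):
    """Return a penalty to be added to the generator loss.

    img_a, img_b: HxW float images in [0, 1].
    same_condition: True if both were generated from the same condition code.
    """
    s = ssim(img_a, img_b, data_range=1.0)      # structural similarity
    if same_condition:
        return 1.0 - s                          # encourage high similarity
    return max(0.0, s - margin)                 # penalize similarity above margin

rng = np.random.default_rng(0)
base = rng.random((64, 64))
positive = np.clip(base + 0.05 * rng.standard_normal((64, 64)), 0, 1)
negative = rng.random((64, 64))

print("same condition :", contrastive_ssim_constraint(base, positive, True))
print("diff condition :", contrastive_ssim_constraint(base, negative, False))
```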
REVAMP: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes
paper_authors: Matthew Hull, Zijie J. Wang, Duen Horng Chau
for: This paper is written for researchers and practitioners who want to study and defend against adversarial attacks on deep learning models in computer vision, specifically in the context of autonomous vehicles.
methods: The paper introduces REVAMP, an easy-to-use Python library that allows users to create attack scenarios with arbitrary objects and simulate realistic environmental factors, lighting, reflection, and refraction. REVAMP uses differentiable rendering to reproduce physically plausible adversarial objects.
results: The paper demonstrates the effectiveness of REVAMP in producing adversarial textures that can cause misclassification of objects in real-world scenarios. The audience can choose a scene, object to attack, desired attack class, and number of camera positions to use, and REVAMP will show how the altered texture causes the chosen object to be misclassified in real time.
Abstract
Deep Learning models, such as those used in an autonomous vehicle are vulnerable to adversarial attacks where an attacker could place an adversarial object in the environment, leading to mis-classification. Generating these adversarial objects in the digital space has been extensively studied, however successfully transferring these attacks from the digital realm to the physical realm has proven challenging when controlling for real-world environmental factors. In response to these limitations, we introduce REVAMP, an easy-to-use Python library that is the first-of-its-kind tool for creating attack scenarios with arbitrary objects and simulating realistic environmental factors, lighting, reflection, and refraction. REVAMP enables researchers and practitioners to swiftly explore various scenarios within the digital realm by offering a wide range of configurable options for designing experiments and using differentiable rendering to reproduce physically plausible adversarial objects. We will demonstrate and invite the audience to try REVAMP to produce an adversarial texture on a chosen object while having control over various scene parameters. The audience will choose a scene, an object to attack, the desired attack class, and the number of camera positions to use. Then, in real time, we show how this altered texture causes the chosen object to be mis-classified, showcasing the potential of REVAMP in real-world scenarios. REVAMP is open-source and available at https://github.com/poloclub/revamp.
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection
for: handles extreme data imbalance in real-world datasets, specifically long-tailed object detection (LTOD)
methods:
+ explores extra data with image-level labels, but obtains limited results due to semantic ambiguity and location sensitivity
+ proposes RichSem, a simple and effective method that leverages rich semantics from images as additional soft supervision for training detectors
+ adds a semantic branch to the detector to learn soft semantics and enhance feature representations for long-tailed object detection
results:
+ achieves consistent improvements on both overall and rare-category LVIS under different backbones and detectors
+ achieves state-of-the-art performance without requiring complex training and testing procedures
+ demonstrates effectiveness on other long-tailed datasets with additional experiments
Abstract
Long-tailed object detection (LTOD) aims to handle the extreme data imbalance in real-world datasets, where many tail classes have scarce instances. One popular strategy is to explore extra data with image-level labels, yet it produces limited results due to (1) semantic ambiguity -- an image-level label only captures a salient part of the image, ignoring the remaining rich semantics within the image; and (2) location sensitivity -- the label highly depends on the locations and crops of the original image, which may change after data transformations like random cropping. To remedy this, we propose RichSem, a simple but effective method, which is robust to learn rich semantics from coarse locations without the need of accurate bounding boxes. RichSem leverages rich semantics from images, which are then served as additional soft supervision for training detectors. Specifically, we add a semantic branch to our detector to learn these soft semantics and enhance feature representations for long-tailed object detection. The semantic branch is only used for training and is removed during inference. RichSem achieves consistent improvements on both overall and rare-category of LVIS under different backbones and detectors. Our method achieves state-of-the-art performance without requiring complex training and testing procedures. Moreover, we show the effectiveness of our method on other long-tailed datasets with additional experiments. Code is available at \url{https://github.com/MengLcool/RichSem}.
Object-aware Inversion and Reassembly for Image Editing
methods: The method uses a new search metric that jointly considers the editability of the target and the fidelity of the non-editing region to determine the optimal number of inversion steps. Each editing pair is then edited separately to avoid concept mismatch. Finally, a reassembly step integrates the individual editing results with the non-editing region to obtain the final edited image.
results: The method performs strongly in both single-object and multi-object editing scenarios, especially the latter.
Abstract
By comparing the original and target prompts in editing task, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically involve a fixed number of inversion steps that project the whole input image to its noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among different editing pairs, owing to varying editing difficulties. Therefore, the current literature, which relies on a fixed number of inversion steps, produces sub-optimal generation quality, especially when handling multiple editing pairs in a natural image. To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric, which determines the optimal inversion steps for each editing pair, by jointly considering the editability of the target and the fidelity of the non-editing region. We use our search metric to find the optimal inversion step for each editing pair when editing an image. We then edit these editing pairs separately to avoid concept mismatch. Subsequently, we propose an additional reassembly step to seamlessly integrate the respective editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
paper_authors: Hanbo Zhang, Jie Xu, Yuchen Mo, Tao Kong
for: addresses the issues of reduced performance in realistic and open-ended scenarios in Human-Robot Interaction (HRI) by presenting a large-scale dataset, \invig, for interactive visual grounding under language ambiguity.
methods: leverages the \invig dataset and proposes a set of baseline solutions for end-to-end interactive visual disambiguation and grounding.
results: achieves a 45.6% success rate during validation, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI.
Abstract
Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present a large-scale dataset, \invig, for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the \invig dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6\% success rate during validation. To the best of our knowledge, the \invig dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Codes and datasets are available at: \href{https://openivg.github.io}{https://openivg.github.io}.
HSTR-Net: Reference Based Video Super-resolution for Aerial Surveillance with Dual Cameras
results: simulations show that the proposed model provides significant improvement over existing reference-based SR techniques in terms of PSNR and SSIM metrics. The method also exhibits sufficient frames per second (FPS) for WAS when deployed on a power-constrained drone equipped with dual cameras.
Abstract
Aerial surveillance requires high spatio-temporal resolution (HSTR) video for more accurate detection and tracking of objects. This is especially true for wide-area surveillance (WAS), where the surveyed region is large and the objects of interest are small. This paper proposes a dual camera system for the generation of HSTR video using reference-based super-resolution (RefSR). One camera captures high spatial resolution low frame rate (HSLF) video while the other captures low spatial resolution high frame rate (LSHF) video simultaneously for the same scene. A novel deep learning architecture is proposed to fuse HSLF and LSHF video feeds and synthesize HSTR video frames at the output. The proposed model combines optical flow estimation and (channel-wise and spatial) attention mechanisms to capture the fine motion and intricate dependencies between frames of the two video feeds. Simulations show that the proposed model provides significant improvement over existing reference-based SR techniques in terms of PSNR and SSIM metrics. The method also exhibits sufficient frames per second (FPS) for WAS when deployed on a power-constrained drone equipped with dual cameras.
One-Shot Imitation Learning: A Pose Estimation Perspective
paper_authors: Pietro Vitiello, Kamil Dreczkowski, Edward Johns
for: studies imitation learning from a single demonstration, with no further data collection and no prior task or object knowledge.
methods: combines trajectory transfer with unseen object pose estimation to achieve one-shot imitation learning.
results: provides an in-depth study on ten real-world tasks, examining how camera calibration, pose estimation error, and spatial generalisation affect task success rates.
Abstract
In this paper, we study imitation learning under the challenging setting of: (1) only a single demonstration, (2) no further data collection, and (3) no prior task or object knowledge. We show how, with these constraints, imitation learning can be formulated as a combination of trajectory transfer and unseen object pose estimation. To explore this idea, we provide an in-depth study on how state-of-the-art unseen object pose estimators perform for one-shot imitation learning on ten real-world tasks, and we take a deep dive into the effects that camera calibration, pose estimation error, and spatial generalisation have on task success rates. For videos, please visit https://www.robot-learning.uk/pose-estimation-perspective.
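To illustrate the trajectory-transfer half of this formulation, the sketch below replays a demonstrated end-effector trajectory after the object has moved, using the estimated object poses at demonstration and test time. Poses are 4x4 homogeneous matrices and the numeric values are illustrative; this is the generic geometry, not the paper's specific pipeline.

```python
# Sketch of trajectory transfer via pose estimation: map a demonstrated
# end-effector trajectory through the relative object transform between the
# demonstration and the test scene. All poses are 4x4 homogeneous matrices.
import numpy as np

def transfer_trajectory(traj_world, obj_pose_demo, obj_pose_test):
    """Re-express a demonstrated trajectory relative to the object's new pose.

    traj_world: (N, 4, 4) end-effector poses recorded during the demonstration.
    obj_pose_demo, obj_pose_test: (4, 4) object poses at demo and test time.
    """
    rel = obj_pose_test @ np.linalg.inv(obj_pose_demo)  # demo frame -> test frame
    return np.einsum('ij,njk->nik', rel, traj_world)

def pose(x, y, yaw):
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 3] = [x, y]
    return T

demo_traj = np.stack([pose(0.3 + 0.01 * i, 0.0, 0.0) for i in range(10)])
obj_demo = pose(0.4, 0.0, 0.0)
obj_test = pose(0.6, 0.2, np.pi / 6)   # object moved and rotated at test time

new_traj = transfer_trajectory(demo_traj, obj_demo, obj_test)
print("first waypoint moves from", demo_traj[0, :2, 3], "to", new_traj[0, :2, 3].round(3))
```

In this view, the quality of the transferred trajectory depends directly on how accurately obj_pose_test is estimated for the unseen object, which is why the paper studies pose estimation error and camera calibration as the dominant factors in task success.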
Exploring Fairness in Pre-trained Visual Transformer based Natural and GAN Generated Image Detection Systems and Understanding the Impact of Image Compression in Fairness
paper_authors: Manjary P. Gangan, Anoop Kadan, Lajish V L
For: This paper aims to explore fairness in transformer-based image forensic algorithms and evaluate their bias in various domains, including gender, racial, affective, and intersectional.
Methods: The study uses a bias evaluation corpus to analyze bias in the algorithms and employs a wide set of individual and pairwise bias evaluation measures. Additionally, the study examines the impact of image compression on model bias using a two-phase evaluation setting.
Results: The paper explores the bias in transformer-based image forensic algorithms and evaluates their fairness in different domains, providing insights into the potential biases in these algorithms and the impact of image compression on their fairness.
Abstract
It is not sufficient merely to construct computational models that can accurately distinguish fake images from real images taken by a camera; it is also important to assess whether these computational models are fair or produce biased outcomes that can eventually harm certain social groups or cause serious security threats. Exploring fairness in forensic algorithms is an initial step towards correcting these biases. Since visual transformers are recently being widely used in most image classification based tasks due to their capability to produce high accuracies, this study tries to explore bias in the transformer based image forensic algorithms that classify natural and GAN generated images. By procuring a bias evaluation corpora, this study analyzes bias in gender, racial, affective, and intersectional domains using a wide set of individual and pairwise bias evaluation measures. As the generalizability of the algorithms against image compression is an important factor to be considered in forensic tasks, this study also analyzes the role of image compression on model bias. Hence to study the impact of image compression on model bias, a two phase evaluation setting is followed, where a set of experiments is carried out in the uncompressed evaluation setting and the other in the compressed evaluation setting.
On the use of Vision-Language models for Visual Sentiment Analysis: a study on CLIP
results: CLIP-E outperforms state-of-the-art models on fine-grained categorization on WEBEmo and generalizes better when tested on other Visual Sentiment Analysis benchmarks.
Abstract
This work presents a study on how to exploit the CLIP embedding space to perform Visual Sentiment Analysis. We experiment with two architectures built on top of the CLIP embedding space, which we denote by CLIP-E. We train the CLIP-E models with WEBEmo, the largest publicly available and manually labeled benchmark for Visual Sentiment Analysis, and perform two sets of experiments. First, we test on WEBEmo and compare the CLIP-E architectures with state-of-the-art (SOTA) models and with CLIP Zero-Shot. Second, we perform cross dataset evaluation, and test the CLIP-E architectures trained with WEBEmo on other Visual Sentiment Analysis benchmarks. Our results show that the CLIP-E approaches outperform SOTA models in WEBEmo fine grained categorization, and they also generalize better when tested on datasets that have not been seen during training. Interestingly, we observed that for the FI dataset, CLIP Zero-Shot produces better accuracies than SOTA models and CLIP-E trained on WEBEmo. These results motivate several questions that we discuss in this paper, such as how we should design new benchmarks and evaluate Visual Sentiment Analysis, and whether we should keep designing tailored Deep Learning models for Visual Sentiment Analysis or focus our efforts on better using the knowledge encoded in large vision-language models such as CLIP for this task.
Robust Class-Conditional Distribution Alignment for Partial Domain Adaptation
results: Experimental results and ablation analysis show that the proposed model outperforms benchmark methods.
Abstract
Unwanted samples from private source categories in the learning objective of a partial domain adaptation setup can lead to negative transfer and reduce classification performance. Existing methods, such as re-weighting or aggregating target predictions, are vulnerable to this issue, especially during initial training stages, and do not adequately address class-level feature alignment. Our proposed approach seeks to overcome these limitations by delving deeper than just the first-order moments to derive distinct and compact categorical distributions. We employ objectives that optimize the intra and inter-class distributions in a domain-invariant fashion and design a robust pseudo-labeling for efficient target supervision. Our approach incorporates a complement entropy objective module to reduce classification uncertainty and flatten incorrect category predictions. The experimental findings and ablation analysis of the proposed modules demonstrate the superior performance of our proposed model compared to benchmarks.
Exploring Decision-based Black-box Attacks on Face Forgery Detection
results: Achieves state-of-the-art attack performance on FaceForensics++, CelebDF, and industrial APIs with high query efficiency; the resulting fake faces also pass face forgery detection and face recognition, exposing the security vulnerabilities of face forgery detectors.
Abstract
Face forgery generation technologies generate vivid faces, which have raised public concerns about security and privacy. Many intelligent systems, such as electronic payment and identity verification, rely on face forgery detection. Although face forgery detection has successfully distinguished fake faces, recent studies have demonstrated that face forgery detectors are very vulnerable to adversarial examples. Meanwhile, existing attacks rely on network architectures or training datasets instead of the predicted labels, which leads to a gap in attacking deployed applications. To narrow this gap, we first explore the decision-based attacks on face forgery detection. However, applying existing decision-based attacks directly suffers from perturbation initialization failure and low image quality. First, we propose cross-task perturbation to handle initialization failures by utilizing the high correlation of face features on different tasks. Then, inspired by using frequency cues by face forgery detection, we propose the frequency decision-based attack. We add perturbations in the frequency domain and then constrain the visual quality in the spatial domain. Finally, extensive experiments demonstrate that our method achieves state-of-the-art attack performance on FaceForensics++, CelebDF, and industrial APIs, with high query efficiency and guaranteed image quality. Further, the fake faces by our method can pass face forgery detection and face recognition, which exposes the security problems of face forgery detectors.
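A simplified sketch of the core operation is given below: a perturbation is added in the frequency domain and the result is constrained in the spatial domain so the visual change stays bounded. The masking and noise-scaling choices are assumptions; the full attack additionally performs a decision-based search over such perturbations using only the detector's predicted labels.

```python
# Sketch of a frequency-domain perturbation with a spatial-domain quality
# constraint. The mask and scaling are illustrative; the actual attack searches
# over such perturbations in a decision-based (label-only) setting.
import numpy as np

def frequency_perturb(image, noise_scale=0.02, low_freq_keep=8, eps=8 / 255):
    """image: HxW float array in [0, 1]. Returns a perturbed image in [0, 1]."""
    spec = np.fft.fft2(image)
    h, w = image.shape
    # roughly spare the lowest frequencies to preserve coarse appearance
    mask = np.ones((h, w))
    mask[:low_freq_keep, :low_freq_keep] = 0
    noise = noise_scale * (np.random.randn(h, w) + 1j * np.random.randn(h, w))
    perturbed = np.real(np.fft.ifft2(spec + mask * noise * np.abs(spec)))
    # constrain visual quality in the spatial domain
    perturbed = np.clip(perturbed, image - eps, image + eps)
    return np.clip(perturbed, 0.0, 1.0)

img = np.random.rand(64, 64)
adv = frequency_perturb(img)
print("max pixel change:", np.abs(adv - img).max())
```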
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
results: The generated animated videos exhibit natural motion and high fidelity to the input image. Comparative evaluation shows a remarkable advantage over existing competitors.
Abstract
Enhancing a still image with motion offers more engaged visual experience. Traditional image animation techniques mainly focus on animating natural scenes with random dynamics, such as clouds and fluid, and thus limits their applicability to generic visual contents. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, we first project it into a text-aligned rich image embedding space using a learnable image encoding network, which facilitates the video model to digest the image content compatibly. However, some visual details still struggle to be preserved in the resulting videos. To supplement more precise image information, we further feed the full image to the diffusion model by concatenating it with the initial noises. Experimental results reveal that our proposed method produces visually convincing animated videos, exhibiting both natural motions and high fidelity to the input image. Comparative evaluation demonstrates the notable superiority of our approach over existing competitors. The source code will be released upon publication.
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach
results: Improves performance on the widely explored 4x blind super-resolution benchmarks and extends to large magnification factors, i.e., 8x image SR benchmarks.
Abstract
The recent use of diffusion prior, enhanced by pre-trained text-image models, has markedly elevated the performance of image super-resolution (SR). To alleviate the huge computational cost required by pixel-based diffusion SR, latent-based methods utilize a feature encoder to transform the image and then implement the SR image generation in a compact latent space. Nevertheless, there are two major issues that limit the performance of latent-based diffusion. First, the compression of latent space usually causes reconstruction distortion. Second, huge computational cost constrains the parameter scale of the diffusion model. To counteract these issues, we first propose a frequency compensation module that enhances the frequency components from latent space to pixel space. The reconstruction distortion (especially for high-frequency information) can be significantly decreased. Then, we propose to use Sample-Space Mixture of Experts (SS-MoE) to achieve more powerful latent-based SR, which steadily improves the capacity of the model without a significant increase in inference costs. These carefully crafted designs contribute to performance improvements in largely explored 4x blind super-resolution benchmarks and extend to large magnification factors, i.e., 8x image SR benchmarks. The code is available at https://github.com/amandaluof/moe_sr.
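For intuition about why a mixture-of-experts layer can raise capacity without a matching rise in inference cost, a generic top-1-routed MoE block is sketched below. This is not the paper's Sample-Space MoE design; it only shows that adding experts adds parameters while each sample still passes through a single expert.

```python
# Generic top-1 mixture-of-experts block: a gate routes each sample to one
# expert, so capacity grows with the number of experts while per-sample compute
# stays roughly constant. Illustrative only; not the paper's SS-MoE.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                           nn.Linear(dim * 2, dim)) for _ in range(num_experts)])

    def forward(self, x):                      # x: (batch, dim)
        scores = self.gate(x)                  # (batch, num_experts)
        idx = scores.argmax(dim=-1)            # hard top-1 routing
        weight = torch.softmax(scores, dim=-1).gather(1, idx[:, None])
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():
                out[sel] = expert(x[sel])      # only the routed expert runs
        return out * weight                    # scale by gate probability

moe = Top1MoE()
y = moe(torch.randn(8, 64))
print(y.shape)
```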
results: Experiments empirically verify that BFNs can successfully model non-stationary data while maintaining strong generative capability across different data.
Abstract
Bayesian Flow Networks (BFNs) have recently been proposed as one of the most promising directions for universal generative modelling, with the ability to learn any data type. Their power comes from the expressiveness of neural networks and Bayesian inference, which make them suitable in the context of continual learning. We delve into the mechanics behind BFNs and conduct experiments to empirically verify their generative capabilities on non-stationary data.
Multi-modal Medical Neurological Image Fusion using Wavelet Pooled Edge Preserving Autoencoder
results: Experiments show that the proposed method provides improved visual and quantitative results compared with other state-of-the-art fusion methods.
Abstract
Medical image fusion integrates the complementary diagnostic information of the source image modalities for improved visualization and analysis of underlying anomalies. Recently, deep learning-based models have excelled the conventional fusion methods by executing feature extraction, feature selection, and feature fusion tasks, simultaneously. However, most of the existing convolutional neural network (CNN) architectures use conventional pooling or strided convolutional strategies to downsample the feature maps. It causes the blurring or loss of important diagnostic information and edge details available in the source images and dilutes the efficacy of the feature extraction process. Therefore, this paper presents an end-to-end unsupervised fusion model for multimodal medical images based on an edge-preserving dense autoencoder network. In the proposed model, feature extraction is improved by using wavelet decomposition-based attention pooling of feature maps. This helps in preserving the fine edge detail information present in both the source images and enhances the visual perception of fused images. Further, the proposed model is trained on a variety of medical image pairs which helps in capturing the intensity distributions of the source images and preserves the diagnostic information effectively. Substantial experiments are conducted which demonstrate that the proposed method provides improved visual and quantitative results as compared to the other state-of-the-art fusion methods.
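A small sketch of wavelet-based pooling on a single feature map is given below: a 2-D DWT halves the spatial resolution, and a softmax attention over the detail-band energies folds edge information back into the pooled output. The attention form is an assumption made for illustration; the paper's module may differ.

```python
# Sketch of wavelet-decomposition-based attention pooling: the DWT separates
# the approximation (LL) band from the edge-carrying detail bands (LH, HL, HH),
# and an attention over detail-band energies preserves edge detail in pooling.
import numpy as np
import pywt

def wavelet_attention_pool(feature_map):
    """feature_map: HxW array. Returns an (H/2)x(W/2) pooled map."""
    ll, (lh, hl, hh) = pywt.dwt2(feature_map, 'haar')
    details = np.stack([lh, hl, hh])                      # (3, H/2, W/2)
    energies = np.array([np.abs(d).mean() for d in details])
    attn = np.exp(energies) / np.exp(energies).sum()      # softmax over bands
    return ll + (attn[:, None, None] * details).sum(axis=0)

fm = np.random.rand(32, 32)
pooled = wavelet_attention_pool(fm)
print(fm.shape, '->', pooled.shape)
```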
A New Multimodal Medical Image Fusion based on Laplacian Autoencoder with Channel Attention
results: The proposed multimodal medical image fusion model, based on integrated Laplacian-Gaussian concatenation with attention pooling (LGCA), effectively preserves complementary information and important tissue structures.
Abstract
Medical image fusion combines the complementary information of multimodal medical images to assist medical professionals in the clinical diagnosis of patients' disorders and provide guidance during preoperative and intra-operative procedures. Deep learning (DL) models have achieved end-to-end image fusion with highly robust and accurate fusion performance. However, most DL-based fusion models perform down-sampling on the input images to minimize the number of learnable parameters and computations. During this process, salient features of the source images become irretrievable, leading to the loss of crucial diagnostic edge details and contrast of various brain tissues. In this paper, we propose a new multimodal medical image fusion model based on integrated Laplacian-Gaussian concatenation with attention pooling (LGCA). We prove that our model effectively preserves complementary information and important tissue structures.
IRAD: Implicit Representation-driven Image Resampling against Adversarial Attacks
For: This paper proposes a novel approach to defending against adversarial attacks, specifically image resampling, which transforms a discrete image into a new one to alleviate the influence of adversarial perturbations while preserving essential semantic information.
Methods: The paper presents basic resampling methods that employ interpolation strategies and coordinate shifting magnitudes, as well as an improved approach called implicit representation-driven image resampling (IRAD) that constructs an implicit continuous representation of input images and automatically generates pixel-wise shifts for resampling.
Results: The paper demonstrates that the proposed method significantly enhances the adversarial robustness of diverse deep models against various attacks while maintaining high accuracy on clean images, and outperforms existing defense methods in terms of accuracy and computational efficiency.
Abstract
We introduce a novel approach to counter adversarial attacks, namely, image resampling. Image resampling transforms a discrete image into a new one, simulating the process of scene recapturing or rerendering as specified by a geometrical transformation. The underlying rationale behind our idea is that image resampling can alleviate the influence of adversarial perturbations while preserving essential semantic information, thereby conferring an inherent advantage in defending against adversarial attacks. To validate this concept, we present a comprehensive study on leveraging image resampling to defend against adversarial attacks. We have developed basic resampling methods that employ interpolation strategies and coordinate shifting magnitudes. Our analysis reveals that these basic methods can partially mitigate adversarial attacks. However, they come with apparent limitations: the accuracy of clean images noticeably decreases, while the improvement in accuracy on adversarial examples is not substantial. We propose implicit representation-driven image resampling (IRAD) to overcome these limitations. First, we construct an implicit continuous representation that enables us to represent any input image within a continuous coordinate space. Second, we introduce SampleNet, which automatically generates pixel-wise shifts for resampling in response to different inputs. Furthermore, we can extend our approach to the state-of-the-art diffusion-based method, accelerating it with fewer time steps while preserving its defense capability. Extensive experiments demonstrate that our method significantly enhances the adversarial robustness of diverse deep models against various attacks while maintaining high accuracy on clean images.
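The basic resampling operation the defense builds on can be sketched as follows: small pixel-wise coordinate shifts are applied to a sampling grid and the image is re-interpolated bilinearly. In IRAD the shifts are produced per input by SampleNet on top of an implicit continuous representation; here they are simply random, to show the mechanics.

```python
# Sketch of the basic image resampling operation: perturb a normalized sampling
# grid with small pixel-wise shifts and re-interpolate the image. In IRAD the
# shifts would be predicted per input rather than drawn at random.
import torch
import torch.nn.functional as F

def resample(image, max_shift=0.01):
    """image: (B, C, H, W) in [0, 1]; max_shift is in normalized coordinates."""
    b, _, h, w = image.shape
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).repeat(b, 1, 1, 1)  # (B,H,W,2)
    shifts = (torch.rand_like(grid) * 2 - 1) * max_shift      # pixel-wise shifts
    return F.grid_sample(image, grid + shifts, mode='bilinear',
                         padding_mode='border', align_corners=True)

img = torch.rand(1, 3, 32, 32)
out = resample(img)
print(out.shape, float((out - img).abs().mean()))
```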
A Comparative Study of Image Restoration Networks for General Backbone Network Design
results: Experimental results show that the newly proposed general image restoration backbone, X-Restormer, possesses good task generality and achieves high performance across a variety of image restoration tasks.
Abstract
Despite the significant progress made by deep models in various image restoration tasks, existing image restoration networks still face challenges in terms of task generality. An intuitive manifestation is that networks which excel in certain tasks often fail to deliver satisfactory results in others. To illustrate this point, we select five representative image restoration networks and conduct a comparative study on five classic image restoration tasks. First, we provide a detailed explanation of the characteristics of different image restoration tasks and backbone networks. Following this, we present the benchmark results and analyze the reasons behind the performance disparity of different models across various tasks. Drawing from this comparative study, we propose that a general image restoration backbone network needs to meet the functional requirements of diverse tasks. Based on this principle, we design a new general image restoration backbone network, X-Restormer. Extensive experiments demonstrate that X-Restormer possesses good task generality and achieves state-of-the-art performance across a variety of tasks.
Fractional Concepts in Neural Networks: Enhancing Activation and Loss Functions
results: The study indicates that, by tuning the fractional derivative order, neurons can adjust their activation functions to better match input data and reduce output errors, potentially improving the network's overall performance.
Abstract
The paper presents a method for using fractional concepts in a neural network to modify the activation and loss functions. The methodology allows the neural network to define and optimize its activation functions by determining the fractional derivative order of the training process as an additional hyperparameter. This will enable neurons in the network to adjust their activation functions to match input data better and reduce output errors, potentially improving the network's overall performance.
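For reference, one standard fractional derivative (the Caputo form) is reproduced below; the paper treats the order as an additional hyperparameter of the training process. The specific operator used there may instead be the Riemann-Liouville or Grünwald-Letnikov definition.

```latex
% One standard definition of a fractional derivative (the Caputo form), shown
% for reference; the paper may adopt a different operator.
\[
  {}^{C}\!D^{\alpha} f(t)
  = \frac{1}{\Gamma(n-\alpha)}
    \int_{0}^{t} \frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha-n+1}}\, d\tau,
  \qquad n-1 < \alpha < n,\; n \in \mathbb{N}
\]
```

In this view the order \(\alpha\) becomes an additional tunable hyperparameter of the activation and loss functions, with \(\alpha = 1\) recovering the ordinary first derivative used in standard training.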
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now
paper_authors: Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu for:这篇论文主要关注如何评估安全驱动的去学习扩散模型(DMs)是否真正能够删除不想要的概念、风格和物体。methods:本研究使用对抗攻击(也称为对抗提示)来评估安全驱动的去学习扩散模型(DMs)删除不想要的概念、风格和物体的能力。我们开发了一种名为UnlearnDiff的新型对抗学习方法,利用扩散模型固有的分类能力来简化对抗提示的生成,使其对生成模型的攻击如同图像分类攻击一样简单。results:我们对五种常见的安全驱动去学习扩散模型(DMs)进行了多项测试,评估它们在删除不想要的概念、风格和物体方面的(最坏情况)鲁棒性和效率。结果显示,UnlearnDiff 比现有的对抗提示方法更有效且更高效。代码可以在 https://github.com/OPTML-Group/Diffusion-MU-Attack 获取。请注意,这篇论文可能包含具有冒犯性的模型输出。Abstract
The recent advances in diffusion models (DMs) have revolutionized the generation of complex and diverse images. However, these models also introduce potential safety hazards, such as the production of harmful content and infringement of data copyrights. Although there have been efforts to create safety-driven unlearning methods to counteract these challenges, doubts remain about their capabilities. To bridge this uncertainty, we propose an evaluation framework built upon adversarial attacks (also referred to as adversarial prompts), in order to discern the trustworthiness of these safety-driven unlearned DMs. Specifically, our research explores the (worst-case) robustness of unlearned DMs in eradicating unwanted concepts, styles, and objects, assessed by the generation of adversarial prompts. We develop a novel adversarial learning approach called UnlearnDiff that leverages the inherent classification capabilities of DMs to streamline the generation of adversarial prompts, making it as simple for DMs as it is for image classification attacks. Through comprehensive benchmarking, we assess the unlearning robustness of five prevalent unlearned DMs across multiple tasks. Our results underscore the effectiveness and efficiency of UnlearnDiff when compared to state-of-the-art adversarial prompting methods. Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper contains model outputs that may be offensive in nature.
摘要
近期扩散模型(DM)的进步彻底革新了复杂多样图像的生成。然而,这些模型也带来了潜在的安全隐患,例如生成有害内容和侵犯数据版权。尽管已有工作尝试通过安全驱动的去学习方法来应对这些挑战,但其实际能力仍存疑问。为了消除这种不确定性,我们提出了一个基于对抗攻击(也称对抗提示)的评估框架,用于鉴别这些安全驱动去学习扩散模型的可信度。具体而言,我们的研究通过生成对抗提示,考察去学习扩散模型在消除不想要的概念、风格和物体方面的(最坏情况)鲁棒性。我们开发了一种名为UnlearnDiff的新型对抗学习方法,它利用扩散模型固有的分类能力来简化对抗提示的生成,使其对生成模型的攻击如同图像分类攻击一样简单。通过全面的基准测试,我们评估了五种常见的去学习扩散模型在多项任务上的去学习鲁棒性。结果表明,与最先进的对抗提示方法相比,UnlearnDiff更有效且更高效。代码可以在 https://github.com/OPTML-Group/Diffusion-MU-Attack 中找到。注意:本文可能包含具有冒犯性的模型输出。
Evaluating the Fairness of Discriminative Foundation Models in Computer Vision
results: 研究发现,fair PCA方法在大多数任务中能够很好地减轻偏见,但不同的减轻方法在不同任务中的效果不同。因此,应根据具体的使用场景选择合适的减轻方法。Abstract
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Pretraining (CLIP), that are used for labeling tasks. We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy. Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning. We categorize desired behaviors based around three axes: (i) if the task concerns humans; (ii) how subjective the task is (i.e., how likely it is that people from a diverse range of backgrounds would agree on a labeling); and (iii) the intended purpose of the task and if fairness is better served by impartiality (i.e., making decisions independent of the protected attributes) or representation (i.e., making decisions to maximize diversity). Finally, we provide quantitative fairness evaluations for both binary-valued and multi-valued protected attributes over ten diverse datasets. We find that fair PCA, a post-processing method for fair representations, works very well for debiasing in most of the aforementioned tasks while incurring only minor loss of performance. However, different debiasing approaches vary in their effectiveness depending on the task. Hence, one should choose the debiasing approach depending on the specific use case.
摘要
我们提出了一种新的分类法,用于评估用于标注任务的判别式基础模型(如对比语言预训练模型CLIP)中的偏见。随后,我们以该分类法为基础,系统地评估了现有的偏见缓解方法。具体来说,我们评估了OpenAI的CLIP和OpenCLIP模型在零样本分类、图像检索和图像描述等关键应用中的表现。我们围绕三个维度对期望行为进行分类:(i)任务是否涉及人;(ii)任务的主观程度(即来自不同背景的人们对标注达成一致的可能性);(iii)任务的预期用途,以及公平性更适合通过中立(即决策不依赖受保护属性)还是代表性(即决策以最大化多样性为目标)来实现。最后,我们在十个多样化的数据集上,对二值和多值受保护属性进行了量化的公平性评估。我们发现,fair PCA这一面向公平表示的后处理方法在上述大多数任务中都能很好地消除偏见,且仅带来轻微的性能损失。然而,不同的去偏方法在不同任务中的效果各异,因此应根据具体的使用场景选择去偏方法。
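A minimal sketch of projection-based post-processing debiasing in the spirit of fair PCA is shown below: the direction separating two protected groups is estimated from group means and projected out of the embeddings. This is a simplified stand-in, not the exact fair PCA algorithm evaluated in the paper.

```python
# Simplified sketch of post-processing debiasing on frozen embeddings: remove the
# direction most correlated with a binary protected attribute. Not the paper's method.
import numpy as np

def debias_embeddings(z: np.ndarray, a: np.ndarray) -> np.ndarray:
    """z: (n, d) embeddings; a: (n,) binary protected attribute (0/1)."""
    # Direction along which the two protected groups differ most on average.
    v = z[a == 1].mean(axis=0) - z[a == 0].mean(axis=0)
    v = v / np.linalg.norm(v)
    # Project every embedding onto the orthogonal complement of that direction.
    return z - np.outer(z @ v, v)

# Usage with random data standing in for CLIP image/text features.
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 512))
a = rng.integers(0, 2, size=100)
z_fair = debias_embeddings(z, a)
```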
VQ-NeRF: Neural Reflectance Decomposition and Editing with Vector Quantization
For: The paper proposes a novel neural network model called VQ-NeRF that enables discrete material editing in 3D scenes.* Methods: The model consists of two branches: a continuous branch that predicts decomposed materials, and a discrete branch that uses vector quantization to quantize continuous materials into individual ones. The model also employs a dropout-based VQ codeword ranking strategy to predict the number of materials in a scene.* Results: The proposed model demonstrates superior performance in material segmentation and editing, and is evaluated on both computer-generated and real-world scenes. Additionally, the model provides an interactive interface for material editing, making it more user-friendly.Abstract
We propose VQ-NeRF, a two-branch neural network model that incorporates Vector Quantization (VQ) to decompose and edit reflectance fields in 3D scenes. Conventional neural reflectance fields use only continuous representations to model 3D scenes, despite the fact that objects are typically composed of discrete materials in reality. This lack of discretization can result in noisy material decomposition and complicated material editing. To address these limitations, our model consists of a continuous branch and a discrete branch. The continuous branch follows the conventional pipeline to predict decomposed materials, while the discrete branch uses the VQ mechanism to quantize continuous materials into individual ones. By discretizing the materials, our model can reduce noise in the decomposition process and generate a segmentation map of discrete materials. Specific materials can be easily selected for further editing by clicking on the corresponding area of the segmentation outcomes. Additionally, we propose a dropout-based VQ codeword ranking strategy to predict the number of materials in a scene, which reduces redundancy in the material segmentation process. To improve usability, we also develop an interactive interface to further assist material editing. We evaluate our model on both computer-generated and real-world scenes, demonstrating its superior performance. To the best of our knowledge, our model is the first to enable discrete material editing in 3D scenes.
摘要
我们提出VQ-NeRF,一种双分支神经网络模型,通过引入向量量化(VQ)来分解和编辑3D场景中的反射场。传统的神经反射场仅使用连续表示来建模3D场景,而现实中物体通常由离散的材质组成。缺乏离散化会导致材质分解结果含有噪声,并使材质编辑变得复杂。为解决这些局限,我们的模型包含一个连续分支和一个离散分支:连续分支遵循传统流程预测分解后的材质,而离散分支使用VQ机制将连续材质量化为一个个独立的材质。通过对材质进行离散化,我们的模型可以减少分解过程中的噪声,并生成离散材质的分割图。用户只需点击分割结果中相应的区域,即可方便地选取特定材质进行进一步编辑。此外,我们提出了一种基于Dropout的VQ码字排序策略来预测场景中的材质数量,从而减少材质分割过程中的冗余。为提高可用性,我们还开发了一个交互式界面以辅助材质编辑。我们在计算机生成场景和真实场景上评估了我们的模型,并证明了其更优的性能。据我们所知,我们的模型是第一个在3D场景中实现离散材质编辑的模型。
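The discrete branch relies on a standard vector-quantization step. Below is a generic sketch of such a quantizer (nearest codebook entry, commitment loss, straight-through gradient); codebook size and feature dimension are placeholders, and this is not the authors' code.

```python
# Generic vector-quantization step of the kind a discrete material branch can use:
# each continuous material feature is snapped to its nearest codebook entry, with a
# straight-through estimator so gradients can still flow to the encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 8, dim: int = 16, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (n, dim) continuous material features; find nearest codeword per row.
        dists = torch.cdist(z, self.codebook.weight)           # (n, num_codes)
        idx = dists.argmin(dim=1)                               # discrete material id
        z_q = self.codebook(idx)
        # Codebook + commitment losses, straight-through gradient to the encoder.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

# Usage: quantized features drive rendering; `idx` doubles as a material segmentation label.
vq = VectorQuantizer()
z_q, material_id, vq_loss = vq(torch.randn(1024, 16))
```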
Learning to Generate Parameters of ConvNets for Unseen Image Data
results: 经过广泛的实验证明,提出的 PudNet 模型可以快速预测 ConvNet 的参数,并且可以在不同的数据集上保持比较高的性能。Abstract
Typical Convolutional Neural Networks (ConvNets) depend heavily on large amounts of image data and resort to an iterative optimization algorithm (e.g., SGD or Adam) to learn network parameters, which makes training very time- and resource-intensive. In this paper, we propose a new training paradigm and formulate the parameter learning of ConvNets into a prediction task: given a ConvNet architecture, we observe there exists correlations between image datasets and their corresponding optimal network parameters, and explore if we can learn a hyper-mapping between them to capture the relations, such that we can directly predict the parameters of the network for an image dataset never seen during the training phase. To do this, we put forward a new hypernetwork based model, called PudNet, which intends to learn a mapping between datasets and their corresponding network parameters, and then predicts parameters for unseen data with only a single forward propagation. Moreover, our model benefits from a series of adaptive hyper recurrent units sharing weights to capture the dependencies of parameters among different network layers. Extensive experiments demonstrate that our proposed method achieves good efficacy for unseen image datasets on two kinds of settings: Intra-dataset prediction and Inter-dataset prediction. Our PudNet can also well scale up to large-scale datasets, e.g., ImageNet-1K. It takes 8967 GPU seconds to train ResNet-18 on the ImageNet-1K using GC from scratch and obtain a top-5 accuracy of 44.65 %. However, our PudNet costs only 3.89 GPU seconds to predict the network parameters of ResNet-18 achieving comparable performance (44.92 %), more than 2,300 times faster than the traditional training paradigm.
摘要
传统的卷积神经网络(ConvNet)严重依赖大量图像数据,并需要借助迭代优化算法(如SGD或Adam)来学习网络参数,这使得训练非常耗时且消耗资源。在这篇论文中,我们提出了一种新的训练范式,将ConvNet的参数学习表述为预测任务:给定一个ConvNet架构,我们观察到图像数据集与其对应的最优网络参数之间存在相关性,并探索能否学习二者之间的超映射来捕捉这种关系,从而对训练阶段从未见过的图像数据集直接预测网络参数。为此,我们提出了一种新的基于超网络的模型PudNet,其目标是学习数据集与对应网络参数之间的映射,并只需一次前向传播即可为未见数据预测参数。此外,我们的模型还包含一系列共享权重的自适应超循环单元,用于捕捉不同网络层参数之间的依赖关系。大量实验表明,我们提出的方法在数据集内预测和数据集间预测两种设置下对未见图像数据集都能取得良好效果。PudNet还能很好地扩展到大规模数据集(如ImageNet-1K):在ImageNet-1K上从头训练ResNet-18需要8967个GPU秒,top-5准确率为44.65%;而PudNet只需3.89个GPU秒即可预测ResNet-18的参数,并达到相当的性能(44.92%),比传统训练范式快2,300多倍。
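The core idea, predicting a network's parameters from a dataset summary in one forward pass, can be sketched with a small hypernetwork that emits the weights of a conv layer and applies them functionally. All shapes and the way the dataset embedding is obtained are illustrative assumptions, not the paper's PudNet architecture.

```python
# Minimal hypernetwork sketch: a dataset summary vector is mapped to the weights of
# a small conv layer, which are then applied functionally to input images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvHyperNet(nn.Module):
    def __init__(self, ctx_dim: int = 128, out_ch: int = 8, in_ch: int = 3, k: int = 3):
        super().__init__()
        self.shape = (out_ch, in_ch, k, k)
        self.w_numel = out_ch * in_ch * k * k
        self.generator = nn.Sequential(nn.Linear(ctx_dim, 256), nn.ReLU(),
                                       nn.Linear(256, self.w_numel + out_ch))

    def forward(self, dataset_embedding: torch.Tensor, images: torch.Tensor):
        # One forward pass predicts the target ConvNet's parameters for this dataset.
        params = self.generator(dataset_embedding)
        weight = params[:self.w_numel].view(self.shape)
        bias = params[self.w_numel:]
        return F.conv2d(images, weight, bias, padding=1)

# Usage: the dataset embedding could come from averaging features of a few support images.
hyper = ConvHyperNet()
ctx = torch.randn(128)                 # stand-in for a learned dataset summary
out = hyper(ctx, torch.randn(4, 3, 32, 32))
```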
Revisiting Transferable Adversarial Image Examples: Attack Categorization, Evaluation Guidelines, and New Insights
For: This paper aims to address the critical security concerns of transferable adversarial examples in real-world black-box attack scenarios by identifying two main problems in common evaluation practices and proposing new evaluation guidelines.* Methods: The paper proposes a novel attack categorization strategy and conducts systematic and fair intra-category analyses on transferability, as well as considering diverse imperceptibility metrics and finer-grained stealthiness characteristics from the perspective of attack traceback.* Results: The paper provides the first large-scale evaluation of transferable adversarial examples on ImageNet, involving 23 representative attacks against 9 representative defenses, and leads to new insights such as the superiority of an early attack method, the vulnerability of a state-of-the-art defense, and the negative correlation between stealthiness and transferability.Abstract
Transferable adversarial examples raise critical security concerns in real-world, black-box attack scenarios. However, in this work, we identify two main problems in common evaluation practices: (1) For attack transferability, lack of systematic, one-to-one attack comparison and fair hyperparameter settings. (2) For attack stealthiness, simply no comparisons. To address these problems, we establish new evaluation guidelines by (1) proposing a novel attack categorization strategy and conducting systematic and fair intra-category analyses on transferability, and (2) considering diverse imperceptibility metrics and finer-grained stealthiness characteristics from the perspective of attack traceback. To this end, we provide the first large-scale evaluation of transferable adversarial examples on ImageNet, involving 23 representative attacks against 9 representative defenses. Our evaluation leads to a number of new insights, including consensus-challenging ones: (1) Under a fair attack hyperparameter setting, one early attack method, DI, actually outperforms all the follow-up methods. (2) A state-of-the-art defense, DiffPure, actually gives a false sense of (white-box) security since it is indeed largely bypassed by our (black-box) transferable attacks. (3) Even when all attacks are bounded by the same $L_p$ norm, they lead to dramatically different stealthiness performance, which negatively correlates with their transferability performance. Overall, our work demonstrates that existing problematic evaluations have indeed caused misleading conclusions and missing points, and as a result, hindered the assessment of the actual progress in this field.
摘要
在这项工作中,我们指出了常见评估实践中的两个主要问题:(1)在攻击可迁移性方面,缺乏系统的一对一攻击比较和公平的超参数设置;(2)在攻击隐蔽性方面,完全缺乏比较。为了解决这些问题,我们建立了新的评估指南:一是提出了一种新的攻击分类策略,并对可迁移性进行系统、公平的类内分析;二是从攻击溯源的角度考虑多种不可感知性指标和更细粒度的隐蔽性特征。为此,我们在ImageNet上进行了首个大规模的可迁移对抗样本评估,涵盖23种代表性攻击和9种代表性防御。我们的评估带来了许多新的发现,其中包括一些挑战共识的结论:(1)在公平的攻击超参数设置下,早期的攻击方法DI实际上优于所有后续方法;(2)最先进的防御方法DiffPure实际上给人一种虚假的(白盒)安全感,因为它在很大程度上被我们的(黑盒)可迁移攻击绕过;(3)即使所有攻击都受限于同一个 $L_p$ 范数,它们的隐蔽性表现也截然不同,且与可迁移性表现呈负相关。总之,我们的工作表明,现有的有问题的评估确实导致了误导性的结论和遗漏,从而阻碍了对该领域实际进展的评估。
Mesh Represented Recycle Learning for 3D Hand Pose and Mesh Estimation
results: 在不增加推理阶段计算负担的情况下,提高了手部姿态估计和手部网格估计的性能。Abstract
In general, hand pose estimation aims to improve the robustness of model performance in real-world scenes. However, it is difficult to enhance the robustness since existing datasets are obtained in restricted environments to annotate 3D information. Although neural networks quantitatively achieve a high estimation accuracy, unsatisfactory results can be observed in visual quality. This discrepancy between quantitative results and their visual qualities remains an open issue in the hand pose representation. To this end, we propose a mesh represented recycle learning strategy for 3D hand pose and mesh estimation which reinforces synthesized hand mesh representation in a training phase. To be specific, a hand pose and mesh estimation model first predicts parametric 3D hand annotations (i.e., 3D keypoint positions and vertices for hand mesh) with real-world hand images in the training phase. Second, synthetic hand images are generated with self-estimated hand mesh representations. After that, the synthetic hand images are fed into the same model again. Thus, the proposed learning strategy simultaneously improves quantitative results and visual qualities by reinforcing synthetic mesh representation. To encourage consistency between the original model output and its recycled one, we propose a self-correlation loss which maximizes the accuracy and reliability of our learning strategy. Consequently, the model effectively conducts self-refinement on hand pose estimation by learning mesh representation from its own output. To demonstrate the effectiveness of our learning strategy, we provide extensive experiments on the FreiHAND dataset. Notably, our learning strategy improves the performance on hand pose and mesh estimation without any extra computational burden during inference.
摘要
通常,手部姿态估计的目标是提高模型在真实场景中的鲁棒性。然而,由于现有数据集是在受限环境中采集以标注3D信息的,提升鲁棒性十分困难。虽然神经网络在定量指标上能达到较高的估计精度,但其视觉质量往往不尽如人意。这种定量结果与视觉质量之间的差异仍是手部姿态表示中的一个开放问题。为此,我们提出了一种基于网格表示的循环学习策略,用于3D手部姿态与网格估计,在训练阶段强化合成的手部网格表示。具体来说,手部姿态与网格估计模型首先在训练阶段利用真实手部图像预测参数化的3D手部标注(即3D关键点位置和手部网格顶点);其次,利用模型自身估计的手部网格表示生成合成手部图像;随后,这些合成手部图像再次输入同一模型。因此,所提出的学习策略通过强化合成网格表示,同时提升了定量结果和视觉质量。为了鼓励原始模型输出与其循环输出之间的一致性,我们提出了自相关损失,以最大化该学习策略的准确性和可靠性。这样,模型通过从自身输出中学习网格表示,有效地对手部姿态估计进行自我精炼。为了证明该学习策略的有效性,我们在FreiHAND数据集上进行了大量实验。值得注意的是,我们的学习策略在不增加推理阶段任何额外计算负担的情况下提升了手部姿态和网格估计的性能。
paper_authors: Xudong Gao, Xiao Guang Gao, Jia Rong, Xiaowei Chen, Xiang Liao, Jun Chen for: This paper aims to address the challenges of multi-label classification (MLC) in image recognition, specifically when objects within the visual field occlude one another.methods: The paper introduces a pioneering integrated network framework called HB-net, built upon the foundation of Holistic Bursting (HB) cell clusters, to recognize multiple occluded objects within images. The framework incorporates various Bursting cell cluster structures and an evidence accumulation mechanism.results: The models incorporating the HB framework exhibit a significant $2.98%$ enhancement in recognition accuracy compared to models without the HB framework ($1.0298$ times, $p=0.0499$). Despite having only three convolutional layers and approximately $1/30$ of the parameters, the models that combine the HB framework and EA mechanism achieve a comparable level of accuracy and resilience to ResNet50.
for: 这篇论文旨在解决图像识别中的多标签分类(MLC)挑战,特别是当视场中的物体相互遮挡时,需要同时识别被遮挡和遮挡物体。methods: 这篇论文提出了一种开创性的集成网络框架,名为HB-net,它建立在整体爆发(Holistic Bursting, HB)细胞簇的基础上,用于同时识别图像中的多个被遮挡物体。该框架引入了多种爆发细胞簇结构以及证据积累(EA)机制。results: 包含HB框架的模型相比不含HB框架的模型,识别准确率显著提升 $2.98%$($1.0298$ 倍,$p=0.0499$)。尽管只有三个卷积层和约 $1/30$ 的参数,结合HB框架与EA机制的模型仍能达到与ResNet50相当的准确率和鲁棒性。Abstract
Within the realm of image recognition, a specific category of multi-label classification (MLC) challenges arises when objects within the visual field may occlude one another, demanding simultaneous identification of both occluded and occluding objects. Traditional convolutional neural networks (CNNs) can tackle these challenges; however, those models tend to be bulky and can only attain modest levels of accuracy. Leveraging insights from cutting-edge neural science research, specifically the Holistic Bursting (HB) cell, this paper introduces a pioneering integrated network framework named HB-net. Built upon the foundation of HB cell clusters, HB-net is designed to address the intricate task of simultaneously recognizing multiple occluded objects within images. Various Bursting cell cluster structures are introduced, complemented by an evidence accumulation mechanism. Testing is conducted on multiple datasets comprising digits and letters. The results demonstrate that models incorporating the HB framework exhibit a significant $2.98\%$ enhancement in recognition accuracy compared to models without the HB framework ($1.0298$ times, $p=0.0499$). Although in high-noise settings, standard CNNs exhibit slightly greater robustness when compared to HB-net models, the models that combine the HB framework and EA mechanism achieve a comparable level of accuracy and resilience to ResNet50, despite having only three convolutional layers and approximately $1/30$ of the parameters. The findings of this study offer valuable insights for improving computer vision algorithms. The essential code is provided at https://github.com/d-lab438/hb-net.git.
摘要
在图像识别领域中,当视场内的物体相互遮挡时,会出现一类特殊的多标签分类(MLC)挑战,需要同时识别被遮挡物体和遮挡物体。传统的卷积神经网络(CNN)可以应对这些挑战,但这类模型往往体量庞大,且只能达到中等水平的准确率。借鉴前沿神经科学研究(尤其是整体爆发(HB)细胞)的洞见,本文提出了一种开创性的集成网络框架HB-net。HB-net建立在HB细胞簇的基础上,旨在同时识别图像中的多个被遮挡物体。文中还提出了多种爆发细胞簇结构以及证据积累机制。我们在多个包含数字和字母的数据集上进行了测试。结果表明,包含HB框架的模型相比不含HB框架的模型,识别准确率显著提升$2.98\%$($1.0298$倍,$p=0.0499$)。虽然在高噪声环境下,标准CNN的鲁棒性略优于HB-net模型,但结合HB框架与EA机制的模型即使只有三个卷积层和约$1/30$的参数,也能达到与ResNet50相当的准确率和鲁棒性。本研究的发现为改进计算机视觉算法提供了有价值的启示。相关代码可在 https://github.com/d-lab438/hb-net.git 获取。
ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map
results: 我们在HO-3D和DexYCB数据集上进行了广泛的评估,并证明了我们的方法在手重建方面的表现更高,并且能够生成更真实的物体形状。Abstract
3D reconstruction of hand-object manipulations is important for emulating human actions. Most methods dealing with challenging object manipulation scenarios, focus on hands reconstruction in isolation, ignoring physical and kinematic constraints due to object contact. Some approaches produce more realistic results by jointly reconstructing 3D hand-object interactions. However, they focus on coarse pose estimation or rely upon known hand and object shapes. We propose the first approach for realistic 3D hand-object shape and pose reconstruction from a single depth map. Unlike previous work, our voxel-based reconstruction network regresses the vertex coordinates of a hand and an object and reconstructs more realistic interaction. Our pipeline additionally predicts voxelized hand-object shapes, having a one-to-one mapping to the input voxelized depth. Thereafter, we exploit the graph nature of the hand and object shapes, by utilizing the recent GraFormer network with positional embedding to reconstruct shapes from template meshes. In addition, we show the impact of adding another GraFormer component that refines the reconstructed shapes based on the hand-object interactions and its ability to reconstruct more accurate object shapes. We perform an extensive evaluation on the HO-3D and DexYCB datasets and show that our method outperforms existing approaches in hand reconstruction and produces plausible reconstructions for the objects
摘要
手-物体操作的3D重建对于模拟人类动作十分重要。大多数处理复杂物体操作场景的方法都只关注手部的单独重建,忽略了物体接触带来的物理和运动学约束。一些方法通过联合重建3D手-物体交互来生成更真实的结果,但它们要么只进行粗略的姿态估计,要么依赖已知的手部和物体形状。我们提出了首个从单张深度图中重建真实3D手-物体形状与姿态的方法。与以往工作不同,我们基于体素的重建网络直接回归手和物体的顶点坐标,从而重建出更真实的交互。我们的流程还会预测体素化的手-物体形状,它们与输入的体素化深度图一一对应。随后,我们利用手和物体形状的图结构,采用带位置嵌入的GraFormer网络从模板网格重建形状。此外,我们还展示了引入另一个GraFormer组件的作用:它根据手-物体交互对重建形状进行细化,并能重建出更准确的物体形状。我们在HO-3D和DexYCB数据集上进行了广泛评估,结果表明我们的方法在手部重建方面优于现有方法,并能为物体生成合理的重建结果。
For: 本研究旨在解决场景理解中的分布外(Out-of-Distribution, OOD)对象问题,提升场景理解的性能。* Methods: 本文提出了全景分布外分割(Panoptic Out-of-Distribution Segmentation, PoDS)网络,包括一个共享骨干网络、OOD上下文模块、双对称解码器和任务特定头部。PoDS网络通过我们的对齐-不匹配策略和数据增强策略逐步学习OOD对象,同时保持分布内性能。* Results: 我们在Cityscapes和BDD100K两个基准上进行了广泛评估,证明PoDS网络能够有效解决OOD对象问题,并大幅超越基线。我们还在 http://pods.cs.uni-freiburg.de 公开发布了数据集、代码和训练好的模型。Abstract
Deep learning has led to remarkable strides in scene understanding with panoptic segmentation emerging as a key holistic scene interpretation task. However, the performance of panoptic segmentation is severely impacted in the presence of out-of-distribution (OOD) objects i.e. categories of objects that deviate from the training distribution. To overcome this limitation, we propose Panoptic Out-of Distribution Segmentation for joint pixel-level semantic in-distribution and out-of-distribution classification with instance prediction. We extend two established panoptic segmentation benchmarks, Cityscapes and BDD100K, with out-of-distribution instance segmentation annotations, propose suitable evaluation metrics, and present multiple strong baselines. Importantly, we propose the novel PoDS architecture with a shared backbone, an OOD contextual module for learning global and local OOD object cues, and dual symmetrical decoders with task-specific heads that employ our alignment-mismatch strategy for better OOD generalization. Combined with our data augmentation strategy, this approach facilitates progressive learning of out-of-distribution objects while maintaining in-distribution performance. We perform extensive evaluations that demonstrate that our proposed PoDS network effectively addresses the main challenges and substantially outperforms the baselines. We make the dataset, code, and trained models publicly available at http://pods.cs.uni-freiburg.de.
摘要
深度学习在场景理解方面取得了显著进展,全景分割已成为一项关键的整体场景理解任务。然而,当出现分布外(OOD)对象(即偏离训练分布的对象类别)时,全景分割的性能会受到严重影响。为克服这一局限,我们提出了全景分布外分割(PoDS),用于联合进行像素级的分布内与分布外语义分类以及实例预测。我们在Cityscapes和BDD100K这两个成熟的全景分割基准上补充了分布外实例分割标注,提出了合适的评价指标,并给出了多个强基线。更重要的是,我们提出了新颖的PoDS架构,它包含共享骨干网络、用于学习全局与局部OOD对象线索的OOD上下文模块,以及带任务特定头部的双对称解码器,后者采用我们的对齐-不匹配策略以获得更好的OOD泛化能力。结合我们的数据增强策略,该方法能够在保持分布内性能的同时逐步学习分布外对象。我们进行了广泛的评估,结果表明所提出的PoDS网络有效地应对了主要挑战,并显著优于基线。数据集、代码和训练好的模型公开发布于 http://pods.cs.uni-freiburg.de。
Progressive3D: Progressively Local Editing for Text-to-3D Content Creation with Complex Semantic Prompts
results: 我们的Progressive3D框架可以实现高精度的3D内容生成,并且可以应对不同的文本至3D方法驱动不同的3D表现。实验结果表明,我们的方法可以实现精确的3D内容生成,并且可以应对复杂的提示。Abstract
Recent text-to-3D generation methods achieve impressive 3D content creation capacity thanks to the advances in image diffusion models and optimizing strategies. However, current methods struggle to generate correct 3D content for a complex prompt in semantics, i.e., a prompt describing multiple interacted objects binding with different attributes. In this work, we propose a general framework named Progressive3D, which decomposes the entire generation into a series of locally progressive editing steps to create precise 3D content for complex prompts, and we constrain the content change to only occur in regions determined by user-defined region prompts in each editing step. Furthermore, we propose an overlapped semantic component suppression technique to encourage the optimization process to focus more on the semantic differences between prompts. Extensive experiments demonstrate that the proposed Progressive3D framework generates precise 3D content for prompts with complex semantics and is general for various text-to-3D methods driven by different 3D representations.
摘要
得益于图像扩散模型和优化策略的进步,现代文本到3D生成方法已经展现出令人印象深刻的3D内容创作能力。然而,当前方法难以针对语义复杂的提示(即描述多个相互交互且绑定不同属性的对象的提示)生成正确的3D内容。在这项工作中,我们提出了一个名为Progressive3D的通用框架,它将整个生成过程分解为一系列局部渐进的编辑步骤,从而为复杂提示创作精确的3D内容;并且在每个编辑步骤中,我们将内容变化约束在由用户定义的区域提示所确定的区域内。此外,我们还提出了重叠语义成分抑制技术,促使优化过程更加关注提示之间的语义差异。大量实验表明,我们提出的Progressive3D框架能够为语义复杂的提示生成精确的3D内容,并且普遍适用于由不同3D表示驱动的各类文本到3D方法。
Multi Task Consistency Guided Source-Free Test-Time Domain Adaptation Medical Image Segmentation
results: 在基准眼底图像分割任务上进行了广泛的实验:与源域模型直接预测相比,在RIM-ONE-r3和Drishti GS数据集上分割Dice得分分别提高了6.27%和0.96%。此外,实验结果表明,我们提出的方法优于现有的竞争性领域自适应分割算法。Abstract
Source-free test-time adaptation for medical image segmentation aims to enhance the adaptability of segmentation models to diverse and previously unseen test sets of the target domain, which contributes to the generalizability and robustness of medical image segmentation models without access to the source domain. Ensuring consistency between target edges and paired inputs is crucial for test-time adaptation. To improve the performance of test-time domain adaptation, we propose a multi-task consistency guided source-free test-time domain adaptation medical image segmentation method which ensures the consistency of the local boundary predictions and the global prototype representation. Specifically, we introduce a local boundary consistency constraint method that explores the relationship between tissue region segmentation and tissue boundary localization tasks. Additionally, we propose a global feature consistency constraint to enhance the intra-class compactness. We conduct extensive experiments on the segmentation of benchmark fundus images. Compared to prediction directly by the source domain model, the segmentation Dice score is improved by 6.27\% and 0.96\% in RIM-ONE-r3 and Drishti GS datasets, respectively. Additionally, the results of experiments demonstrate that our proposed method outperforms existing competitive domain adaptation segmentation algorithms.
摘要
无源测试时自适应的医学图像分割旨在提升分割模型对目标域中多样且此前未见测试集的适应能力,从而在无需访问源域的情况下增强医学图像分割模型的泛化性和鲁棒性。确保目标边缘与配对输入之间的一致性是测试时自适应的关键。为提高测试时域自适应的性能,我们提出了一种多任务一致性引导的无源测试时域自适应医学图像分割方法,保证局部边界预测与全局原型表示的一致性。具体而言,我们引入了局部边界一致性约束方法,挖掘组织区域分割与组织边界定位两个任务之间的关系;此外,我们还提出了全局特征一致性约束,以增强类内紧凑性。我们在基准眼底图像分割任务上进行了广泛实验:与源域模型直接预测相比,在RIM-ONE-r3和Drishti GS数据集上分割Dice得分分别提高了6.27%和0.96%。此外,实验结果还表明,我们的方法优于现有的竞争性领域自适应分割算法。
results: 研究发现,感知尺度主要由刺激的功率谱驱动,而非其他物理变量。此外,研究还揭示了感知尺度与感知背后的生成模型的Fisher信息之间的关系。Abstract
Perception is often viewed as a process that transforms physical variables, external to an observer, into internal psychological variables. Such a process can be modeled by a function coined perceptual scale. The perceptual scale can be deduced from psychophysical measurements that consist in comparing the relative differences between stimuli (i.e. difference scaling experiments). However, this approach is often overlooked by the modeling and experimentation communities. Here, we demonstrate the value of measuring the perceptual scale of classical (spatial frequency, orientation) and less classical physical variables (interpolation between textures) by embedding it in recent probabilistic modeling of perception. First, we show that the assumption that an observer has an internal representation of univariate parameters such as spatial frequency or orientation while stimuli are high-dimensional does not lead to contradictory predictions when following the theoretical framework. Second, we show that the measured perceptual scale corresponds to the transduction function hypothesized in this framework. In particular, we demonstrate that it is related to the Fisher information of the generative model that underlies perception and we test the predictions given by the generative model of different stimuli in a set of difference scaling experiments. Our main conclusion is that the perceptual scale is mostly driven by the stimulus power spectrum. Finally, we propose that this measure of perceptual scale is a way to push further the notion of perceptual distances by estimating the perceptual geometry of images i.e. the path between images instead of simply the distance between them.
摘要
感知通常被看作是将观察者外部的物理变量转换为内部心理变量的过程。这一过程可以用一个称为感知尺度的函数来建模。感知尺度可以通过比较刺激间相对差异的心理物理测量(即差异尺度实验)来推算。然而,这一方法经常被建模和实验领域忽视。我们在这里展示了测量经典物理变量(空间频率、朝向)以及不那么经典的物理变量(纹理之间的插值)的感知尺度的价值,并将其嵌入到近期的感知概率建模框架中。首先,我们表明:在该理论框架下,即使刺激是高维的,假设观察者对空间频率或朝向等一维参数具有内部表征,也不会导致矛盾的预测。其次,我们表明所测得的感知尺度对应于该框架中假设的转导函数。特别地,我们证明它与感知背后生成模型的Fisher信息相关,并在一系列差异尺度实验中检验了不同刺激的生成模型所给出的预测。我们的主要结论是,感知尺度主要由刺激的功率谱驱动。最后,我们提出,这种感知尺度的测量可以通过估计图像的感知几何(即图像之间的路径而不仅仅是距离)来进一步拓展感知距离的概念。
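For readers unfamiliar with the link between perceptual scales and Fisher information, the relation commonly used in this probabilistic framework can be written as below; this is a sketch of the standard form, not the paper's exact derivation.

```latex
% Sketch of the standard relation alluded to above (not the paper's exact derivation):
% if I_F(s) is the Fisher information of the generative model at stimulus value s,
% the transduction (perceptual scale) accumulates discriminability as
\psi(s) \;\propto\; \int_{s_0}^{s} \sqrt{I_F(t)}\, dt ,
% so the slope of the perceptual scale measured in difference-scaling experiments is
\frac{d\psi}{ds}(s) \;\propto\; \sqrt{I_F(s)} .
```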
Domain-Generalized Face Anti-Spoofing with Unknown Attacks
results: 实验结果表明,我们的方法在人脸防伪检测中具有优秀的性能,能够同时处理已知和未知攻击。Abstract
Although face anti-spoofing (FAS) methods have achieved remarkable performance on specific domains or attack types, few studies have focused on the simultaneous presence of domain changes and unknown attacks, which is closer to real application scenarios. To handle domain-generalized unknown attacks, we introduce a new method, DGUA-FAS, which consists of a Transformer-based feature extractor and a synthetic unknown attack sample generator (SUASG). The SUASG network simulates unknown attack samples to assist the training of the feature extractor. Experimental results show that our method achieves superior performance on domain generalization FAS with known or unknown attacks.
摘要
尽管人脸防伪(FAS)方法在特定领域或攻击类型上已取得了很高的性能,但很少有研究关注域变化与未知攻击同时存在的情形,而这更接近真实应用场景。为了处理领域泛化的未知攻击,我们提出了一种新方法DGUA-FAS,它由基于Transformer的特征提取器和合成未知攻击样本生成器(SUASG)组成。SUASG网络通过模拟未知攻击样本来辅助特征提取器的训练。实验结果表明,我们的方法在已知或未知攻击下的领域泛化FAS任务中均取得了优异表现。
results: 该模型在零样本匹配和下游几何估计方面均实现了优越性能,相比之前的方法有大幅提升。Abstract
Finding corresponding pixels within a pair of images is a fundamental computer vision task with various applications. Due to the specific requirements of different tasks like optical flow estimation and local feature matching, previous works are primarily categorized into dense matching and sparse feature matching focusing on specialized architectures along with task-specific datasets, which may somewhat hinder the generalization performance of specialized models. In this paper, we propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching). In particular, we elaborately design a cascaded GRU module for refinement by exploring the geometric similarity iteratively at multiple scales following an additional uncertainty estimation module for sparsification. To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth by generating optical flow supervision with greater intervals. As such, we are able to mix up various dense and sparse matching datasets, significantly improving the training diversity. The generalization capacity of our proposed RGM is greatly improved by learning the matching and uncertainty estimation in a two-stage manner on the large, mixed data. Superior performance is achieved for zero-shot matching and downstream geometry estimation across multiple datasets, outperforming the previous methods by a large margin.
摘要
在一对图像中寻找对应像素是计算机视觉的一项基础任务,具有广泛的应用。由于光流估计和局部特征匹配等任务的特定需求,以往工作主要分为稠密匹配和稀疏特征匹配两类,它们专注于专门的架构和任务特定的数据集,这在一定程度上限制了专用模型的泛化性能。在这篇论文中,我们提出了一种同时适用于稀疏与稠密匹配的深度模型,称为RGM(Robust Generalist Matching,鲁棒通用匹配)。具体来说,我们精心设计了一个级联GRU细化模块,在多个尺度上迭代地挖掘几何相似性,并附加一个不确定性估计模块用于稀疏化。为缩小合成训练样本与真实场景之间的差距,我们构建了一个带有稀疏对应真值的新的大规模数据集,其光流监督信号以更大的帧间隔生成。由此,我们能够混合多种稠密与稀疏匹配数据集,显著提升训练的多样性。通过在大规模混合数据上以两阶段方式学习匹配与不确定性估计,我们提出的RGM的泛化能力得到了极大提升。在多个数据集上,RGM在零样本匹配和下游几何估计方面均表现出色,大幅超越了以往方法。
BanglaAbuseMeme: A Dataset for Bengali Abusive Meme Classification
results: 研究发现,同时使用文本和图像多模态信息的模型性能优于单模态模型,最佳模型取得了70.51的macro F1分数。此外,研究还对各最佳模型误分类的表情包进行了定性错误分析。Abstract
The dramatic increase in the use of social media platforms for information sharing has also fueled a steep growth in online abuse. A simple yet effective way of abusing individuals or communities is by creating memes, which often integrate an image with a short piece of text layered on top of it. Such harmful elements are in rampant use and are a threat to online safety. Hence it is necessary to develop efficient models to detect and flag abusive memes. The problem becomes more challenging in a low-resource setting (e.g., Bengali memes, i.e., images with Bengali text embedded on it) because of the absence of benchmark datasets on which AI models could be trained. In this paper we bridge this gap by building a Bengali meme dataset. To setup an effective benchmark we implement several baseline models for classifying abusive memes using this dataset. We observe that multimodal models that use both textual and visual information outperform unimodal models. Our best-performing model achieves a macro F1 score of 70.51. Finally, we perform a qualitative error analysis of the misclassified memes of the best-performing text-based, image-based and multimodal models.
摘要
社交媒体平台在信息分享方面的使用量急剧增长,同时也助长了在线辱骂的激增。辱骂个人或群体的一种简单而有效的方式是制作表情包(meme),它通常在图片上叠加一小段文字。此类有害内容泛滥成灾,对在线安全构成威胁,因此有必要开发高效的模型来检测并标记辱骂性表情包。在低资源环境下(例如孟加拉语表情包,即嵌入孟加拉语文字的图片),由于缺乏可用于训练AI模型的基准数据集,这一问题更具挑战性。本文通过构建一个孟加拉语表情包数据集来填补这一空白。为建立有效的基准,我们在该数据集上实现了多个用于辱骂表情包分类的基线模型。我们观察到,同时利用文本与视觉信息的多模态模型优于单模态模型,最佳模型取得了70.51的macro F1分数。最后,我们对表现最佳的文本、图像和多模态模型误分类的表情包进行了定性错误分析。
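A typical multimodal baseline of the kind benchmarked here simply fuses a text embedding and an image embedding before a small classification head. The sketch below shows such a late-fusion classifier; the encoders, feature dimensions, and head size are placeholder assumptions, not the paper's exact models.

```python
# Toy late-fusion classifier: a text embedding and an image embedding are
# concatenated and fed to an MLP that predicts abusive vs. non-abusive.
import torch
import torch.nn as nn

class MemeFusionClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 2048, n_classes: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([text_feat, image_feat], dim=-1))

# Usage: text_feat could come from a Bengali BERT-style encoder, image_feat from a CNN.
model = MemeFusionClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 2048))
```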
DBDNet:Partial-to-Partial Point Cloud Registration with Dual Branches Decoupling
results: 我们在合成数据集和真实数据集上进行了实验,验证了所提方法的有效性。Abstract
Point cloud registration plays a crucial role in various computer vision tasks, and usually demands the resolution of partial overlap registration in practice. Most existing methods perform a serial calculation of rotation and translation, while jointly predicting overlap during registration, this coupling tends to degenerate the registration performance. In this paper, we propose an effective registration method with dual branches decoupling for partial-to-partial registration, dubbed as DBDNet. Specifically, we introduce a dual branches structure to eliminate mutual interference error between rotation and translation by separately creating two individual correspondence matrices. For partial-to-partial registration, we consider overlap prediction as a preordering task before the registration procedure. Accordingly, we present an overlap predictor that benefits from explicit feature interaction, which is achieved by the powerful attention mechanism to accurately predict pointwise masks. Furthermore, we design a multi-resolution feature extraction network to capture both local and global patterns thus enhancing both overlap prediction and registration module. Experimental results on both synthetic and real datasets validate the effectiveness of our proposed method.
摘要
点云配准在多种计算机视觉任务中发挥着关键作用,实践中通常需要解决部分重叠的配准问题。现有方法大多串行计算旋转和平移,并在配准过程中同时预测重叠区域,这种耦合往往会降低配准性能。在本文中,我们提出了一种高效的部分到部分配准方法DBDNet,采用双分支解耦结构。具体来说,我们引入双分支结构,通过分别构建两个独立的对应矩阵来消除旋转与平移之间的相互干扰误差。对于部分到部分配准,我们将重叠预测视为配准流程之前的前置任务,并据此提出了一个受益于显式特征交互的重叠预测器,借助强大的注意力机制准确预测逐点掩码。此外,我们设计了一个多分辨率特征提取网络,以同时捕捉局部和全局模式,从而提升重叠预测与配准模块的性能。在合成数据集和真实数据集上的实验结果验证了我们所提方法的有效性。
VST++: Efficient and Stronger Visual Saliency Transformer
results: 实验结果表明,该模型在RGB、RGB-D 和 RGB-T SOD 数据集上的表现比现有方法更好,同时减少了25%的计算成本。Abstract
While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e. VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.
摘要
尽管此前基于CNN的模型在显著性目标检测(SOD)方面已展现出可喜的结果,但其挖掘全局长程依赖的能力有限。我们先前的工作Visual Saliency Transformer(VST)从基于Transformer的序列到序列视角出发,解决了这一限制,并统一了RGB与RGB-D的SOD任务。在VST中,我们开发了一个多任务Transformer解码器,在纯Transformer架构中同时预测显著性与边界结果;此外,我们还提出了一种名为reverse T2T的新颖token上采样方法,使得在基于Transformer的结构中可以轻松地预测高分辨率显著图。在VST模型的基础上,本文进一步提出了更高效、更强大的版本VST++。为降低VST模型的计算开销,我们提出了选择-整合注意力(SIA)模块,将前景划分为细粒度片段,并把背景信息聚合为单个粗粒度token。为了以较低代价引入3D深度信息,我们设计了一种专为深度图定制的新型深度位置编码方法。此外,我们还引入了token监督的预测损失,为任务相关的token提供直接的指导。我们在RGB、RGB-D和RGB-T的SOD基准数据集上,基于多种Transformer骨干网络评估了VST++模型。实验结果表明,我们的模型在性能上超越了现有方法,同时计算开销降低了25%而没有明显的性能损失。VST++所展现出的强泛化能力、更高性能和更高效率凸显了其潜力。
results: AVSA-Sep能够成功分离两类声音,并通过联合训练和跨模态对齐进一步提升效果。Abstract
The audio-visual sound separation field assumes visible sources in videos, but this excludes invisible sounds beyond the camera's view. Current methods struggle with such sounds lacking visible cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation. AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing effectiveness.
摘要
视听声音分离领域通常假设声源在视频中可见,这排除了摄像头视野之外的不可见声音。现有方法在处理这类缺乏视觉线索的声音时表现不佳。本文提出了一种新颖的"视听场景感知分离"(AVSA-Sep)框架,它包括一个针对可见与不可见声音的语义解析器,以及一个基于场景信息的分离器。AVSA-Sep能够成功分离这两类声音,联合训练与跨模态对齐进一步提升了效果。
DPF-Nutrition: Food Nutrition Estimation via Depth Prediction and Fusion
results: 我们在Nutrition5k上进行了充分的实验,并证明了DPF-Nutrition的有效性和效率。Abstract
A reasonable and balanced diet is essential for maintaining good health. With the advancements in deep learning, automated nutrition estimation method based on food images offers a promising solution for monitoring daily nutritional intake and promoting dietary health. While monocular image-based nutrition estimation is convenient, efficient, and economical, the challenge of limited accuracy remains a significant concern. To tackle this issue, we proposed DPF-Nutrition, an end-to-end nutrition estimation method using monocular images. In DPF-Nutrition, we introduced a depth prediction module to generate depth maps, thereby improving the accuracy of food portion estimation. Additionally, we designed an RGB-D fusion module that combined monocular images with the predicted depth information, resulting in better performance for nutrition estimation. To the best of our knowledge, this was the pioneering effort that integrated depth prediction and RGB-D fusion techniques in food nutrition estimation. Comprehensive experiments performed on Nutrition5k evaluated the effectiveness and efficiency of DPF-Nutrition.
摘要
合理均衡的饮食对保持健康至关重要。随着深度学习的发展,基于食物图像的自动营养估计方法为监测每日营养摄入、促进膳食健康提供了一种有前景的解决方案。虽然基于单目图像的营养估计方便、高效且经济,但其准确性有限的问题仍然值得关注。为解决这一问题,我们提出了DPF-Nutrition,一种基于单目图像的端到端营养估计方法。在DPF-Nutrition中,我们引入了深度预测模块来生成深度图,从而提高食物分量估计的准确性;此外,我们设计了一个RGB-D融合模块,将单目图像与预测的深度信息相结合,使营养估计取得更好的效果。据我们所知,这是首个将深度预测与RGB-D融合技术结合用于食物营养估计的工作。我们在Nutrition5k上进行了全面的实验,验证了DPF-Nutrition的有效性和效率。
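The RGB-D fusion step can be pictured as concatenating the predicted depth map with the RGB image channel-wise before a small convolutional block and a regression head. The sketch below illustrates this; layer sizes and the four-value nutrition output are assumptions, not the paper's exact module.

```python
# Rough sketch of RGB-D fusion for nutrition estimation: predicted depth is
# concatenated with RGB, fused by convolutions, then pooled into a regression head.
import torch
import torch.nn as nn

class RGBDFusion(nn.Module):
    def __init__(self, out_ch: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 + 1, out_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(out_ch, 4))  # e.g. calories/fat/carbs/protein

    def forward(self, rgb: torch.Tensor, pred_depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, pred_depth], dim=1)   # (N, 4, H, W)
        return self.head(self.fuse(x))

# Usage: pred_depth would come from the monocular depth-prediction module.
model = RGBDFusion()
nutrition = model(torch.randn(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
```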
MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision
results: 在HO3D和DexYCB数据集上进行的大量实验表明,采用2D监督的MOHO相比3D监督方法取得了明显更优的结果。Abstract
Previous works concerning single-view hand-held object reconstruction typically utilize supervision from 3D ground truth models, which are hard to collect in real world. In contrast, abundant videos depicting hand-object interactions can be accessed easily with low cost, although they only give partial object observations with complex occlusion. In this paper, we present MOHO to reconstruct hand-held object from a single image with multi-view supervision from hand-object videos, tackling two predominant challenges including object's self-occlusion and hand-induced occlusion. MOHO inputs semantic features indicating visible object parts and geometric embeddings provided by hand articulations as partial-to-full cues to resist object's self-occlusion, so as to recover full shape of the object. Meanwhile, a novel 2D-3D hand-occlusion-aware training scheme following the synthetic-to-real paradigm is proposed to release hand-induced occlusion. In the synthetic pre-training stage, 2D-3D hand-object correlations are constructed by supervising MOHO with rendered images to complete the hand-concealed regions of the object in both 2D and 3D space. Subsequently, MOHO is finetuned in real world by the mask-weighted volume rendering supervision adopting hand-object correlations obtained during pre-training. Extensive experiments on HO3D and DexYCB datasets demonstrate that 2D-supervised MOHO gains superior results against 3D-supervised methods by a large margin. Codes and key assets will be released soon.
摘要
以往关于单视角手持物体重建的工作通常依赖3D真值模型的监督,而这类数据在真实世界中很难采集。相比之下,大量描绘手-物体交互的视频可以低成本地轻松获取,尽管它们只能提供带有复杂遮挡的部分物体观测。在本文中,我们提出MOHO,利用来自手-物体视频的多视角监督,从单张图像重建手持物体,并应对两大主要挑战:物体的自遮挡和手造成的遮挡。MOHO将指示可见物体部分的语义特征和由手部关节提供的几何嵌入作为从部分到完整的线索输入,以对抗物体的自遮挡,从而恢复物体的完整形状。同时,我们提出了一种遵循"合成到真实"范式的新颖2D-3D手部遮挡感知训练方案,以消除手造成的遮挡:在合成预训练阶段,通过以渲染图像监督MOHO,在2D和3D空间中补全被手遮挡的物体区域,从而建立2D-3D手-物体关联;随后,在真实世界中利用预训练阶段获得的手-物体关联,以掩码加权的体渲染监督对MOHO进行微调。在HO3D和DexYCB数据集上的大量实验表明,采用2D监督的MOHO大幅超越了3D监督的方法。代码和关键资源将很快发布。
results: 在多个2D图像(CUB和AwA)和3D点云(ModelNet10、ModelNet40和ScanObjectNN) dataset上证明了提高ZSL性能。Abstract
Zero-shot learning (ZSL) aims to classify objects that are not observed or seen during training. It relies on class semantic description to transfer knowledge from the seen classes to the unseen classes. Existing methods of obtaining class semantics include manual attributes or automatic word vectors from language models (like word2vec). We know attribute annotation is costly, whereas automatic word-vectors are relatively noisy. To address this problem, we explore how ChatGPT, a large language model, can enhance class semantics for ZSL tasks. ChatGPT can be a helpful source to obtain text descriptions for each class containing related attributes and semantics. We use the word2vec model to get a word vector using the texts from ChatGPT. Then, we enrich word vectors by combining the word embeddings from class names and descriptions generated by ChatGPT. More specifically, we leverage ChatGPT to provide extra supervision for the class description, eventually benefiting ZSL models. We evaluate our approach on various 2D image (CUB and AwA) and 3D point cloud (ModelNet10, ModelNet40, and ScanObjectNN) datasets and show that it improves ZSL performance. Our work contributes to the ZSL literature by applying ChatGPT for class semantics enhancement and proposing a novel word vector fusion method.
摘要
零样本学习(ZSL)旨在对训练期间未观察到的物体类别进行分类,它依靠类别语义描述将知识从已见类别迁移到未见类别。现有获取类别语义的方法包括人工标注的属性或来自语言模型(如word2vec)的自动词向量。属性标注成本高昂,而自动词向量则相对噪声较大。为解决这一问题,我们探索了大语言模型ChatGPT如何为ZSL任务增强类别语义。ChatGPT可以为每个类别提供包含相关属性和语义的文本描述。我们使用word2vec模型从ChatGPT生成的文本中获得词向量,然后将类别名称的词嵌入与ChatGPT生成的描述的词嵌入结合,以丰富词向量。更具体地说,我们利用ChatGPT为类别描述提供额外的监督,最终使ZSL模型受益。我们在多个2D图像数据集(CUB和AwA)以及3D点云数据集(ModelNet10、ModelNet40和ScanObjectNN)上评估了我们的方法,结果表明它能够提升ZSL性能。我们的工作通过将ChatGPT用于类别语义增强并提出一种新颖的词向量融合方法,为ZSL领域做出了贡献。
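The word-vector fusion idea can be sketched as averaging word vectors over the class name and over the ChatGPT-generated description, then concatenating the two. In the sketch below, word_vec is a placeholder for a pretrained word2vec lookup, and the description string is assumed to come from prompting ChatGPT; neither is the paper's exact pipeline.

```python
# Sketch of the class-semantics fusion: mean word vector of the class name plus mean
# word vector of a generated description, concatenated into one class embedding.
import numpy as np

DIM = 300
rng = np.random.default_rng(0)
_vocab = {}

def word_vec(token: str) -> np.ndarray:
    # Placeholder for a pretrained word2vec lookup (e.g. GoogleNews vectors).
    if token not in _vocab:
        _vocab[token] = rng.normal(size=DIM)
    return _vocab[token]

def class_embedding(class_name: str, description: str) -> np.ndarray:
    name_vec = np.mean([word_vec(t) for t in class_name.lower().split()], axis=0)
    desc_vec = np.mean([word_vec(t) for t in description.lower().split()], axis=0)
    return np.concatenate([name_vec, desc_vec])     # (600,) fused class semantics

# Usage: the description would be obtained by prompting ChatGPT for class attributes.
emb = class_embedding("red winged blackbird",
                      "a black songbird with red and yellow shoulder patches")
```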
VKIE: The Application of Key Information Extraction on Video Text
results: 在一个具体定义的数据集上进行了广泛的实验,结果表明我们的解决方案可以实现很高的性能和高效的推理速度。Abstract
Extracting structured information from videos is critical for numerous downstream applications in the industry. In this paper, we define a significant task of extracting hierarchical key information from visual texts on videos. To fulfill this task, we decouple it into four subtasks and introduce two implementation solutions called PipVKIE and UniVKIE. PipVKIE sequentially completes the four subtasks in continuous stages, while UniVKIE is improved by unifying all the subtasks into one backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions can achieve remarkable performance and efficient inference speed. The code and dataset will be publicly available.
摘要
从视频中提取结构化信息对业界众多下游应用至关重要。本文定义了一项重要任务:从视频中的可视文本中提取层次化关键信息。为完成该任务,我们将其分解为四个子任务,并提出了两种实现方案:PipVKIE和UniVKIE。PipVKIE在连续的阶段中依次完成这四个子任务,而UniVKIE则将所有子任务统一到同一个骨干网络中加以改进。两种方案都利用来自视觉、文本和坐标的多模态信息进行特征表示。在一个明确定义的数据集上的大量实验表明,我们的方案能够取得出色的性能和高效的推理速度。代码和数据集将会公开。
Towards Abdominal 3-D Scene Rendering from Laparoscopy Surgical Videos using NeRFs
results: 实验结果表明,NeRF技术可以有效地将腹腔镜手术视频渲染为三维场景,但该方法仍面临一些挑战,需要进一步研究。Abstract
Given that a conventional laparoscope only provides a two-dimensional (2-D) view, the detection and diagnosis of medical ailments can be challenging. To overcome the visual constraints associated with laparoscopy, the use of laparoscopic images and videos to reconstruct the three-dimensional (3-D) anatomical structure of the abdomen has proven to be a promising approach. Neural Radiance Fields (NeRFs) have recently gained attention thanks to their ability to generate photorealistic images from a 3-D static scene, thus facilitating a more comprehensive exploration of the abdomen through the synthesis of new views. This distinguishes NeRFs from alternative methods such as Simultaneous Localization and Mapping (SLAM) and depth estimation. In this paper, we present a comprehensive examination of NeRFs in the context of laparoscopy surgical videos, with the goal of rendering abdominal scenes in 3-D. Although our experimental results are promising, the proposed approach encounters substantial challenges, which require further exploration in future research.
摘要
由于传统腹腔镜只能提供二维(2-D)视图,疾病的检测和诊断可能颇具挑战。为克服腹腔镜检查的视觉限制,利用腹腔镜图像和视频重建腹腔的三维(3-D)解剖结构已被证明是一种很有前景的方法。神经辐射场(NeRF)最近受到关注,因为它能够从三维静态场景生成逼真的图像,通过合成新视角帮助更全面地观察腹腔,这使NeRF有别于同步定位与建图(SLAM)和深度估计等替代方法。在本文中,我们对NeRF在腹腔镜手术视频中的应用进行了全面考察,目标是以三维方式渲染腹腔场景。尽管我们的实验结果令人鼓舞,但所提出的方法仍面临不少挑战,需要在未来的研究中进一步探索。
results: 对比多种现有方法,该方法能够生成高质量的解决方案,并且比现有方法更有效率。Abstract
The optimal placement of sensors for environmental monitoring and disaster management is a challenging problem due to its NP-hard nature. Traditional methods for sensor placement involve exact, approximation, or heuristic approaches, with the latter being the most widely used. However, heuristic methods are limited by expert intuition and experience. Deep learning (DL) has emerged as a promising approach for generating heuristic algorithms automatically. In this paper, we introduce a novel sensor placement approach focused on learning improvement heuristics using deep reinforcement learning (RL) methods. Our approach leverages an RL formulation for learning improvement heuristics, driven by an actor-critic algorithm for training the policy network. We compare our method with several state-of-the-art approaches by conducting comprehensive experiments, demonstrating the effectiveness and superiority of our proposed approach in producing high-quality solutions. Our work presents a promising direction for applying advanced DL and RL techniques to challenging climate sensor placement problems.
摘要
用于环境监测和灾害管理的传感器最优布设由于其NP难的性质而颇具挑战性。传统的传感器布设方法包括精确方法、近似方法和启发式方法,其中启发式方法应用最广,但受限于专家的直觉和经验。深度学习(DL)已成为自动生成启发式算法的一种有前景的途径。在本文中,我们提出了一种新颖的传感器布设方法,重点在于利用深度强化学习(RL)方法学习改进型启发式。我们的方法将学习改进型启发式表述为一个强化学习问题,并采用actor-critic算法来训练策略网络。我们通过全面的实验将该方法与多种最先进的方法进行比较,证明了所提方法在产生高质量解方面的有效性和优越性。我们的工作为将先进的深度学习和强化学习技术应用于具有挑战性的气候传感器布设问题指明了一个有前景的方向。
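A toy version of "learning an improvement heuristic" with RL is sketched below: a policy proposes relocating one sensor at a time and is rewarded by the change in a simple coverage score, trained with REINFORCE. The grid world, coverage objective, and network sizes are stand-ins, not the paper's problem setup or actor-critic architecture.

```python
# Toy improvement-heuristic learner for sensor placement (REINFORCE, not actor-critic).
import torch
import torch.nn as nn

GRID, N_SENSORS, RADIUS = 8, 3, 2.0
cells = torch.stack(torch.meshgrid(torch.arange(GRID), torch.arange(GRID),
                                   indexing="ij"), dim=-1).reshape(-1, 2).float()

def coverage(placement: torch.Tensor) -> torch.Tensor:
    # Fraction of grid cells within RADIUS of at least one sensor.
    d = torch.cdist(cells, placement.float())
    return (d.min(dim=1).values <= RADIUS).float().mean()

# The policy scores every (sensor, target cell) relocation move given the current layout.
policy = nn.Sequential(nn.Linear(N_SENSORS * 2, 64), nn.ReLU(),
                       nn.Linear(64, N_SENSORS * GRID * GRID))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

placement = torch.randint(0, GRID, (N_SENSORS, 2))
for step in range(200):
    state = placement.flatten().float() / GRID
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    sensor, cell = divmod(int(action), GRID * GRID)
    candidate = placement.clone()
    candidate[sensor] = cells[cell].long()
    reward = coverage(candidate) - coverage(placement)   # improvement signal
    loss = -dist.log_prob(action) * reward               # REINFORCE update
    opt.zero_grad(); loss.backward(); opt.step()
    if reward > 0:                                        # keep only improving moves
        placement = candidate
print("final coverage:", coverage(placement).item())
```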
Online Learning and Planning in Cognitive Hierarchies
results: 研究人员通过扩展Clark et al.(2016)的形式化框架,实现了复杂机器人系统的可靠而有效的集成与决策过程。此外,新框架还允许更灵活地建模不同决策组件之间的交互。Abstract
Complex robot behaviour typically requires the integration of multiple robotic and Artificial Intelligence (AI) techniques and components. Integrating such disparate components into a coherent system, while also ensuring global properties and behaviours, is a significant challenge for cognitive robotics. Using a formal framework to model the interactions between components can be an important step in dealing with this challenge. In this paper we extend an existing formal framework [Clark et al., 2016] to model complex integrated reasoning behaviours of robotic systems; from symbolic planning through to online learning of policies and transition systems. Furthermore the new framework allows for a more flexible modelling of the interactions between different reasoning components.
摘要
复杂的机器人行为通常需要集成多种机器人技术与人工智能(AI)技术和组件。将这些彼此迥异的组件集成为一个连贯的系统,同时保证全局性质和行为,是认知机器人学面临的重大挑战。使用形式化框架来建模组件之间的交互,是应对这一挑战的重要一步。在本文中,我们扩展了Clark et al.(2016)提出的形式化框架,用以建模机器人系统从符号化规划到策略与转移系统在线学习的复杂集成推理行为。此外,新框架还允许更灵活地建模不同推理组件之间的交互。
Solving Hard Analogy Questions with Relation Embedding Chains
results: 本研究提出了一种将路径与关系嵌入相结合的方法,并通过实验表明其有助于求解困难的类比问题。Abstract
Modelling how concepts are related is a central topic in Lexical Semantics. A common strategy is to rely on knowledge graphs (KGs) such as ConceptNet, and to model the relation between two concepts as a set of paths. However, KGs are limited to a fixed set of relation types, and they are incomplete and often noisy. Another strategy is to distill relation embeddings from a fine-tuned language model. However, this is less suitable for words that are only indirectly related and it does not readily allow us to incorporate structured domain knowledge. In this paper, we aim to combine the best of both worlds. We model relations as paths but associate their edges with relation embeddings. The paths are obtained by first identifying suitable intermediate words and then selecting those words for which informative relation embeddings can be obtained. We empirically show that our proposed representations are useful for solving hard analogy questions.
摘要
建模概念之间的关系是词汇语义学的一个核心课题。一种常见策略是依赖ConceptNet等知识图(KG),将两个概念之间的关系建模为一组路径。然而,知识图局限于固定的关系类型集合,且往往不完整并含有噪声。另一种策略是从微调后的语言模型中蒸馏关系嵌入,但这对仅间接相关的词语不太适用,也难以融入结构化的领域知识。在本文中,我们力求将两者的优势结合起来:我们将关系建模为路径,同时为路径上的边关联关系嵌入。这些路径的获取方式是,先识别合适的中间词,再从中选出能够得到信息量充足的关系嵌入的词。实验表明,我们所提出的表示对求解困难的类比问题很有帮助。
results: 实验结果表明,该技术可以在不同的测试时适应 benchmark 上达到竞争力的分类性能。Abstract
Deep Learning models have shown remarkable performance in a broad range of vision tasks. However, they are often vulnerable to domain shifts at test-time. Test-time training (TTT) methods have been developed in an attempt to mitigate these vulnerabilities, where a secondary task is solved at training time simultaneously with the main task, to be later used as a self-supervised proxy task at test-time. In this work, we propose a novel unsupervised TTT technique based on the maximization of Mutual Information between multi-scale feature maps and a discrete latent representation, which can be integrated into the standard training as an auxiliary clustering task. Experimental results demonstrate competitive classification performance on different popular test-time adaptation benchmarks.
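A minimal sketch of the kind of auxiliary clustering objective described above, assuming pooled multi-scale features and a discrete latent of K clusters. The mutual-information surrogate used here, H(marginal assignment) minus the mean per-sample entropy, is a common InfoMax-style choice and an illustrative stand-in rather than the paper's exact formulation; dimensions and the cluster count are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterHead(nn.Module):
    """Maps pooled features to soft assignments over K discrete latent codes."""
    def __init__(self, in_dim: int, num_clusters: int = 10):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_clusters)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, in_dim) pooled features from one scale of the backbone
        return F.softmax(self.proj(feats), dim=-1)

def mutual_information_loss(assignments: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative MI surrogate: maximize H(marginal) - mean per-sample entropy.

    High marginal entropy spreads samples across clusters; low conditional
    entropy makes each individual assignment confident.
    """
    marginal = assignments.mean(dim=0)                                    # (K,)
    h_marginal = -(marginal * (marginal + eps).log()).sum()
    h_conditional = -(assignments * (assignments + eps).log()).sum(dim=1).mean()
    return -(h_marginal - h_conditional)   # minimized during training

# Toy usage: one clustering head per feature scale, losses summed.
if __name__ == "__main__":
    torch.manual_seed(0)
    feats_scale1 = torch.randn(32, 128)    # pooled features from an early layer
    feats_scale2 = torch.randn(32, 256)    # pooled features from a deeper layer
    heads = [ClusterHead(128), ClusterHead(256)]
    loss = sum(mutual_information_loss(h(f))
               for h, f in zip(heads, (feats_scale1, feats_scale2)))
    print("auxiliary clustering loss:", float(loss))
```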
摘要
Eliminating Reasoning via Inferring with Planning: A New Framework to Guide LLMs’ Non-linear Thinking
paper_authors: Yongqi Tong, Yifan Wang, Dawei Li, Sizhe Wang, Zi Lin, Simeng Han, Jingbo Shang for: 这研究旨在强化大语言模型(LLM)的高级逻辑能力,通过模拟人类线性思维和逻辑的混合。methods: 这研究提出了新的提示方法,即排除逻辑提示(IEP),它将排除逻辑和推理结合起来,以便LLM可以更好地模拟人类的非线性思维。results: 研究发现,IEP可以在多种任务上consistently outperform CoT,并且可以和CoT结合使用,以提高LLM的表现。此外,研究还引入了新的benchmark,即MENTAL-ABILITY REASONING BENCHMARK(MARB),以评估LLM的逻辑和语言理解能力。Abstract
Chain-of-Thought (CoT) prompting and its variants explore equipping large language models (LLMs) with high-level reasoning abilities by emulating human-like linear cognition and logic. However, the human mind is complicated and mixed with both linear and nonlinear thinking. In this work, we propose Inferential Exclusion Prompting (IEP), a novel prompting that combines the principles of elimination and inference in order to guide LLMs to think non-linearly. IEP guides LLMs to plan and then utilize Natural Language Inference (NLI) to deduce each possible solution's entailment relation with context, commonsense, or facts, therefore yielding a broader perspective by thinking back for inferring. This forward planning and backward eliminating process allows IEP to better simulate the complex human thinking processes compared to other CoT-based methods, which only reflect linear cognitive processes. We conducted a series of empirical studies and have corroborated that IEP consistently outperforms CoT across various tasks. Additionally, we observe that integrating IEP and CoT further improves the LLMs' performance on certain tasks, highlighting the necessity of equipping LLMs with mixed logic processes. Moreover, to better evaluate comprehensive features inherent in human logic, we introduce the Mental-Ability Reasoning Benchmark (MARB). The benchmark comprises six novel subtasks with a total of 9,115 questions, among which 1,685 are developed with hand-crafted rationale references. We believe both IEP and MARB can serve as a promising direction for unveiling LLMs' logic and verbal reasoning abilities and drive further advancements. MARB will be available at [anonymity link] soon.
摘要
Chain-of-Thought(CoT)提示和其变种探索将大型语言模型(LLM)具备高级思维能力,通过模拟人类线性认知和逻辑。然而,人类思维是复杂的,混合了线性和非线性思维。在这项工作中,我们提出了《排除并推理》(IEP)提示,它结合排除和推理的原理,以引导 LLM 进行非线性思维。IEP 使 LLM 可以规划,然后通过自然语言推理(NLI)来推理每个可能解的上下文、通用智慧和事实的关系,从而获得更广泛的视野。这种前置规划和后置排除过程使 IEP 更能模拟人类思维过程,相比其他 CoT 基于方法。我们进行了一系列实验研究,并证明 IEP 在多种任务上表现出色。此外,我们发现将 IEP 和 CoT 集成可以进一步提高 LLMS 的表现,强调了训练 LLMs 的混合逻辑过程的必要性。此外,为了更好地评估人类逻辑的全面特征,我们引入了《MENTAL-ABILITY REASONING BENCHMARK》(MARB)。 MARB 包括六个新的任务,共计 9,115 个问题,其中 1,685 个问题采用了手动制作的 rational references。我们认为 IEP 和 MARB 都可以成为探索 LLMs 逻辑和语言逻辑能力的有希望的方向,并驱动进一步的进步。MARB 将在 ~\texttt{anonymity link} 上公开。
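A minimal sketch of the plan-then-eliminate loop described above. The `llm` argument is a hypothetical stand-in for any chat-completion call, the prompts are invented for illustration, and the NLI step is reduced to asking the model for an entailment label; this is only one possible instantiation of the idea, not the authors' exact prompting scheme.

```python
from typing import Callable, List

def inferential_exclusion(question: str, candidates: List[str],
                          llm: Callable[[str], str], context: str = "") -> str:
    """Forward planning + backward elimination over candidate answers."""
    # 1) Forward planning: ask what facts each candidate answer would imply.
    plans = {
        c: llm(f"Question: {question}\nAssume the answer is '{c}'. "
               "List the facts that must then hold.")
        for c in candidates
    }

    # 2) Backward elimination: check each plan against context/commonsense
    #    with an NLI-style prompt and drop contradicted candidates.
    surviving = []
    for cand, plan in plans.items():
        verdict = llm(
            f"Context: {context}\nHypothesis: {plan}\n"
            "Does the context (or commonsense) contradict the hypothesis? "
            "Answer exactly one of: entailment, neutral, contradiction."
        ).strip().lower()
        if "contradiction" not in verdict:
            surviving.append(cand)

    # 3) Answer from the survivors (fall back to all candidates if none survive).
    pool = surviving or candidates
    return llm(f"Question: {question}\nRemaining options: {pool}\nPick the single best option.")

# Example with a trivial rule-based stand-in for the LLM call.
if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        if "Answer exactly one of" in prompt:
            return "contradiction" if "Mars" in prompt else "entailment"
        if "Pick the single best option" in prompt:
            return "Earth"
        planet = "Mars" if "Mars" in prompt else "Earth"
        return f"Humans can breathe unaided on {planet}."
    print(inferential_exclusion("Which planet do humans live on?", ["Mars", "Earth"], fake_llm))
```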
Opportunities for Adaptive Experiments to Enable Continuous Improvement that Trades-off Instructor and Researcher Incentives
results: 这篇论文的实验结果表明,使用adaptive experimentation方法可以更好地支持学生的需求,并提高学生的学习效果。Abstract
Randomized experimental comparisons of alternative pedagogical strategies could provide useful empirical evidence in instructors' decision-making. However, traditional experiments do not have a clear and simple pathway to using data rapidly to try to increase the chances that students in an experiment get the best conditions. Drawing inspiration from the use of machine learning and experimentation in product development at leading technology companies, we explore how adaptive experimentation might help in continuous course improvement. In adaptive experiments, as different arms/conditions are deployed to students, data is analyzed and used to change the experience for future students. This can be done using machine learning algorithms to identify which actions are more promising for improving student experience or outcomes. This algorithm can then dynamically deploy the most effective conditions to future students, resulting in better support for students' needs. We illustrate the approach with a case study providing a side-by-side comparison of traditional and adaptive experimentation of self-explanation prompts in online homework problems in a CS1 course. This provides a first step in exploring the future of how this methodology can be useful in bridging research and practice in doing continuous improvement.
摘要
随机实验比较不同的教学策略可以提供有用的实际证据,帮助教师做出决策。然而,传统的实验没有一个明确的和简单的数据使用路径,这限制了学生在实验中获得最佳条件的机会。我们从技术公司的产品开发中使用机器学习和实验的经验而来,探讨如何使用适应试验来促进课程不断改进。在适应试验中,不同的臂/条件在学生面前采用,并分析数据,以改善未来学生的经验。这可以使用机器学习算法来确定哪些行动更有前途的提高学生体验或成绩。这个算法然后会在未来学生面前动态部署最有效的条件,从而提供更好的学生需求支持。我们通过一个案例研究,对传统和适应试验自适应提示在线作业问题的比较,以示方法的可行性。这是继续改进的未来的一个初步探索。
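As a concrete illustration of the adaptive mechanism sketched above, the snippet below runs Thompson sampling over two hypothetical homework conditions with Bernoulli outcomes (for example, whether a student answers the next problem correctly). The condition names and success rates are invented for the example; the case study's actual algorithm and outcome measures may differ.

```python
import random

def thompson_sampling(true_rates, n_students=2000, seed=0):
    """Assign each arriving student to the condition whose Beta-posterior draw looks best."""
    rng = random.Random(seed)
    arms = list(true_rates)
    successes = {a: 1 for a in arms}    # Beta(1, 1) uniform priors
    failures = {a: 1 for a in arms}
    assignments = {a: 0 for a in arms}

    for _ in range(n_students):
        # Sample a plausible success rate per condition and deploy the best one.
        sampled = {a: rng.betavariate(successes[a], failures[a]) for a in arms}
        arm = max(sampled, key=sampled.get)
        assignments[arm] += 1
        # Observe the (simulated) outcome and update that condition's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return assignments, successes, failures

if __name__ == "__main__":
    # Hypothetical conditions: plain homework prompt vs. self-explanation prompt.
    rates = {"no_prompt": 0.55, "self_explanation_prompt": 0.62}
    assignments, s, f = thompson_sampling(rates)
    print("students per condition:", assignments)   # most students drift to the better arm
    print("posterior means:", {a: round(s[a] / (s[a] + f[a]), 3) for a in rates})
```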
The Sentiment Problem: A Critical Survey towards Deconstructing Sentiment Analysis
methods: 研究者通过审查 189 篇同行评审文章,探讨 SA 在不同领域中的应用和模型,以及数据集的问题。
results: 研究发现 SA 在不同领域中的定义和应用存在差异,导致可能的挑战和偏见。为解决这问题,研究者提出了一个伦理卡,以帮助实践者在使用 SA 时确保公正使用。Abstract
We conduct an inquiry into the sociotechnical aspects of sentiment analysis (SA) by critically examining 189 peer-reviewed papers on their applications, models, and datasets. Our investigation stems from the recognition that SA has become an integral component of diverse sociotechnical systems, exerting influence on both social and technical users. By delving into sociological and technological literature on sentiment, we unveil distinct conceptualizations of this term in domains such as finance, government, and medicine. Our study exposes a lack of explicit definitions and frameworks for characterizing sentiment, resulting in potential challenges and biases. To tackle this issue, we propose an ethics sheet encompassing critical inquiries to guide practitioners in ensuring equitable utilization of SA. Our findings underscore the significance of adopting an interdisciplinary approach to defining sentiment in SA and offer a pragmatic solution for its implementation.
摘要
我们进行了一个关于社会技术方面的情感分析(SA)的调查, kritically examining 189 peer-reviewed papers on their applications, models, and datasets。我们的调查源于认识到SA已成为多种社会技术系统的重要组成部分,影响社会和技术用户。通过探究社会学和技术文献中的情感概念,我们揭示了不同领域中情感的不同定义和概念化。我们的研究发现了情感定义和框架的明确性不足,可能导致挑战和偏见。为解决这个问题,我们提议一份伦理宣言,涵盖了重要的伦理问题,以帮助实践者在使用SA时确保公正使用。我们的发现表明了采用多科学方法来定义情感在SA中的重要性,并提供了一个实用的解决方案。
A Unifying Framework for Learning Argumentation Semantics
results: 经验证试验表明,该框架可以在论证计算中具有较高的性能,并且可以在人机对话中提供更加可靠的结果。Abstract
Argumentation is a very active research field of Artificial Intelligence concerned with the representation and evaluation of arguments used in dialogues between humans and/or artificial agents. Acceptability semantics of formal argumentation systems define the criteria for the acceptance or rejection of arguments. Several software systems, known as argumentation solvers, have been developed to compute the accepted/rejected arguments using such criteria. These include systems that learn to identify the accepted arguments using non-interpretable methods. In this paper we present a novel framework, which uses an Inductive Logic Programming approach to learn the acceptability semantics for several abstract and structured argumentation frameworks in an interpretable way. Through an empirical evaluation we show that our framework outperforms existing argumentation solvers, thus opening up new future research directions in the area of formal argumentation and human-machine dialogues.
摘要
争议是人工智能的一个非常活跃的研究领域,涉及对人类和/或人工代理人之间的对话中使用的论据的表示和评估。正式争议系统的 Acceptability semantics 定义了论据的接受或拒绝的标准。一些称为争议解决器的软件系统已经被开发出来计算使用这些标准来接受或拒绝论据。这些系统包括使用非可解释的方法来识别接受的论据的学习系统。在这篇论文中,我们提出了一种新的框架,使用逻辑编程方法来学习多种抽象和结构化争议框架的接受可能性,并在实验评估中证明了我们的框架可以在接受可能性评估方面超越现有的争议解决器,从而开启了新的未来研究方向在正式争议和人机对话领域。
Preference Optimization for Molecular Language Models
paper_authors: Ryan Park, Ryan Theisen, Navriti Sahni, Marcel Patek, Anna Cichońska, Rayees Rahman
for: 用于生成新的化学结构
methods: 使用直接偏好优化(DPO)进行微调
results: 该方法简单、高效,能有效地使生成的分子符合化学家的偏好。Abstract
Molecular language modeling is an effective approach to generating novel chemical structures. However, these models do not a priori encode certain preferences a chemist may desire. We investigate the use of fine-tuning using Direct Preference Optimization to better align generated molecules with chemist preferences. Our findings suggest that this approach is simple, efficient, and highly effective.
摘要
分子语言模型可以有效地生成新的化学结构。然而,这些模型并未先验地编码化学家可能期望的偏好。我们研究了使用直接偏好优化(Direct Preference Optimization)进行微调,以使生成的分子更好地符合化学家的偏好。我们发现这种方法简单、高效且非常有效。
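A minimal sketch of the Direct Preference Optimization objective referred to above, assuming per-sequence log-probabilities of chemist-preferred ("chosen") and dispreferred ("rejected") molecules under the fine-tuned policy and a frozen reference model have already been computed. It shows the standard DPO loss, not the authors' exact training setup; the beta value and batch size are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective on sequence-level log-probabilities.

    Each argument is a (batch,) tensor holding log p(sequence) summed over
    tokens for the trainable policy and the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the implicit reward margin between preferred and dispreferred molecules.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    b = 4  # toy batch of preference pairs
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print("DPO loss:", float(loss))
```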
Document-Level Language Models for Machine Translation
results: 在四种多样化的翻译任务上进行了全面的评估,并显示了substantially 提高的文档指向得分,同时也更加计算效率。但是,我们还发现,通过回译来获得更好的结果,但是需要重新训练翻译系统。此外,我们还探讨了大语言模型的混合,并发现可能在使用大语言模型时存在强大的潜在性。Abstract
Despite the known limitations, most machine translation systems today still operate on the sentence-level. One reason for this is, that most parallel training data is only sentence-level aligned, without document-level meta information available. In this work, we set out to build context-aware translation systems utilizing document-level monolingual data instead. This can be achieved by combining any existing sentence-level translation model with a document-level language model. We improve existing approaches by leveraging recent advancements in model combination. Additionally, we propose novel weighting techniques that make the system combination more flexible and significantly reduce computational overhead. In a comprehensive evaluation on four diverse translation tasks, we show that our extensions improve document-targeted scores substantially and are also computationally more efficient. However, we also find that in most scenarios, back-translation gives even better results, at the cost of having to re-train the translation system. Finally, we explore language model fusion in the light of recent advancements in large language models. Our findings suggest that there might be strong potential in utilizing large language models via model combination.
摘要
尽管存在已知的局限,当今大多数机器翻译系统仍然以句子为单位运行。其原因之一在于,大多数平行训练数据只有句子级对齐,缺乏文档级元信息。在这项工作中,我们转而利用文档级单语数据来构建上下文感知的翻译系统,这可以通过将任意现有的句子级翻译模型与文档级语言模型相结合来实现。我们借助模型组合方面的最新进展改进了现有方法。此外,我们提出了新的加权技巧,使系统组合更加灵活,并显著降低计算开销。在四种多样化的翻译任务上进行的全面评估表明,我们的扩展能够大幅提高面向文档的得分,且计算效率更高。然而,我们也发现,在大多数情况下,回译能够取得更好的结果,但代价是需要重新训练翻译系统。最后,我们结合大语言模型的最新进展探讨了语言模型融合,结果表明通过模型组合利用大语言模型可能具有很大潜力。
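The combination described above can be pictured as a simple log-linear fusion of a sentence-level translation model with a document-level language model when re-ranking candidate translations. The snippet below is a toy re-ranking sketch with made-up scoring functions and a single interpolation weight; the paper's actual weighting techniques are more elaborate.

```python
from typing import Callable, List

def rerank_with_document_lm(
    candidates: List[str],
    mt_score: Callable[[str], float],           # log P_MT(candidate | source sentence)
    doc_lm_score: Callable[[str, str], float],  # log P_LM(candidate | preceding target context)
    context: str,
    lm_weight: float = 0.3,
) -> str:
    """Pick the candidate maximizing a weighted sum of MT and document-LM scores."""
    def combined(c: str) -> float:
        return mt_score(c) + lm_weight * doc_lm_score(c, context)
    return max(candidates, key=combined)

if __name__ == "__main__":
    # Toy scorers: the sentence-level MT model is indifferent, while the
    # document LM prefers the candidate whose pronoun agrees with the context.
    context = "Mary picked up the book."
    candidates = ["He put it on the shelf.", "She put it on the shelf."]
    mt = lambda c: -10.0
    lm = lambda c, ctx: -2.0 if ("Mary" in ctx and c.startswith("She")) else -5.0
    print(rerank_with_document_lm(candidates, mt, lm, context, lm_weight=0.5))
```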
Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
methods: 这篇论文使用了一种新的二阶优化器 named Jorge,它通过简化预conditioning步骤,从而大大减少了计算成本,使其在GPU上实现高效。
results: 实验结果表明,Jorge可以与现有的优化器,如SGD、AdamW和Shampoo等比肩,并在多个深度学习模型上显示出更高的效率和性能。Abstract
Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds -- rapid convergence benefits of second-order methods, and high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.
摘要
尽管第二顺序优化器在深度学习中的更好的整合性,但由于计算成本高涨,使得它们在实际应用中较少使用。在这篇论文中,我们介绍了 Jorge,一种第二顺序优化器,它可以同时具有第一顺序优化器的快速整合和高效计算性。我们通过完全抛弃矩阵逆计算,使得 Jorge 在 GPU 上具有高效的墙 clock 时间。此外,我们还提出了一种确定 Jorge 的超参数的方法,通过对已经优化的 SGD 基线进行调整,以此减少调整努力。我们的实验表明,使用 Jorge 可以获得明显的优势,在多种深度学习模型上,在样本效率和墙 clock 时间两个方面都超过了状态元优化器,如 SGD、AdamW 和 Shampoo。
Fact-based Agent modeling for Multi-Agent Reinforcement Learning
results: 在多智能体粒子环境(MPE)中比基线方法更高效地提升智能体策略学习效率,并在复杂的竞争合作混合场景中获得更高的回报。Abstract
In multi-agent systems, agents need to interact and collaborate with other agents in environments. Agent modeling is crucial to facilitate agent interactions and make adaptive cooperation strategies. However, it is challenging for agents to model the beliefs, behaviors, and intentions of other agents in non-stationary environment where all agent policies are learned simultaneously. In addition, the existing methods realize agent modeling through behavior cloning which assume that the local information of other agents can be accessed during execution or training. However, this assumption is infeasible in unknown scenarios characterized by unknown agents, such as competition teams, unreliable communication and federated learning due to privacy concerns. To eliminate this assumption and achieve agent modeling in unknown scenarios, Fact-based Agent modeling (FAM) method is proposed in which fact-based belief inference (FBI) network models other agents in partially observable environment only based on its local information. The reward and observation obtained by agents after taking actions are called facts, and FAM uses facts as reconstruction target to learn the policy representation of other agents through a variational autoencoder. We evaluate FAM on various Multiagent Particle Environment (MPE) and compare the results with several state-of-the-art MARL algorithms. Experimental results show that compared with baseline methods, FAM can effectively improve the efficiency of agent policy learning by making adaptive cooperation strategies in multi-agent reinforcement learning tasks, while achieving higher returns in complex competitive-cooperative mixed scenarios.
摘要
在多代理系统中,代理需要在环境中与其他代理交互与协作。代理建模对于促进代理间交互和制定自适应协作策略至关重要。然而,在所有代理策略同时学习的非平稳环境中,对其他代理的信念、行为和意图进行建模是一项挑战。此外,现有方法通过行为克隆实现代理建模,假设在执行或训练中可以访问其他代理的局部信息。然而,在竞争队伍、通信不可靠以及出于隐私考虑的联邦学习等存在未知代理的场景中,这一假设并不可行。为了消除该假设并在未知场景中实现代理建模,我们提出了基于事实的代理建模(FAM)方法,其中基于事实的信念推断(FBI)网络仅依据自身局部信息,在部分可观测环境中对其他代理进行建模。代理执行动作后获得的奖励和观测被称为事实,FAM 以事实为重建目标,通过变分自编码器学习其他代理的策略表示。我们在多个多代理粒子环境(MPE)上对 FAM 进行了评估,并与多种最先进的 MARL 算法进行了比较。实验结果表明,与基线方法相比,FAM 能够通过制定自适应协作策略有效提升多代理强化学习任务中的策略学习效率,并在复杂的竞争-合作混合场景中获得更高的回报。
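A minimal sketch of the fact-reconstruction idea: a variational autoencoder encodes an agent's local observation into a latent representation of the other agents and decodes it to reconstruct the "facts" (next observation and reward). Dimensions, architecture, and the KL weight are invented for illustration and do not reproduce FAM's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactVAE(nn.Module):
    """Encode local observations; reconstruct observed facts (next obs, reward)."""
    def __init__(self, obs_dim: int, fact_dim: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, fact_dim))

    def forward(self, obs):
        h = self.encoder(obs)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.decoder(z), mu, logvar

def elbo_loss(recon, facts, mu, logvar, kl_weight=1e-3):
    recon_loss = F.mse_loss(recon, facts)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon_loss + kl_weight * kl

if __name__ == "__main__":
    torch.manual_seed(0)
    obs_dim, fact_dim = 20, 21          # fact = next local observation (20 dims) + reward (1 dim)
    model = FactVAE(obs_dim, fact_dim)
    obs, facts = torch.randn(64, obs_dim), torch.randn(64, fact_dim)
    recon, mu, logvar = model(obs)
    loss = elbo_loss(recon, facts, mu, logvar)
    loss.backward()                      # the latent mu could then condition the RL policy
    print("fact reconstruction loss:", float(loss))
```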
Enhancing the Performance of Automated Grade Prediction in MOOC using Graph Representation Learning
paper_authors: Soheila Farokhi, Aswani Yaramala, Jiangtao Huang, Muhammad F. A. Khan, Xiaojun Qi, Hamid Karimi for:The paper is written for the purpose of enhancing the performance of predictive machine learning models in student assignment grade prediction for MOOCs.methods:The paper uses graph embedding techniques to extract latent structural information encoded in the interactions between entities in the MOOC dataset, without requiring ground truth labels.results:The paper demonstrates that structural features can significantly improve the predictive performance of downstream assessment tasks, and the code and data are available in \url{https://github.com/DSAatUSU/MOOPer_grade_prediction}.Abstract
In recent years, Massive Open Online Courses (MOOCs) have gained significant traction as a rapidly growing phenomenon in online learning. Unlike traditional classrooms, MOOCs offer a unique opportunity to cater to a diverse audience from different backgrounds and geographical locations. Renowned universities and MOOC-specific providers, such as Coursera, offer MOOC courses on various subjects. Automated assessment tasks like grade and early dropout predictions are necessary due to the high enrollment and limited direct interaction between teachers and learners. However, current automated assessment approaches overlook the structural links between different entities involved in the downstream tasks, such as the students and courses. Our hypothesis suggests that these structural relationships, manifested through an interaction graph, contain valuable information that can enhance the performance of the task at hand. To validate this, we construct a unique knowledge graph for a large MOOC dataset, which will be publicly available to the research community. Furthermore, we utilize graph embedding techniques to extract latent structural information encoded in the interactions between entities in the dataset. These techniques do not require ground truth labels and can be utilized for various tasks. Finally, by combining entity-specific features, behavioral features, and extracted structural features, we enhance the performance of predictive machine learning models in student assignment grade prediction. Our experiments demonstrate that structural features can significantly improve the predictive performance of downstream assessment tasks. The code and data are available in \url{https://github.com/DSAatUSU/MOOPer_grade_prediction}
摘要
近年来,大规模在线开放课程(MOOC)在在线学习中得到了广泛的应用和发展。不同于传统的教室,MOOCs为不同背景和地理位置的学生提供了独特的学习机会。知名大学和MOOC专门提供者,如 Coursera,为多种主题的MOOC课程。由于大量报名和教师与学生之间的直接交互有限,因此自动评估任务如学生的评价和早期退出预测变得必要。然而,当前的自动评估方法忽略了学生和课程之间的结构关系。我们的假设是,这些结构关系,通过互动图表示出来,含有价值信息,可以提高任务的表现。为此,我们构建了一个大 MOOC 数据集的专用知识图,该图将在研究社区中公开。此外,我们利用图像技术来提取数据集中互动图中所隐藏的结构信息。这些技术不需要标注数据,可以用于多种任务。最后,我们将实体特征、行为特征和提取的结构特征相结合,提高预测机器学习模型的学生评价分数预测性能。我们的实验表明,结构特征可以显著提高下游评估任务的预测性能。代码和数据可以在 中找到。
An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
results: 我们通过图像生成、编辑和注意力可视化等方式进行了广泛的量化比较,demonstrating that our method can learn more semantically disentangled concepts with enhanced word-concept correlation。此外,我们还介绍了一个新的数据集和评价协议,专门为这种学习对象级概念的新任务。Abstract
Textual Inversion, a prompt learning method, learns a singular embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new "words" are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts.
摘要
文本反演(Textual Inversion)是一种提示学习方法,它为一个新的"词"学习单一嵌入,用以表示图像的风格和外观,从而可以将其融入自然语言句子中生成新的合成图像。然而,即使可以获得单个概念的嵌入,在一个场景中识别并整合多个对象级概念仍然面临重大挑战,我们的实验也进一步证实了这一点。为了解决该挑战,我们提出了多概念提示学习(MCPL)框架,从单个句子-图像对中同时学习多个新的"词"。为提高词与概念之间关联的准确性,我们提出了三种正则化技术:注意力掩码(AttnMask),使学习集中于相关区域;提示对比损失(PromptCL),分离不同概念的嵌入;以及绑定形容词(Bind adj.),将新的"词"与已知词关联。我们通过图像生成、编辑和注意力可视化在多样化图像上进行评估。大量定量比较表明,我们的方法能够学习语义上更加解耦的概念,并增强词与概念的关联。此外,我们还针对这一学习对象级概念的新任务引入了新的数据集和评估协议。
Tailoring Adversarial Attacks on Deep Neural Networks for Targeted Class Manipulation Using DeepFool Algorithm
results: 我们的实验表明,Targeted DeepFool算法可以在不同的深度神经网络架构上实现高效率和图像质量保持,而且可以增强模型的鲁棒性。 results show that one of the deep convolutional neural network architectures, AlexNet, and one of the state-of-the-art model Vision Transformer exhibit high robustness to getting fooled.Abstract
Deep neural networks (DNNs) have significantly advanced various domains, but their vulnerability to adversarial attacks poses serious concerns. Understanding these vulnerabilities and developing effective defense mechanisms is crucial. DeepFool, an algorithm proposed by Moosavi-Dezfooli et al. (2016), finds minimal perturbations to misclassify input images. However, DeepFool lacks a targeted approach, making it less effective in specific attack scenarios. Also, in previous related works, researchers primarily focus on success, not considering how much an image is getting distorted; the integrity of the image quality, and the confidence level to misclassifying. So, in this paper, we propose Targeted DeepFool, an augmented version of DeepFool that allows targeting specific classes for misclassification. We also introduce a minimum confidence score requirement hyperparameter to enhance flexibility. Our experiments demonstrate the effectiveness and efficiency of the proposed method across different deep neural network architectures while preserving image integrity as much as possible. Results show that one of the deep convolutional neural network architectures, AlexNet, and one of the state-of-the-art model Vision Transformer exhibit high robustness to getting fooled. Our code will be made public when publishing the paper.
摘要
深度神经网络(DNNs)在不同领域中得到了 significiant advancement,但它们受到了敌意攻击的威胁,这种威胁的存在对于理解和开发有效防御机制是非常重要。DeepFool算法,由Moosavi-Dezfooli et al.(2016)提出,可以在输入图像上发现微小的扰动,以让图像被误分类。然而,DeepFool算法缺乏目标化方法,这使得其在特定攻击enario下效果较差。此外,在先前的相关研究中,研究人员主要关注成功,而不是图像的纯度和误分类的信息量。因此,在这篇论文中,我们提出了Targeted DeepFool算法,这是对DeepFool算法的扩展,可以对特定的类进行误分类。我们还引入了最小信任分数的启用参数,以提高灵活性。我们的实验表明,提议的方法可以在不同的深度神经网络架构上进行效果和效率的混合,同时保持图像的纯度。结果显示,AlexNet和一种state-of-the-art模型Vision Transformer在深度神经网络架构上具有高度的抗攻击能力。我们将代码公开时出版论文。
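A hedged sketch of a targeted variant in the spirit described above: instead of stepping toward the nearest decision boundary, each iteration takes a DeepFool-style minimal step toward the boundary of a chosen target class, stopping once the model's confidence in that class exceeds a minimum threshold. This is an illustrative reconstruction under stated assumptions, not the authors' exact algorithm; the toy model and thresholds are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def targeted_deepfool(model: nn.Module, image: torch.Tensor, target: int,
                      min_confidence: float = 0.5, max_iter: int = 50,
                      overshoot: float = 0.02) -> torch.Tensor:
    """Find a small perturbation pushing `image` toward class `target`."""
    model.eval()
    perturbed = image.clone().detach()
    for _ in range(max_iter):
        x = perturbed.clone().detach().requires_grad_(True)
        logits = model(x)
        probs = F.softmax(logits, dim=1)
        current = int(logits.argmax(dim=1))
        if current == target and probs[0, target].item() >= min_confidence:
            break
        # If already classified as target but not confident enough,
        # widen the margin against the strongest competing class instead.
        competitor = current if current != target else int(logits[0].topk(2).indices[1])
        diff = logits[0, target] - logits[0, competitor]
        grad = torch.autograd.grad(diff, x)[0]
        # Minimal step (under a local linearization) to cross the pairwise boundary.
        step = (diff.abs() + 1e-6) / (grad.norm() ** 2 + 1e-12) * grad
        perturbed = (perturbed + (1 + overshoot) * step).detach()
    return perturbed

if __name__ == "__main__":
    torch.manual_seed(0)
    toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # stand-in classifier
    x = torch.randn(1, 3, 8, 8)
    adv = targeted_deepfool(toy_model, x, target=3)
    print("prediction after attack:", int(toy_model(adv).argmax(dim=1)))
    print("perturbation L2 norm:", float((adv - x).norm()))
```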
A Unified Approach to Domain Incremental Learning with Memory: Theory and Algorithm
For: 本研究旨在提出一个统一架构,以应对不同领域的渐进式学习问题,并且仅从先前领域中获取一小部分的数据(即记忆)进行学习。* Methods: 本研究提出了一个统一架构, named Unified Domain Incremental Learning (UDIL),它整合了多种现有的方法,并且通过在训练过程中适应不同的参数,以获得最紧密的一致 bound。* Results: 实验结果显示,UDIL 比先前的领域渐进式学习方法在both synthetic和实际数据集上表现更好,并且可以适应不同的领域。Abstract
Domain incremental learning aims to adapt to a sequence of domains with access to only a small subset of data (i.e., memory) from previous domains. Various methods have been proposed for this problem, but it is still unclear how they are related and when practitioners should choose one method over another. In response, we propose a unified framework, dubbed Unified Domain Incremental Learning (UDIL), for domain incremental learning with memory. Our UDIL unifies various existing methods, and our theoretical analysis shows that UDIL always achieves a tighter generalization error bound compared to these methods. The key insight is that different existing methods correspond to our bound with different fixed coefficients; based on insights from this unification, our UDIL allows adaptive coefficients during training, thereby always achieving the tightest bound. Empirical results show that our UDIL outperforms the state-of-the-art domain incremental learning methods on both synthetic and real-world datasets. Code will be available at https://github.com/Wang-ML-Lab/unified-continual-learning.
摘要
领域增量学习旨在仅凭来自先前领域的一小部分数据(即记忆)来适应一系列领域。针对该问题已有多种方法被提出,但它们之间的关系以及实践者应在何时选择哪种方法仍不清楚。为此,我们提出了一个统一框架,称为统一领域增量学习(UDIL),用于带记忆的领域增量学习。UDIL 统一了多种现有方法,我们的理论分析表明,与这些方法相比,UDIL 总能取得更紧的泛化误差上界。关键洞察在于:不同的现有方法对应于我们的上界在不同固定系数下的特例;基于这一统一带来的洞察,UDIL 允许在训练过程中使用自适应系数,从而始终取得最紧的上界。实验结果表明,UDIL 在合成数据集和真实数据集上都优于当前最先进的领域增量学习方法。代码将发布于 https://github.com/Wang-ML-Lab/unified-continual-learning。
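A structural sketch of the central ingredient, a weighted empirical risk over current-domain data and the memory of past domains with coefficients that change during training. The softmax parameterization of the coefficients and the fact that they are updated by the same loss are illustrative simplifications; in UDIL the coefficients are adapted to tighten a generalization bound rather than to simply minimize training loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_domain_loss(model, current_batch, memory_batch, alpha_logits):
    """Adaptive convex combination of current-domain and memory losses."""
    weights = F.softmax(alpha_logits, dim=0)        # (2,), learned alongside the model
    x_cur, y_cur = current_batch
    x_mem, y_mem = memory_batch
    loss_cur = F.cross_entropy(model(x_cur), y_cur)
    loss_mem = F.cross_entropy(model(x_mem), y_mem)
    return weights[0] * loss_cur + weights[1] * loss_mem

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(10, 3)
    alpha_logits = nn.Parameter(torch.zeros(2))      # adaptive mixing coefficients
    opt = torch.optim.SGD(list(model.parameters()) + [alpha_logits], lr=0.1)

    current = (torch.randn(32, 10), torch.randint(0, 3, (32,)))
    memory = (torch.randn(16, 10), torch.randint(0, 3, (16,)))   # small buffer of past domains

    for _ in range(5):
        opt.zero_grad()
        weighted_domain_loss(model, current, memory, alpha_logits).backward()
        opt.step()
    print("learned mixing weights:", F.softmax(alpha_logits, dim=0).tolist())
```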
Few-Shot In-Context Imitation Learning via Implicit Graph Alignment
results: 实验结果显示,我们的方法可以高效地完成几种实际生活中的每日任务,并在比较基eline上表现出色。视频可以在我们项目网站(https://www.robot-learning.uk/implicit-graph-alignment)中找到。Abstract
Consider the following problem: given a few demonstrations of a task across a few different objects, how can a robot learn to perform that same task on new, previously unseen objects? This is challenging because the large variety of objects within a class makes it difficult to infer the task-relevant relationship between the new objects and the objects in the demonstrations. We address this by formulating imitation learning as a conditional alignment problem between graph representations of objects. Consequently, we show that this conditioning allows for in-context learning, where a robot can perform a task on a set of new objects immediately after the demonstrations, without any prior knowledge about the object class or any further training. In our experiments, we explore and validate our design choices, and we show that our method is highly effective for few-shot learning of several real-world, everyday tasks, whilst outperforming baselines. Videos are available on our project webpage at https://www.robot-learning.uk/implicit-graph-alignment.
摘要
问题如下:给定一些对象的几个示例任务,如何使 robot 能够在新、未经见过的对象上完成同样的任务?这是因为对象类中的巨量对象关系使得推断任务相关关系 между新对象和示例对象困难。我们解决这个问题,通过将仿真学定义为对象图表示的条件对Alignment问题。因此,我们表明,这种conditioning允许机器人在示例后立即在新对象上进行任务,不需要对对象类或进一步训练。在我们的实验中,我们探索和验证我们的设计选择,并证明我们的方法高效地实现了几个真实世界、日常任务的少量学习,并超越基elines。视频可以在我们项目网站上找到:https://www.robot-learning.uk/implicit-graph-alignment。
An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes
results: 作者证明了该方法的可靠性与完备性,并在现有基准集和一个新的、更具挑战性的基准集上进行了比较,结果表明该方法在这些基准上比 state-of-the-art 方法更高效。Abstract
Algebraic data types (ADTs) are a construct classically found in functional programming languages that capture data structures like enumerated types, lists, and trees. In recent years, interest in ADTs has increased. For example, popular programming languages, like Python, have added support for ADTs. Automated reasoning about ADTs can be done using satisfiability modulo theories (SMT) solving, an extension of the Boolean satisfiability problem with constraints over first-order structures. Unfortunately, SMT solvers that support ADTs do not scale as state-of-the-art approaches all use variations of the same lazy approach. In this paper, we present an SMT solver that takes a fundamentally different approach, an eager approach. Specifically, our solver reduces ADT queries to a simpler logical theory, uninterpreted functions (UF), and then uses an existing solver on the reduced query. We prove the soundness and completeness of our approach and demonstrate that it outperforms the state-of-the-art on existing benchmarks, as well as a new, more challenging benchmark set from the planning domain.
摘要
代数数据类型(ADTs)是函数式编程语言中的一种经典构造,用于表示枚举类型、列表和树等数据结构。近年来,人们对 ADTs 的兴趣不断增加,例如 Python 等流行编程语言也加入了对 ADTs 的支持。对 ADTs 的自动推理可以通过可满足性模理论(SMT)求解来完成,它是对布尔可满足性问题在一阶结构约束上的扩展。遗憾的是,支持 ADTs 的 SMT 求解器扩展性不佳,因为最先进的方法都采用同一种惰性(lazy)方法的变体。在本文中,我们提出了一种采取根本不同思路的 SMT 求解器,即急切(eager)方法。具体而言,我们的求解器将 ADT 查询归约为一个更简单的逻辑理论,即未解释函数(UF),然后使用现有求解器处理归约后的查询。我们证明了该方法的可靠性与完备性,并展示其在现有基准集以及一个来自规划领域的新的、更具挑战性的基准集上均优于当前最先进方法。
Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
results: 这篇论文使用了D-Wave AQC处理 sintetic和实际数据,并证明了这种方法可以更好地识别歧义的解和数据点。Abstract
Adiabatic quantum computing (AQC) is a promising quantum computing approach for discrete and often NP-hard optimization problems. Current AQCs allow to implement problems of research interest, which has sparked the development of quantum representations for many machine learning and computer vision tasks. Despite requiring multiple measurements from the noisy AQC, current approaches only utilize the best measurement, discarding information contained in the remaining ones. In this work, we explore the potential of using this information for probabilistic balanced k-means clustering. Instead of discarding non-optimal solutions, we propose to use them to compute calibrated posterior probabilities with little additional compute cost. This allows us to identify ambiguous solutions and data points, which we demonstrate on a D-Wave AQC on synthetic and real data.
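A small sketch of the core post-processing idea: instead of keeping only the lowest-energy sample returned by the annealer, all measurements are turned into calibrated per-point assignment posteriors, here via a Boltzmann weighting of their energies. The temperature, the sample encoding, and the ambiguity threshold are illustrative choices; no D-Wave-specific API is assumed.

```python
import numpy as np

def assignment_posteriors(samples, energies, n_points, k, temperature=1.0):
    """Turn annealer samples into per-point cluster posteriors.

    samples  : (n_samples, n_points) integer cluster labels decoded from each measurement
    energies : (n_samples,) objective value (energy) of each measurement
    Returns  : (n_points, k) matrix of calibrated assignment probabilities.
    """
    energies = np.asarray(energies, dtype=float)
    # Boltzmann weights: lower-energy (better) solutions contribute more.
    logits = -(energies - energies.min()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()

    posteriors = np.zeros((n_points, k))
    for sample, w in zip(samples, weights):
        for point, label in enumerate(sample):
            posteriors[point, label] += w
    return posteriors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Pretend the annealer returned 5 labelings of 6 points into k=2 clusters.
    samples = rng.integers(0, 2, size=(5, 6))
    energies = rng.normal(size=5)
    post = assignment_posteriors(samples, energies, n_points=6, k=2)
    ambiguous = np.where(np.abs(post[:, 0] - post[:, 1]) < 0.3)[0]
    print("posteriors:\n", post.round(2))
    print("ambiguous points:", ambiguous)
```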
摘要
Fairer and More Accurate Tabular Models Through NAS
results: 研究发现,尝试单独优化模型的准确率可能会导致公正性问题,而同时优化模型的准确率和公正性可以共同优化模型的性能。Abstract
Making models algorithmically fairer in tabular data has been long studied, with techniques typically oriented towards fixes which usually take a neural model with an undesirable outcome and make changes to how the data are ingested, what the model weights are, or how outputs are processed. We employ an emergent and different strategy where we consider updating the model's architecture and training hyperparameters to find an entirely new model with better outcomes from the beginning of the debiasing procedure. In this work, we propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) in the first application to the very challenging domain of tabular data. We conduct extensive exploration of architectural and hyperparameter spaces (MLP, ResNet, and FT-Transformer) across diverse datasets, demonstrating the dependence of accuracy and fairness metrics of model predictions on hyperparameter combinations. We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns. We propose a novel approach that jointly optimizes architectural and training hyperparameters in a multi-objective constraint of both accuracy and fairness. We produce architectures that consistently Pareto dominate state-of-the-art bias mitigation methods either in fairness, accuracy or both, all of this while being Pareto-optimal over hyperparameters achieved through single-objective (accuracy) optimization runs. This research underscores the promise of automating fairness and accuracy optimization in deep learning models.
摘要
使深度学习模型更加公平在表格数据上进行研究已经很长时间了,通常采用的技术是对现有的神经网络模型进行修改,以改善数据入口方式、模型权重或输出处理方式。我们采用一种不同的策略,即对模型的建构和训练超参数进行更新,以找到一个从头开始的全新模型,以提高结果的公平性。在这项工作中,我们提出使用多目标神经网络搜索(NAS)和超参数优化(HPO)来优化模型的建构和超参数。我们在多个数据集上进行了广泛的建构和超参数空间的探索(包括MLP、ResNet和FT-Transformer),并证明了模型预测结果中的公平性和准确性指标之间的依赖关系。我们发现,通过solely使用NAS优化模型的准确性,通常无法自动解决公平性问题。我们提出了一种新的方法,即同时优化建构和超参数,以实现多目标约束中的准确性和公平性两个目标的 JOINT 优化。我们生成了一系列可以同时dominates state-of-the-art偏见缓解方法的建构,并且这些建构都是通过多目标约束来实现的。这些研究表明了自动化深度学习模型的公平性和准确性优化的推荐。
paper_authors: Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Erin Grant, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nori Jacoby, Qiuyi, Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O’Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, Thomas L. Griffiths
for: The paper aims to improve communication between research communities studying representational alignment in cognitive science, neuroscience, and machine learning, by proposing a unifying framework that can serve as a common language for these fields.
methods: The paper surveys the literature from these fields and demonstrates how prior work fits into the proposed framework.
results: The paper identifies open problems in representational alignment where progress can benefit all three fields, and hopes to catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems.
results: 论文指出了表征对齐中的开放问题,希望通过跨领域合作,加速所有研究信息处理系统的社区的进步。Abstract
Biological and artificial information processing systems form representations of the world that they can use to categorize, reason, plan, navigate, and make decisions. To what extent do the representations formed by these diverse systems agree? Can diverging representations still lead to the same behaviors? And how can systems modify their representations to better match those of another system? These questions pertaining to the study of representational alignment are at the heart of some of the most active research areas in contemporary cognitive science, neuroscience, and machine learning. Unfortunately, there is limited knowledge-transfer between research communities interested in representational alignment, and much of the progress in one field ends up being rediscovered independently in another, when greater cross-field communication would be advantageous. To improve communication between fields, we propose a unifying framework that can serve as a common language between researchers studying representational alignment. We survey the literature from the fields of cognitive science, neuroscience, and machine learning, and demonstrate how prior work fits into this framework. Finally, we lay out open problems in representational alignment where progress can benefit all three fields. We hope that our work can catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems. We note that this is a working paper and encourage readers to reach out with their suggestions for future revisions.
摘要
A comprehensible analysis of the efficacy of Ensemble Models for Bug Prediction
results: 我们的实验结果表明,集成AI模型可以在预测Java类库中存在bug的可能性方面超过单个AI模型的结果。我们还提供了因素的分析,以便更好地理解 ensemble AI 模型的性能提升的原因。Abstract
The correctness of software systems is vital for their effective operation. It makes discovering and fixing software bugs an important development task. The increasing use of Artificial Intelligence (AI) techniques in Software Engineering led to the development of a number of techniques that can assist software developers in identifying potential bugs in code. In this paper, we present a comprehensible comparison and analysis of the efficacy of two AI-based approaches, namely single AI models and ensemble AI models, for predicting the probability of a Java class being buggy. We used two open-source Apache Commons Project's Java components for training and evaluating the models. Our experimental findings indicate that the ensemble of AI models can outperform the results of applying individual AI models. We also offer insight into the factors that contribute to the enhanced performance of the ensemble AI model. The presented results demonstrate the potential of using ensemble AI models to enhance bug prediction results, which could ultimately result in more reliable software systems.
摘要
软件系统的正确性是其效果运行的关键。找到和修复软件漏洞是软件开发中重要的任务。随着人工智能(AI)技术在软件工程中的广泛应用,出现了一些可以帮助软件开发人员找到代码中潜在的漏洞的技术。在这篇论文中,我们提供了可读性比较和分析,探讨使用单个AI模型和 ensemble AI模型来预测Java类的可能性。我们使用了两个开源Apache Commons Project的Java组件来训练和测试模型。我们的实验结果表明, ensemble AI模型可以在应用单个AI模型的情况下出perform better。我们还提供了影响ensemble AI模型的表现的因素。该结果表明,使用ensemble AI模型可以提高漏洞预测结果,从而导致更可靠的软件系统。
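A minimal sketch of the comparison described above, using scikit-learn on synthetic data in place of the Apache Commons class metrics. The base learners and the soft-voting ensemble are illustrative choices, not the exact models evaluated in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for class-level software metrics with a "buggy" label.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Individual AI models.
singles = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
for name, clf in singles.items():
    clf.fit(X_tr, y_tr)
    print(f"{name:7s} AUC = {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")

# Soft-voting ensemble of the same base learners.
ensemble = VotingClassifier(estimators=list(singles.items()), voting="soft")
ensemble.fit(X_tr, y_tr)
print(f"ensemble AUC = {roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1]):.3f}")
```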
DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning
results: 论文通过使用 LLMs 生成和修改 ‘图文计划’(在一个规划者-审查者反馈循环中),以及使用 DiagramGLIGEN 和文本标签渲染模块来生成图文,实现了更高的准确率和质量。Abstract
Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows, connection lines, etc.). Existing state-of-the-art T2I models often fail at diagram generation because they lack fine-grained object layout control when many objects are densely connected via complex relations such as arrows/lines and also often fail to render comprehensible text labels. To address this gap, we present DiagrammerGPT, a novel two-stage text-to-diagram generation framework that leverages the layout guidance capabilities of LLMs (e.g., GPT-4) to generate more accurate open-domain, open-platform diagrams. In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop) which describe all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts. In the second stage, we use a diagram generator, DiagramGLIGEN, and a text label rendering module to generate diagrams following the diagram plans. To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a densely annotated diagram dataset built on top of the AI2D dataset. We show quantitatively and qualitatively that our DiagrammerGPT framework produces more accurate diagrams, outperforming existing T2I models. We also provide comprehensive analysis including open-domain diagram generation, vector graphic diagram generation in different platforms, human-in-the-loop diagram plan editing, and multimodal planner/auditor LLMs (e.g., GPT-4Vision). We hope our work can inspire further research on diagram generation via T2I models and LLMs.
摘要
TEXT-TO-IMAGE(T2I)生成技术在过去几年内有了很大的发展。然而,有很少的研究集中于使用 T2I 模型生成图文。图文是一种使用结构rich和空间复杂的视觉表示方式,用于展示信息(例如,密集的对象、文本标签、指向箭头、连接线等)。现有的 T2I 模型通常在图文生成中存在缺陷,因为它们缺乏细化的对象布局控制,特别是当多个对象密集连接并且有复杂的关系(如箭头/线)时。为解决这个漏洞,我们提出了 DiagrammerGPT,一种新的两stage T2I 生成框架。在第一stage中,我们使用 LLMs(例如 GPT-4)来生成和反复修改 '图文计划'(在计划-审查器反馈循环中),该计划描述了所有对象(包括物体和文本标签)、它们之间的关系(如箭头或线)以及它们的包围盒布局。在第二stage中,我们使用 DiagramGLIGEN 和文本标签渲染模块来生成图文,按照图文计划进行。为了评估 T2I 生成任务,我们提出了 AI2D-Caption,一个密集注释的图文数据集,建立在 AI2D 数据集之上。我们表明了量化和质量上,我们的 DiagrammerGPT 框架可以生成更加准确的图文,超过现有的 T2I 模型。此外,我们还提供了广泛的分析,包括开放平台图文生成、vector graphic diagram生成、人工循环图文计划编辑和多Modal LLMs(例如 GPT-4Vision)。我们希望我们的工作可以鼓励更多的研究人员通过 T2I 模型和 LLMs 来生成图文。
SHARCS: Efficient Transformers through Routing with Dynamic Width Sub-networks
results: SHARCS可以在保持精度水平的同时提高推理速度,实际测试中SHARCS可将推理速度提高2倍,而精度下降不显著。Abstract
We introduce SHARCS for adaptive inference that takes into account the hardness of input samples. SHARCS can train a router on any transformer network, enabling the model to direct different samples to sub-networks with varying widths. Our experiments demonstrate that: (1) SHARCS outperforms or complements existing per-sample adaptive inference methods across various classification tasks in terms of accuracy vs. FLOPs; (2) SHARCS generalizes across different architectures and can be even applied to compressed and efficient transformer encoders to further improve their efficiency; (3) SHARCS can provide a 2 times inference speed up at an insignificant drop in accuracy.
摘要
我们介绍SHARCS,一种适应推理的方法,考虑到输入样本的困难程度。SHARCS可以在任何transformer网络上训练路由器,让模型将不同的样本分配到不同宽度的子网络上。我们的实验表明:(1)SHARCS与现有的每个样本适应推理方法相比,在不同的分类任务中获得更高的精度和FLOPs的调整;(2)SHARCS可以适用于不同的架构,并且可以进一步改善压缩和高效的transformerEncoder的效率;(3)SHARCS可以提供2倍的推理速度,而无需对精度造成显著的损失。
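A toy sketch of per-sample width routing in the spirit of the description above: a small router scores each input and dispatches it to a narrow or a wide sub-network, so easy samples pay a lower compute cost. The router here is untrained and purely structural; SHARCS trains the router on the host network, which is not reproduced in this illustrative snippet.

```python
import torch
import torch.nn as nn

class WidthRoutedModel(nn.Module):
    """Route each sample to a narrow or a wide sub-network at inference time."""
    def __init__(self, in_dim=32, n_classes=5):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(), nn.Linear(16, 2))
        self.narrow = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
        self.wide = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, x):
        route = self.router(x).argmax(dim=1)           # 0 = narrow, 1 = wide
        out = torch.empty(x.size(0), self.narrow[-1].out_features, device=x.device)
        for idx, subnet in enumerate((self.narrow, self.wide)):
            mask = route == idx
            if mask.any():
                out[mask] = subnet(x[mask])             # only the chosen sub-network runs per sample
        return out, route

if __name__ == "__main__":
    torch.manual_seed(0)
    model = WidthRoutedModel()
    x = torch.randn(8, 32)
    logits, route = model(x)
    print("per-sample routes (0=narrow, 1=wide):", route.tolist())
    print("output shape:", tuple(logits.shape))
```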
A Cautionary Tale: On the Role of Reference Data in Empirical Privacy Defenses
paper_authors: Caelin G. Kaplan, Chuan Xu, Othmane Marfoq, Giovanni Neglia, Anderson Santana de Oliveira
for: This paper focuses on developing effective privacy-preserving machine learning methods that can provide satisfactory levels of training data privacy without significantly compromising model utility.
methods: The proposed method is based on an empirical risk minimization approach with a constraint on the generalization error, which is evaluated as a weighted empirical risk minimization (WERM) over the training and reference datasets.
results: The proposed method outperforms existing state-of-the-art empirical privacy defenses using reference data for nearly all relative privacy levels of reference and training data, and demonstrates the importance of considering the triad of model utility, training data privacy, and reference data privacy when comparing privacy defenses.Abstract
Within the realm of privacy-preserving machine learning, empirical privacy defenses have been proposed as a solution to achieve satisfactory levels of training data privacy without a significant drop in model utility. Most existing defenses against membership inference attacks assume access to reference data, defined as an additional dataset coming from the same (or a similar) underlying distribution as training data. Despite the common use of reference data, previous works are notably reticent about defining and evaluating reference data privacy. As gains in model utility and/or training data privacy may come at the expense of reference data privacy, it is essential that all three aspects are duly considered. In this paper, we first examine the availability of reference data and its privacy treatment in previous works and demonstrate its necessity for fairly comparing defenses. Second, we propose a baseline defense that enables the utility-privacy tradeoff with respect to both training and reference data to be easily understood. Our method is formulated as an empirical risk minimization with a constraint on the generalization error, which, in practice, can be evaluated as a weighted empirical risk minimization (WERM) over the training and reference datasets. Although we conceived of WERM as a simple baseline, our experiments show that, surprisingly, it outperforms the most well-studied and current state-of-the-art empirical privacy defenses using reference data for nearly all relative privacy levels of reference and training data. Our investigation also reveals that these existing methods are unable to effectively trade off reference data privacy for model utility and/or training data privacy. Overall, our work highlights the need for a proper evaluation of the triad model utility / training data privacy / reference data privacy when comparing privacy defenses.
摘要
在隐私保护机器学习领域,验证性隐私防御被提出为实现训练数据隐私的解决方案,而不导致模型性能下降。大多数现有的防御机制假设有访问参考数据,定义为训练数据所处的同一个(或类似)分布下的另一个数据集。尽管参考数据广泛使用,但前一些作品却不够明确地定义和评估参考数据隐私。因为获得模型性能和/或训练数据隐私的增进可能会导致参考数据隐私的损害,因此必须同时考虑这三个方面。在这篇论文中,我们首先检查参考数据的可用性和隐私处理方法,并证明其必要性以便比较防御机制。其次,我们提出一种基准防御方法,允许模型性能和训练数据隐私之间的利用率评估,并且可以通过将总体化风险最小化问题转化为权重加总风险最小化问题(WERM)来实现。虽然我们视WERM为简单的基准方法,但我们的实验表明,它在大多数参考数据隐私水平下能够超越目前最具有研究价值和状态艺术的Empirical Privacy防御方法。我们的调查也表明,这些现有方法无法有效地考虑参考数据隐私和训练数据隐私之间的贝叶率。总的来说,我们的工作强调了评估模型性能、训练数据隐私和参考数据隐私的三元模型在比较防御机制时的重要性。
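A minimal sketch of the weighted empirical risk minimization baseline described above: a single scalar weight trades off the loss on the private training set against the loss on the reference set. The weight value, model, and data are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def werm_loss(model, train_batch, reference_batch, train_weight: float = 0.7):
    """Weighted empirical risk over training data and reference data.

    A train_weight close to 1 favors fitting the private training set
    (more utility, less training-data privacy); close to 0 shifts the
    burden onto the reference set, trading away reference-data privacy.
    """
    x_tr, y_tr = train_batch
    x_ref, y_ref = reference_batch
    loss_tr = F.cross_entropy(model(x_tr), y_tr)
    loss_ref = F.cross_entropy(model(x_ref), y_ref)
    return train_weight * loss_tr + (1.0 - train_weight) * loss_ref

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(20, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    train = (torch.randn(64, 20), torch.randint(0, 2, (64,)))
    reference = (torch.randn(64, 20), torch.randint(0, 2, (64,)))
    for _ in range(10):
        opt.zero_grad()
        werm_loss(model, train, reference).backward()
        opt.step()
    print("final weighted risk:", float(werm_loss(model, train, reference)))
```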
DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification
results: 实验表明,所提出的方法可以带来显著的性能提升,最佳结果在CN-Celeb评测集上实现了14.6%的EER相对下降。Abstract
Data augmentation is vital to the generalization ability and robustness of deep neural networks (DNNs) models. Existing augmentation methods for speaker verification manipulate the raw signal, which are time-consuming and the augmented samples lack diversity. In this paper, we present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification, which can generate diversified training samples in speaker embedding space with negligible extra computing cost. Firstly, we augment training samples by perturbing speaker embeddings along semantic directions, which are obtained from speaker-wise covariance matrices. Secondly, accurate covariance matrices are estimated from robust speaker embeddings during training, so we introduce difficultyaware additive margin softmax (DAAM-Softmax) to obtain optimal speaker embeddings. Finally, we assume the number of augmented samples goes to infinity and derive a closed-form upper bound of the expected loss with DASA, which achieves compatibility and efficiency. Extensive experiments demonstrate the proposed approach can achieve a remarkable performance improvement. The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
摘要
数据增强对于深度神经网络(DNN)模型的泛化能力和鲁棒性至关重要。现有的说话人验证数据增强方法通常直接操作原始信号,耗时较多,且增强样本缺乏多样性。在本文中,我们提出了一种新的困难感知语义增强(DASA)方法,能够以几乎可以忽略的额外计算代价,在说话人嵌入空间中生成多样化的训练样本。首先,我们沿着由各说话人协方差矩阵得到的语义方向扰动说话人嵌入,以增强训练样本。其次,为了在训练中从鲁棒的说话人嵌入估计出准确的协方差矩阵,我们引入困难感知加性间隔 softmax(DAAM-Softmax)以获得最优的说话人嵌入。最后,我们假设增强样本数量趋于无穷,并推导出 DASA 期望损失的闭式上界,兼顾了兼容性与效率。大量实验表明,所提方法能带来显著的性能提升,最佳结果在 CN-Celeb 评估集上取得了 14.6% 的 EER 相对下降。
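A small numpy sketch of the augmentation step: speaker embeddings are perturbed along semantic directions drawn from per-speaker covariance matrices estimated from the data. The Gaussian sampling and the strength parameter are illustrative choices, not DASA's exact procedure.

```python
import numpy as np

def semantic_augment(embeddings, speaker_ids, strength=0.5, seed=0):
    """Perturb each embedding along directions drawn from its speaker's covariance."""
    rng = np.random.default_rng(seed)
    augmented = embeddings.copy()
    for spk in np.unique(speaker_ids):
        idx = np.where(speaker_ids == spk)[0]
        if len(idx) < 2:
            continue                                    # need >= 2 samples to estimate a covariance
        cov = np.cov(embeddings[idx], rowvar=False)      # speaker-wise covariance, shape (D, D)
        noise = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=len(idx))
        augmented[idx] = embeddings[idx] + strength * noise
    return augmented

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    emb = rng.normal(size=(100, 8))                      # 100 utterance embeddings of dimension 8
    spk = rng.integers(0, 5, size=100)                   # 5 speakers
    aug = semantic_augment(emb, spk)
    print("mean perturbation norm:", float(np.linalg.norm(aug - emb, axis=1).mean()))
```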
paper_authors: Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, Joel Lehman
for: 提高基础模型在定性任务上的性能
methods: 结合人工反馈来推导多样性度量
results: 比现有多样性算法提高自动多样性发现能力,并与人工定义多样性度量匹配搜索能力Abstract
Reinforcement learning from human feedback (RLHF) has exhibited the potential to enhance the performance of foundation models for qualitative tasks. Despite its promise, its efficacy is often restricted when conceptualized merely as a mechanism to maximize learned reward models of averaged human preferences, especially in areas such as image generation which demand diverse model responses. Meanwhile, quality diversity (QD) algorithms, dedicated to seeking diverse, high-quality solutions, are often constrained by the dependency on manually defined diversity metrics. Interestingly, such limitations of RLHF and QD can be overcome by blending insights from both. This paper introduces Quality Diversity through Human Feedback (QDHF), which employs human feedback for inferring diversity metrics, expanding the applicability of QD algorithms. Empirical results reveal that QDHF outperforms existing QD methods regarding automatic diversity discovery, and matches the search capabilities of QD with human-constructed metrics. Notably, when deployed for a latent space illumination task, QDHF markedly enhances the diversity of images generated by a Diffusion model. The study concludes with an in-depth analysis of QDHF's sample efficiency and the quality of its derived diversity metrics, emphasizing its promise for enhancing exploration and diversity in optimization for complex, open-ended tasks.
摘要
基于人类反馈的强化学习(RLHF)已展现出提升基础模型在定性任务上表现的潜力。尽管前景可观,但若仅将其视为最大化平均人类偏好所学奖励模型的机制,其效果往往受限,尤其是在图像生成等需要多样化模型响应的领域。与此同时,致力于寻找多样且高质量解的质量多样性(QD)算法,常常受制于人工定义的多样性度量。有趣的是,融合二者的思想可以克服 RLHF 与 QD 各自的局限。本文提出基于人类反馈的质量多样性(QDHF),利用人类反馈来推断多样性度量,从而扩展 QD 算法的适用范围。实验结果表明,QDHF 在自动多样性发现方面优于现有 QD 方法,并能达到与使用人工构建度量的 QD 相当的搜索能力。特别地,在潜在空间照明任务中,QDHF 显著提升了扩散模型生成图像的多样性。最后,本文深入分析了 QDHF 的样本效率及其推断出的多样性度量的质量,强调其在复杂、开放式任务优化中增强探索与多样性的潜力。
Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling
paper_authors: Yaqing Wang, Jialin Wu, Tanmaya Dabral, Jiageng Zhang, Geoff Brown, Chun-Ta Lu, Frederick Liu, Yi Liang, Bo Pang, Michael Bendersky, Radu Soricut
results: 这篇论文的结果显示,AdaLink在纯文本和多模态任务上均实现了与SoTA侵入式PEFT(LoRA)和全模型微调(FT)相当的竞争性表现。此外,这篇论文还进行了对不同训练方案和执行环境的测试,以确保AdaLink在实际应用中具有可靠性和稳定性。Abstract
Large language models (LLMs) and vision language models (VLMs) demonstrate excellent performance on a wide range of tasks by scaling up parameter counts from O(10^9) to O(10^{12}) levels and further beyond. These large scales make it impossible to adapt and deploy fully specialized models given a task of interest. Parameter-efficient fine-tuning (PEFT) emerges as a promising direction to tackle the adaptation and serving challenges for such large models. We categorize PEFT techniques into two types: intrusive and non-intrusive. Intrusive PEFT techniques directly change a model's internal architecture. Though more flexible, they introduce significant complexities for training and serving. Non-intrusive PEFT techniques leave the internal architecture unchanged and only adapt model-external parameters, such as embeddings for input. In this work, we describe AdaLink as a non-intrusive PEFT technique that achieves competitive performance compared to SoTA intrusive PEFT (LoRA) and full model fine-tuning (FT) on various tasks. We evaluate using both text-only and multimodal tasks, with experiments that account for both parameter-count scaling and training regime (with and without instruction tuning).
摘要
Position Interpolation Improves ALiBi Extrapolation
results: 位置插值显著提高了预训练模型在语言建模以及摘要和检索任务中的外推能力。Abstract
Linear position interpolation helps pre-trained models using rotary position embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi). We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream summarization and retrieval tasks.
摘要
线性位置插值可以帮助使用旋转位置嵌入(RoPE)的预训练模型外推到更长的序列长度。我们提议使用线性位置插值来扩展使用线性偏置注意力(ALiBi)的模型的外推范围。我们发现,位置插值显著提升了上游语言建模以及下游摘要和检索任务中的外推能力。
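A small sketch of the combination: the standard ALiBi bias, -slope * |i - j|, with relative distances linearly rescaled so that sequences longer than the training length are mapped back into the trained range. The scaling rule shown is the straightforward adaptation of linear position interpolation to ALiBi and is written here for illustration; the slope formula assumes a power-of-two head count.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int, train_len: int = 0) -> torch.Tensor:
    """ALiBi attention bias with optional linear position interpolation.

    Returns a (num_heads, seq_len, seq_len) tensor to be added to attention scores.
    If train_len > 0 and seq_len exceeds it, relative distances are scaled by
    train_len / seq_len so they stay within the range seen during training.
    """
    # Standard ALiBi head slopes: geometric sequence 2^(-8*(h+1)/num_heads).
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).abs().float()   # |i - j|
    if train_len and seq_len > train_len:
        distances = distances * (train_len / seq_len)                     # position interpolation
    return -slopes[:, None, None] * distances[None, :, :]

if __name__ == "__main__":
    bias = alibi_bias(seq_len=8, num_heads=4, train_len=4)
    print(bias.shape)        # torch.Size([4, 8, 8])
    print(bias[0, -1])       # distances compressed back into the trained range
```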
Unveiling the Siren’s Song: Towards Reliable Fact-Conflicting Hallucination Detection
for: The paper is written for evaluating the factuality of text generated by large language models (LLMs) and developing a benchmark for detecting fact-conflicting hallucinations in these models.
methods: The paper introduces a new benchmark called FactCHD, which assimilates a large-scale dataset of factuality patterns and incorporates fact-based chains of evidence to facilitate comprehensive factual reasoning. The authors also present a new method called TRUTH-TRIANGULATOR, which synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2 to yield more credible detection.
results: The paper demonstrates the effectiveness of the FactCHD benchmark and shows that current methods fall short of faithfully detecting factual errors. The authors also present results from using TRUTH-TRIANGULATOR, which shows improved detection performance compared to existing methods.Abstract
Large Language Models (LLMs), such as ChatGPT/GPT-4, have garnered widespread attention owing to their myriad of practical applications, yet their adoption has been constrained by issues of fact-conflicting hallucinations across web platforms. The assessment of factuality in text, produced by LLMs, remains inadequately explored, extending not only to the judgment of vanilla facts but also encompassing the evaluation of factual errors emerging in complex inferential tasks like multi-hop, and etc. In response, we introduce FactCHD, a fact-conflicting hallucination detection benchmark meticulously designed for LLMs. Functioning as a pivotal tool in evaluating factuality within "Query-Respons" contexts, our benchmark assimilates a large-scale dataset, encapsulating a broad spectrum of factuality patterns, such as vanilla, multi-hops, comparison, and set-operation patterns. A distinctive feature of our benchmark is its incorporation of fact-based chains of evidence, thereby facilitating comprehensive and conducive factual reasoning throughout the assessment process. We evaluate multiple LLMs, demonstrating the effectiveness of the benchmark and current methods fall short of faithfully detecting factual errors. Furthermore, we present TRUTH-TRIANGULATOR that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence. The benchmark dataset and source code will be made available in https://github.com/zjunlp/FactCHD.
摘要
大型语言模型(LLMs),如ChatGPT/GPT-4,在实际应用方面引起了广泛关注,但其普及受到了网络平台上的事实冲突报告的限制。评估 LLMS 生成的文本中的事实真实性仍然不充分探讨,包括单纯的事实以及在复杂的推理任务中出现的事实错误。为此,我们提出了 FactCHD,一个特别设计 для LLMS 的事实冲突报告 benchmark。作为评估“查询-回答”上的事实真实性的重要工具,我们的 benchmark 集成了一个大规模的数据集,包括多种事实真实性模式,如简单、多步、比较和集成模式。我们的 benchmark 的一个特点是通过 incorporating fact-based chains of evidence,以便在评估过程中进行全面和有利的事实理解。我们测试了多个 LLMS,并证明了我们的 benchmark 和现有方法无法准确检测事实错误。此外,我们还提出了 TRUTH-TRIANGULATOR,一种基于 tool-enhanced ChatGPT 和 LoRA-tuning 的 Llama2 的方法,以便通过合并预测结果和证据来提供更可靠的检测。我们的 benchmark 数据集和源代码将在 GitHub 上发布。
DHOT-GM: Robust Graph Matching Using A Differentiable Hierarchical Optimal Transport Framework
results: 我们通过对多个图匹配任务进行实验,发现我们的方法比现有方法更高效、更稳健。在匹配过程中,我们可以通过调整可微的分层最优传输距离来控制匹配的精度和稳健性。Abstract
Graph matching is one of the most significant graph analytic tasks in practice, which aims to find the node correspondence across different graphs. Most existing approaches rely on adjacency matrices or node embeddings when matching graphs, whose performances are often sub-optimal because of not fully leveraging the multi-modal information hidden in graphs, such as node attributes, subgraph structures, etc. In this study, we propose a novel and effective graph matching method based on a differentiable hierarchical optimal transport (HOT) framework, called DHOT-GM. Essentially, our method represents each graph as a set of relational matrices corresponding to the information of different modalities. Given two graphs, we enumerate all relational matrix pairs and obtain their matching results, and accordingly, infer the node correspondence by the weighted averaging of the matching results. This method can be implemented as computing the HOT distance between the two graphs -- each matching result is an optimal transport plan associated with the Gromov-Wasserstein (GW) distance between two relational matrices, and the weights of all matching results are the elements of an upper-level optimal transport plan defined on the matrix sets. We propose a bi-level optimization algorithm to compute the HOT distance in a differentiable way, making the significance of the relational matrices adjustable. Experiments on various graph matching tasks demonstrate the superiority and robustness of our method compared to state-of-the-art approaches.
摘要
图匹配是实践中最重要的图分析任务之一,旨在找到不同图中节点的对应关系。现有的大多数方法都基于邻接矩阵或节点嵌入进行匹配,由于未能充分利用图中隐藏的多模态信息(如节点属性、子图结构等),其性能往往不够理想。在这项研究中,我们提出了一种新的有效的图匹配方法,基于可微的层次最优传输(HOT)框架,称为 DHOT-GM。本方法将每个图表示为对应不同模态信息的一组关系矩阵。给定两个图,我们枚举所有的关系矩阵对并计算它们的匹配结果,再根据匹配结果的加权平均推断节点对应关系。这种方法可以视为计算两个图之间的 HOT 距离:每个匹配结果是与两个关系矩阵之间的 Gromov-Wasserstein(GW)距离相关联的最优传输方案,而所有匹配结果的权重则是定义在矩阵集合上的上层最优传输方案的元素。我们提出了一种双层优化算法,以可微的方式计算 HOT 距离,使关系矩阵的重要性可调。实验结果表明,我们的方法比现有方法更高效、更稳健。
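To make the hierarchical matching idea above more concrete, the following minimal sketch pairs relational matrices with Gromov-Wasserstein (GW) matchings and weights them with an upper-level transport plan. It assumes the POT library (`pip install pot`); the upper-level plan is obtained here with a single entropic Sinkhorn solve over the GW cost matrix rather than the paper's differentiable bi-level optimization, and helper names such as `relational_matrices` and `dhot_like_matching` are illustrative, not from the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def relational_matrices(adj, feats):
    """One graph as a set of relational matrices, one per modality:
    here the adjacency structure and a node-attribute similarity matrix."""
    return [adj, feats @ feats.T]

def dhot_like_matching(mats_s, mats_t):
    ns, nt = mats_s[0].shape[0], mats_t[0].shape[0]
    p, q = ot.unif(ns), ot.unif(nt)
    plans, costs = [], []
    # Lower level: one Gromov-Wasserstein matching per relational-matrix pair.
    for Cs in mats_s:
        for Ct in mats_t:
            plans.append(ot.gromov.gromov_wasserstein(Cs, Ct, p, q, loss_fun='square_loss'))
            costs.append(ot.gromov.gromov_wasserstein2(Cs, Ct, p, q, loss_fun='square_loss'))
    # Upper level: weight the pairwise results with a transport plan over the matrix sets
    # (an entropic Sinkhorn step over the GW cost matrix stands in for the bi-level scheme).
    C = np.array(costs).reshape(len(mats_s), len(mats_t))
    W = ot.sinkhorn(ot.unif(len(mats_s)), ot.unif(len(mats_t)), C / (C.max() + 1e-9), reg=0.1)
    # Node correspondence: weighted average of the lower-level transport plans.
    T_avg = sum(w * T for w, T in zip(W.reshape(-1), plans))
    return T_avg.argmax(axis=1)  # greedy readout of the correspondence

# Toy usage: two small random graphs with node attributes.
rng = np.random.default_rng(0)
A1 = (rng.random((6, 6)) > 0.6).astype(float); A1 = np.maximum(A1, A1.T)
A2 = (rng.random((6, 6)) > 0.6).astype(float); A2 = np.maximum(A2, A2.T)
X1, X2 = rng.random((6, 3)), rng.random((6, 3))
print(dhot_like_matching(relational_matrices(A1, X1), relational_matrices(A2, X2)))
```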
Black-Box Training Data Identification in GANs via Detector Networks
results: 我们在多种图像和表格数据集上,以及不同的攻击和 GAN 架构上,发现了非常有趣的隐私攻击。然而,与其他生成和分类模型相比,GAN 的攻击成功率仍然相对较低。这留下了一个有趣的问题:是 GAN 更加隐私,或者需要更强的攻击?Abstract
Since their inception, Generative Adversarial Networks (GANs) have been popular generative models across images, audio, video, and tabular data. In this paper we study whether, given access to a trained GAN as well as fresh samples from the underlying distribution, it is possible for an attacker to efficiently identify if a given point is a member of the GAN's training data. This is of interest both for reasons related to copyright, where a user may want to determine if their copyrighted data has been used to train a GAN, and in the study of data privacy, where the ability to detect training set membership is known as a membership inference attack. Unlike the majority of prior work, this paper investigates the privacy implications of using GANs in black-box settings, where the attack only has access to samples from the generator, rather than access to the discriminator as well. We introduce a suite of membership inference attacks against GANs in the black-box setting and evaluate our attacks on image GANs trained on the CIFAR10 dataset and tabular GANs trained on genomic data. Our most successful attack, called The Detector, involves training a second network to score samples based on their likelihood of being generated by the GAN, as opposed to a fresh sample from the distribution. We prove under a simple model of the generator that the detector is an approximately optimal membership inference attack. Across a wide range of tabular and image datasets, attacks, and GAN architectures, we find that adversaries can orchestrate non-trivial privacy attacks when provided with access to samples from the generator. At the same time, the attack success achievable against GANs still appears to be lower compared to other generative and discriminative models; this leaves the intriguing open question of whether GANs are in fact more private, or if it is a matter of developing stronger attacks.
摘要
自它们的出现以来,生成对抗网络(GANs)已成为图像、音频、视频和表格数据上广泛使用的生成模型。在这篇论文中,我们研究了给定一个已经训练过GAN的攻击者,以及新的样本从下面分布中获得的情况下,是否可以高效地判断一个点是否属于GAN的训练数据。这对于版权和数据隐私具有重要的意义,因为用户可能想要确定他们的版权数据是否被用来训练GAN,而且在数据隐私方面,能够检测训练集成员是一种称为会员推理攻击的能力。与大多数前期工作不同,本文 investigate GANs在黑盒设置下的隐私问题,攻击者只有Generator的样本而不具有Discriminator的访问权。我们介绍了一组黑盒成员推理攻击,并对图像GAN在CIFAR10数据集和表格GAN在生物数据集进行了评估。我们最成功的攻击方法叫做检测器,它通过训练一个第二个网络来评估样本是否由GAN生成,而不是一个新的样本从分布中。我们证明在一个简单的生成器模型下,检测器是一种相对优化的会员推理攻击。在各种图像和表格数据集、攻击和GAN架构下,我们发现攻击者可以通过Generator的样本进行非常复杂的隐私攻击。尽管GANs在隐私方面的攻击仍然比其他生成和判断模型低,但这仍然留下了一个惊喜的问题:GANs是否更安全,或者是需要更强的攻击。
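A rough sketch of the detector-style attack described in the abstract above: train a second classifier to separate generator samples from fresh samples of the underlying distribution, then use its score as a membership signal. It uses scikit-learn and synthetic tabular stand-ins for the data; the 0.5 decision threshold and the helper names are illustrative choices, not details taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_detector(gan_samples, fresh_samples):
    """Train a classifier to score how 'generator-like' a point is.
    Label 1 = sampled from the GAN generator, label 0 = fresh sample from the data distribution."""
    X = np.vstack([gan_samples, fresh_samples])
    y = np.concatenate([np.ones(len(gan_samples)), np.zeros(len(fresh_samples))])
    return GradientBoostingClassifier().fit(X, y)

def membership_scores(detector, candidates):
    # Higher score -> the candidate looks more like something the generator (over)fits,
    # which the attack treats as evidence of training-set membership.
    return detector.predict_proba(candidates)[:, 1]

# Toy usage with synthetic tabular data standing in for generator/fresh samples.
rng = np.random.default_rng(0)
gan_like = rng.normal(0.0, 0.9, size=(2000, 8))    # stand-in for generator output
fresh = rng.normal(0.0, 1.0, size=(2000, 8))       # stand-in for held-out real data
det = train_detector(gan_like, fresh)
candidates = rng.normal(0.0, 1.0, size=(5, 8))
flagged = membership_scores(det, candidates) > 0.5  # threshold chosen for illustration
print(flagged)
```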
Machine Learning-based Nutrient Application’s Timeline Recommendation for Smart Agriculture: A Large-Scale Data Mining Approach
results: 研究发现,基于天气和土壤特点进行调整的肥料应用量可以提高作物产量,同时降低肥料投用量。该方法也被证明可靠和可扩展。Abstract
This study addresses the vital role of data analytics in monitoring fertiliser applications in crop cultivation. Inaccurate fertiliser application decisions can lead to costly consequences, hinder food production, and cause environmental harm. We propose a solution to predict nutrient application by determining required fertiliser quantities for an entire season. The proposed solution recommends adjusting fertiliser amounts based on weather conditions and soil characteristics to promote cost-effective and environmentally friendly agriculture. The collected dataset is high-dimensional and heterogeneous. Our research examines large-scale heterogeneous datasets in the context of the decision-making process, encompassing data collection and analysis. We also study the impact of fertiliser applications combined with weather data on crop yield, using the winter wheat crop as a case study. By understanding local contextual and geographic factors, we aspire to stabilise or even reduce the demand for agricultural nutrients while enhancing crop development. The proposed approach is proven to be efficient and scalable, as it is validated using a real-world and large dataset.
摘要
The dataset used in this study is high-dimensional and heterogeneous, and we examine the impact of fertilizer applications combined with weather data on crop yield using the winter wheat crop as a case study. By understanding local contextual and geographic factors, we aim to stabilize or even reduce the demand for agricultural nutrients while enhancing crop development.Our proposed approach is efficient and scalable, as it is validated using a real-world and large dataset. This study demonstrates the potential of data analytics in optimizing fertilizer applications and promoting sustainable agriculture practices.
Is Channel Independent strategy optimal for Time Series Forecasting?
results: 这篇论文的实验结果显示,CSC策略可以提高CI策略的性能,同时减少参数的数量,例如在电力集成数据集上减少了10倍以上。CR方法也可以与基准模型竞争。此外,论文还讨论了是否使用历史时间序列中的同一个通道的历史值来预测未来值。Abstract
There has been an emergence of various models for long-term time series forecasting. Recent studies have demonstrated that a single linear layer, using Channel Dependent (CD) or Channel Independent (CI) modeling, can even outperform a large number of sophisticated models. However, current research primarily considers CD and CI as two complementary yet mutually exclusive approaches, unable to harness these two extremes simultaneously. It is also a challenging issue that both CD and CI are static strategies that cannot be determined to be optimal for a specific dataset without extensive experiments. In this paper, we reconsider whether the current CI strategy is the best solution for time series forecasting. First, we propose a simple yet effective strategy called CSC, which stands for $\mathbf{C}$hannel $\mathbf{S}$elf-$\mathbf{C}$lustering strategy, for linear models. Our Channel Self-Clustering (CSC) improves on the CI strategy's performance while reducing parameter size, for example by over 10 times on the electricity dataset, and significantly cutting training time. Second, we further propose Channel Rearrangement (CR), a method for deep models inspired by the self-clustering. CR attains competitive performance against baselines. Finally, we also discuss whether it is best to forecast the future values using the historical values of the same channel as inputs. We hope our findings and methods could inspire new solutions beyond CD/CI.
摘要
近年来出现了多种用于长期时间序列预测的模型。最近的研究表明,一个单一的线性层,采用通道依赖(CD)或通道独立(CI)的建模方式,甚至可以超越许多复杂的模型。然而,当前的研究主要将 CD 和 CI 视为两种互补但相互排斥的方法,无法同时利用这两个极端。此外,CD 和 CI 都是静态策略,在不进行大量实验的情况下无法确定哪种对特定数据集最优。在这篇论文中,我们重新审视当前的 CI 策略是否是时间序列预测的最佳方案。首先,我们针对线性模型提出了一种简单而有效的策略,称为通道自聚类($\mathbf{C}$hannel $\mathbf{S}$elf-$\mathbf{C}$lustering,CSC)策略。CSC 策略可以在提升 CI 策略性能的同时减小参数规模,例如在电力数据集上减少了 10 倍以上,并显著缩短训练时间。其次,我们还提出了一种受自聚类启发的深度模型方法,称为通道重排(CR)策略,CR 策略可以取得与基线相当的性能。最后,我们还讨论了是否应该仅使用同一通道的历史值作为输入来预测其未来值。我们希望我们的发现和方法能够启发超越 CD/CI 的新方案。
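The abstract leaves the clustering details to the paper, but the parameter-sharing idea behind CSC can be illustrated roughly as follows: group channels into clusters and fit one shared lookback-to-horizon linear map per cluster, sitting between CI (one model per channel) and CD (one model for all channels). The k-means-over-raw-histories criterion below is only a stand-in assumption; the paper's actual self-clustering procedure may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def csc_forecasters(series, lookback, horizon, n_clusters):
    """series: (T, C) multivariate history. Cluster the channels, then fit one shared
    linear map (lookback -> horizon) per cluster of channels."""
    T, C = series.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(series.T)
    models = {}
    for k in range(n_clusters):
        X, Y = [], []
        for c in np.where(labels == k)[0]:
            for t in range(T - lookback - horizon + 1):
                X.append(series[t:t + lookback, c])
                Y.append(series[t + lookback:t + lookback + horizon, c])
        models[k] = LinearRegression().fit(np.array(X), np.array(Y))
    return labels, models

def csc_predict(series, labels, models, lookback):
    window = series[-lookback:]
    return np.stack([models[labels[c]].predict(window[:, c][None])[0]
                     for c in range(series.shape[1])], axis=1)

# Toy usage on a random-walk dataset with 8 channels.
rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=(500, 8)), axis=0)
labels, models = csc_forecasters(data, lookback=48, horizon=12, n_clusters=3)
print(csc_predict(data, labels, models, lookback=48).shape)  # (12, 8)
```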
A General Theoretical Paradigm to Understand Learning from Human Preferences
paper_authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
for: 本文旨在理解现代学习从人类偏好中学习(RLHF)的实际算法。
methods: 本文使用直接偏好优化(DPO)方法,并进行了深入的理论分析。
results: 研究发现,使用新的通用目标函数$\Psi$PO可以减少RLHF和DPO中的两个重要假设,并且可以提供更多的性能保证。此外,在一些示例中,使用 $\Psi$PO 可以实现更高的效率和更好的表现。Abstract
The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learn directly a policy from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.
摘要
现有的基于人类反馈的强化学习(RLHF)的广泛部署依赖于两个重要的近似:第一个假设成对偏好可以被替换为逐点奖励;第二个假设基于这些逐点奖励训练的奖励模型可以从收集的数据泛化到策略采样得到的分布外数据。最近,直接偏好优化(Direct Preference Optimization,DPO)被提出,它不需要奖励模型的训练阶段,直接从收集的数据中学习策略。然而,这个方法仍然严重依赖第一个近似。在这篇论文中,我们尝试对这些实际算法进行更深入的理论理解。我们推导出一个新的通用目标函数,称为 $\Psi$PO,它直接以成对偏好表示从人类偏好中学习的问题,因此绕过了这两个近似。这个新的通用目标函数允许我们对 RLHF 和 DPO(作为 $\Psi$PO 的特殊情况)进行深入分析,并识别它们的潜在弱点。然后,我们考虑了将 $\Psi$ 设为恒等函数的特殊情况,对它可以推导出一个高效的优化过程,证明性能保证,并在一些示例中展示其相对于 DPO 的经验优势。
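For the Ψ = identity special case mentioned above, one common reading is a squared regression loss on the policy/reference log-ratio gap between preferred and dispreferred responses, which avoids DPO's tendency to push that gap without bound. The sketch below follows that reading with PyTorch; the exact constant and formulation should be checked against the paper.

```python
import torch

def psi_identity_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Pairwise-preference loss for the Psi = identity case (hedged reading):
    push the policy/reference log-ratio gap between the preferred (w) and
    dispreferred (l) responses towards 1/(2*tau) instead of maximizing it."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

# Toy usage with made-up sequence log-probabilities (sums over tokens).
logp_w = torch.tensor([-12.3, -9.8], requires_grad=True)   # preferred responses
logp_l = torch.tensor([-13.1, -11.2], requires_grad=True)  # dispreferred responses
ref_w = torch.tensor([-12.0, -10.0])                        # reference-model log-probs
ref_l = torch.tensor([-12.5, -10.9])
loss = psi_identity_loss(logp_w, logp_l, ref_w, ref_l)
loss.backward()
print(float(loss))
```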
SegmATRon: Embodied Adaptive Semantic Segmentation for Indoor Environment
results: 研究表明,通过使用代理人的行动在室内环境中获取更多图像,可以提高 semantic segmentation 的质量。
for: The paper proposes an adaptive transformer model for embodied image semantic segmentation.
methods: The paper uses a hybrid multicomponent loss function to adapt the model weights during inference on multiple images.
results: The study shows that obtaining additional images using the agent’s actions in an indoor environment can improve the quality of semantic segmentation.Abstract
This paper presents an adaptive transformer model named SegmATRon for embodied image semantic segmentation. Its distinctive feature is the adaptation of model weights during inference on several images using a hybrid multicomponent loss function. We studied this model on datasets collected in the photorealistic Habitat and the synthetic AI2-THOR Simulators. We showed that obtaining additional images using the agent's actions in an indoor environment can improve the quality of semantic segmentation. The code of the proposed approach and datasets are publicly available at https://github.com/wingrune/SegmATRon.
摘要
这篇论文提出了一种自适应 Transformer 模型,名为 SegmATRon,用于具身图像语义分割。它的特点是在对多张图像进行推理的过程中,使用混合多组件损失函数对模型权重进行自适应调整。我们在照片级真实感的 Habitat 模拟器和合成的 AI2-THOR 模拟器所采集的数据集上进行了研究,并证明了通过智能体在室内环境中的动作获取更多图像可以提高语义分割的质量。代码和数据集可以在 https://github.com/wingrune/SegmATRon 上公开获取。
Multi-view Contrastive Learning for Entity Typing over Knowledge Graphs
results: compared to the state-of-the-art, MCLET 方法在实验中显示出了强大的表现。Abstract
Knowledge graph entity typing (KGET) aims at inferring plausible types of entities in knowledge graphs. Existing approaches to KGET focus on how to better encode the knowledge provided by the neighbors and types of an entity into its representation. However, they ignore the semantic knowledge provided by the way in which types can be clustered together. In this paper, we propose a novel method called Multi-view Contrastive Learning for knowledge graph Entity Typing (MCLET), which effectively encodes the coarse-grained knowledge provided by clusters into entity and type embeddings. MCLET is composed of three modules: i) Multi-view Generation and Encoder module, which encodes structured information from entity-type, entity-cluster and cluster-type views; ii) Cross-view Contrastive Learning module, which encourages different views to collaboratively improve view-specific representations of entities and types; iii) Entity Typing Prediction module, which integrates multi-head attention and a Mixture-of-Experts strategy to infer missing entity types. Extensive experiments show the strong performance of MCLET compared to the state-of-the-art
摘要
知识图实体类型推断(KGET)的目标在于推断知识图中实体的可能类型。现有的 KGET 方法主要关注如何更好地将实体的邻居和类型所提供的知识编码到其表示中,但它们忽略了类型可以聚类在一起这一方式所提供的语义知识。本文提出了一种新的方法,称为面向知识图实体类型推断的多视图对比学习(MCLET),它可以有效地将簇所提供的粗粒度知识编码到实体和类型的嵌入中。MCLET 包括以下三个模块:1. 多视图生成和编码模块(Multi-view Generation and Encoder module):编码实体-类型、实体-簇和簇-类型三种视图的结构信息。2. 交叉视图对比学习模块(Cross-view Contrastive Learning module):鼓励不同视图协同改进各视图特定的实体和类型表示。3. 实体类型预测模块(Entity Typing Prediction module):结合多头注意力和 Mixture-of-Experts 策略来推断缺失的实体类型。广泛的实验表明,MCLET 相比当前最佳方法表现出了强大的性能。
paper_authors: Abhishek Vivekanandan, Ahmed Abouelazm, Philip Schörner, J. Marius Zöllner
for: 预测交通actor的运动轨迹,以实现自动驾驶车辆大规模部署。
methods: 引入非 Parametric 剪除层和注意力层,以整合定义的知识优化。
results: 实现了遵循物理法律和驾驶环境几何学的轨迹预测,提供了安全可靠的运动预测结果,是实现自动驾驶车辆安全有效的关键。Abstract
Accurately forecasting the motion of traffic actors is crucial for the deployment of autonomous vehicles at a large scale. Current trajectory forecasting approaches primarily concentrate on optimizing a loss function with a specific metric, which can result in predictions that do not adhere to physical laws or violate external constraints. Our objective is to incorporate explicit knowledge priors that allow a network to forecast future trajectories in compliance with both the kinematic constraints of a vehicle and the geometry of the driving environment. To achieve this, we introduce a non-parametric pruning layer and attention layers to integrate the defined knowledge priors. Our proposed method is designed to ensure reachability guarantees for traffic actors in both complex and dynamic situations. By conditioning the network to follow physical laws, we can obtain accurate and safe predictions, essential for maintaining autonomous vehicles' safety and efficiency in real-world settings.In summary, this paper presents concepts that prevent off-road predictions for safe and reliable motion forecasting by incorporating knowledge priors into the training process.
摘要
对于自动驾驶车辆的大规模部署而言,准确预测交通参与者的运动至关重要。当前的轨迹预测方法主要集中在优化特定度量下的损失函数,可能会导致预测不符合物理规律或违反外部约束。我们的目标是将显式的知识先验纳入网络中,使其预测的未来轨迹同时遵循车辆的运动学约束和驾驶环境的几何结构。为实现这一目标,我们引入了非参数剪枝层和注意力层,以整合所定义的知识先验。我们提出的方法旨在保证交通参与者在复杂和动态情况下的可达性。通过使网络遵循物理规律,我们可以获得准确且安全的预测结果,这是保持自动驾驶车辆在实际场景中安全高效运行的关键。总之,本文提出了通过在训练过程中纳入知识先验来避免偏离道路的预测,从而实现安全可靠的轨迹预测。
Sociotechnical Safety Evaluation of Generative AI Systems
paper_authors: Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, William Isaac
for: 评估生成AI系统的安全性
methods: 提出三层框架,包括能力评估、系统安全原则和人类互动的评估
results: 发现现有评估缺陷,并提出了解决方案,包括实践步骤和不同角色的责任Abstract
Generative AI systems produce a range of risks. To ensure the safety of generative AI systems, these risks must be evaluated. In this paper, we make two main contributions toward establishing such evaluations. First, we propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks. This framework encompasses capability evaluations, which are the main current approach to safety evaluation. It then reaches further by building on system safety principles, particularly the insight that context determines whether a given capability may cause harm. To account for relevant context, our framework adds human interaction and systemic impacts as additional layers of evaluation. Second, we survey the current state of safety evaluation of generative AI systems and create a repository of existing evaluations. Three salient evaluation gaps emerge from this analysis. We propose ways forward to closing these gaps, outlining practical steps as well as roles and responsibilities for different actors. Sociotechnical safety evaluation is a tractable approach to the robust and comprehensive safety evaluation of generative AI systems.
摘要
生成AI系统的应用涉及到一系列的风险。为确保生成AI系统的安全,这些风险必须进行评估。在这篇论文中,我们提出了两个主要贡献,以便建立这些评估。首先,我们提议一种三层结构的框架,它采用一种结构化的社会技术方法来评估这些风险。这个框架包括功能评估,这是当前主要的安全评估方法。然后,我们又基于系统安全原则,尤其是认为上下文决定了一个给定的功能是否会带来害。为了考虑相关的上下文,我们的框架添加了人机交互和系统影响作为其他两个层次评估。其次,我们对生成AI系统的安全评估状况进行了调查,并创建了一个库存的评估。三个突出的评估漏洞出现在这种分析中。我们提出了方法来填充这些漏洞,并详细介绍了不同角色的角色和责任。社会技术安全评估是一种可行的方法,以确保生成AI系统的安全评估是全面和可靠的。
InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation
results: InfoDiffusion 在生成质量和多样性方面表现出色,同时 sampling efficiency 也高于基eline模型。Abstract
Diffusion models have garnered considerable interest in the field of text generation. Several studies have explored text diffusion models with different structures and applied them to various tasks, including named entity recognition and summarization. However, there exists a notable disparity between the "easy-first" text generation process of current diffusion models and the "keyword-first" natural text generation process of humans, which has received limited attention. To bridge this gap, we propose InfoDiffusion, a non-autoregressive text diffusion model. Our approach introduces a "keyinfo-first" generation strategy and incorporates a noise schedule based on the amount of text information. In addition, InfoDiffusion combines self-conditioning with a newly proposed partially noising model structure. Experimental results show that InfoDiffusion outperforms the baseline model in terms of generation quality and diversity, as well as exhibiting higher sampling efficiency.
摘要
Diffusion 模型在文本生成领域已经引起了广泛的关注。多项研究已经探索了不同结构的文本 diffusion 模型,并将其应用于命名实体识别和摘要生成等任务。然而,现有 diffusion 模型"先易后难"的文本生成过程与人类"先关键词"的自然文本生成过程之间存在显著差距,这一点很少受到关注。为了弥合这一差距,我们提出了 InfoDiffusion,一种非自回归文本 diffusion 模型。我们的方法采用"关键信息优先"的生成策略,并基于文本信息量设计噪声调度。此外,InfoDiffusion 还结合了自条件化(self-conditioning)和一种新提出的部分加噪模型结构。实验结果表明,InfoDiffusion 在生成质量和多样性方面都超过了基线模型,同时也显示了更高的采样效率。
Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
for: This paper aims to improve the stability and generalization of AI assistants based on language models (LLMs) by proposing a novel approach to Reinforcement Learning from Human Feedback (RLHF).
methods: The proposed approach uses a combination of data classification and adaptive exploration to learn a consistent policy across various domains. It deliberately maximizes performance variance and allocates more learning capacity to challenging data.
results: The experimental results show that the proposed approach significantly enhances training stability and model generalization, outperforming traditional RL methods that exploit shortcuts and overlook challenging samples.Abstract
The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, there's a growing expectation for them to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
摘要
基于语言模型(LLM)的 AI 助手的成功关键在于基于人类反馈的强化学习(RLHF),它使生成的回复更符合人类偏好。作为通用的 AI 助手,人们越来越期望它们在各种领域中表现一致。然而,已有工作表明,强化学习(RL)常常利用捷径获得高奖励,而忽略有挑战性的样本。这种对快速获取奖励的专注既损害了训练的稳定性,也削弱了模型在新的、未见过的数据上的泛化能力。在这项工作中,我们提出了一种新方法,可以通过 RL 在不同的数据组或领域中学习一致的策略。鉴于获取组注释的困难,我们的方法会自动将数据分类为不同的组,并刻意最大化组间的性能差异。然后,我们优化策略,使其在有挑战性的组上表现良好。最后,我们利用已划分的组自适应地调整探索空间,将更多的学习能力分配给更具挑战性的数据,避免模型在简单数据上过度优化。实验结果表明,我们的方法可以显著提高训练稳定性和模型泛化能力。
A Multi-Scale Decomposition MLP-Mixer for Time Series Analysis
results: 这篇论文的实验结果显示,使用MSD-Mixer方法可以在不同的时间序列分析任务(长期和短期预测、填充、侦测异常和分类)中,与其他现有的任务特定和任务通用方法相比,实现了更好的性能。Abstract
Time series data, often characterized by unique composition and complex multi-scale temporal variations, requires special consideration of decomposition and multi-scale modeling in its analysis. Existing deep learning methods on this best fit to only univariate time series, and have not sufficiently accounted for sub-series level modeling and decomposition completeness. To address this, we propose MSD-Mixer, a Multi-Scale Decomposition MLP-Mixer which learns to explicitly decompose the input time series into different components, and represents the components in different layers. To handle multi-scale temporal patterns and inter-channel dependencies, we propose a novel temporal patching approach to model the time series as multi-scale sub-series, i.e., patches, and employ MLPs to mix intra- and inter-patch variations and channel-wise correlations. In addition, we propose a loss function to constrain both the magnitude and autocorrelation of the decomposition residual for decomposition completeness. Through extensive experiments on various real-world datasets for five common time series analysis tasks (long- and short-term forecasting, imputation, anomaly detection, and classification), we demonstrate that MSD-Mixer consistently achieves significantly better performance in comparison with other state-of-the-art task-general and task-specific approaches.
摘要
时间序列数据通常具有独特的构成和复杂的多尺度时间变化,因此在分析中需要特别考虑分解和多尺度建模。现有的深度学习方法大多只适用于单变量时间序列,并未充分考虑子序列层面的建模和分解的完整性。为解决这个问题,我们提出了 MSD-Mixer,一种多尺度分解 MLP-Mixer,可以显式地将输入时间序列分解成不同的组成部分,并将这些部分在不同层中表示。同时,我们提出了一种新的时间分块方法,将时间序列建模为多尺度的子序列(即分块),并使用 MLP 来混合分块内部和分块之间的变化以及通道之间的相关性。此外,我们还提出了一种同时约束分解残差的幅度和自相关性的损失函数,以确保分解的完整性。经过对多个真实世界数据集上五种常见时间序列分析任务(长期和短期预测、填充、异常检测和分类)的广泛实验,我们展示了 MSD-Mixer 相比其他当前最佳的通用和任务特定方法都具有显著更好的表现。
Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition
paper_authors: Andrés Tello, Victoria Degeler, Alexander Lazovik
for: The paper aims to raise awareness about the issue of accuracy overestimation in Human Activity Recognition (HAR) studies due to biased data segmentation and evaluation methods.
methods: The paper uses sliding windows for data segmentation and standard random k-fold cross validation, which are common approaches in state-of-the-art HAR studies, but can lead to biased results.
results: The paper shows that these biased evaluation methods overestimate accuracy, making the lower accuracies obtained with correct, unbiased methods harder to publish, and that the problem persists independently of the method or dataset used.
results: 这篇论文显示,这些有偏的评估方法会导致准确率被高估,而使用正确的无偏方法得到的较低准确率则更难在科学期刊上发表。Abstract
Today, there are standard and well established procedures within the Human Activity Recognition (HAR) pipeline. However, some of these conventional approaches lead to accuracy overestimation. In particular, sliding windows for data segmentation followed by standard random k-fold cross validation produce biased results. An analysis of previous literature and present-day studies, surprisingly, shows that these are common approaches in state-of-the-art studies on HAR. It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked. Otherwise, the publication of biased results makes papers that report the lower accuracies obtained with correct, unbiased methods harder to publish. Several experiments with different types of datasets and different types of classification models allow us to exhibit the problem and show that it persists independently of the method or dataset.
摘要
如今,人体活动识别(HAR)流程中已有标准且成熟的程序。然而,其中一些传统方法会导致准确率被高估。具体来说,使用滑动窗口进行数据分割,再配合标准的随机 k 折交叉验证,会产生有偏的结果。对已有文献和当前研究的分析出人意料地表明,这些正是当前最先进 HAR 研究中的常用方法。重要的是,需要提高科学界对这一问题的认识,因为其负面影响正被忽视。否则,有偏结果的发表将使采用正确无偏方法、报告较低准确率的论文更难发表。我们使用不同类型的数据集和不同类型的分类模型进行了多项实验,证明了这一问题的存在,并表明它与所用方法或数据集无关。
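A small, self-contained illustration of the evaluation bias discussed above: when overlapping sliding windows are split with random k-fold cross-validation, near-duplicate windows land in both train and test folds and accuracy is overestimated, whereas subject-wise (group) splits do not leak. The toy data generator below is ours, purely for demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

# Toy HAR-style data: 15 subjects x 6 activity bouts x 20 overlapping windows.
# Windows from the same bout are near-duplicates, which is what sliding windows produce.
rng = np.random.default_rng(0)
X, y, groups = [], [], []
for subj in range(15):
    for bout in range(6):
        center = rng.normal(0, 1, size=30)               # bout-specific signal
        label = rng.integers(0, 4)
        for _ in range(20):
            X.append(center + rng.normal(0, 0.1, size=30))  # near-duplicate windows
            y.append(label)
            groups.append(subj)
X, y, groups = np.array(X), np.array(y), np.array(groups)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Biased protocol: random k-fold scatters near-duplicate windows across train/test.
biased = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Sounder protocol: subject-wise splits keep every window of a subject on one side.
unbiased = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=groups)

print(f"random k-fold: {biased.mean():.2f}  subject-wise: {unbiased.mean():.2f}")
```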
A Benchmark for Semi-Inductive Link Prediction in Knowledge Graphs
for: 本文旨在提出和评估大规模知识图(KG)中 semi-inductive link prediction(LP)模型的一个大规模 benchmark。
methods: 本文以 Wikidata5M 为基础,提供了传导式(transductive)、少样本(k-shot)和零样本(0-shot)三种链接预测任务,每种任务可用的信息各不相同,包括知识图结构、文本提及以及实体的详细描述。
results: 小规模实验结果表明,在长尾实体上,半归纳式链接预测的性能远低于传导式链接预测的性能。这个 benchmark 为未来在半归纳式链接预测模型中整合文本和上下文信息的研究提供了一个测试床。Abstract
Semi-inductive link prediction (LP) in knowledge graphs (KG) is the task of predicting facts for new, previously unseen entities based on context information. Although new entities can be integrated by retraining the model from scratch in principle, such an approach is infeasible for large-scale KGs, where retraining is expensive and new entities may arise frequently. In this paper, we propose and describe a large-scale benchmark to evaluate semi-inductive LP models. The benchmark is based on and extends Wikidata5M: It provides transductive, k-shot, and 0-shot LP tasks, each varying the available information from (i) only KG structure, to (ii) including textual mentions, and (iii) detailed descriptions of the entities. We report on a small study of recent approaches and found that semi-inductive LP performance is far from transductive performance on long-tail entities throughout all experiments. The benchmark provides a test bed for further research into integrating context and textual information in semi-inductive LP models.
摘要
《知识图(KG)中的半归纳式链接预测(LP)任务是基于上下文信息预测新的、未见过实体的相关事实。虽然原则上可以通过从头重新训练模型来纳入新实体,但这种方法在大规模知识图中并不可行,因为重新训练代价高昂,而新实体可能会频繁出现。在这篇论文中,我们提出了一个用于评估半归纳式链接预测模型的大规模 benchmark。该 benchmark 基于并扩展了 Wikidata5M:它提供了传导式、k-shot 和 0-shot 链接预测任务,每个任务可用的信息从仅有知识图结构,到包含文本提及,再到实体的详细描述各不相同。我们对一些最新的方法进行了小规模研究,发现在所有实验中,半归纳式链接预测在长尾实体上的性能都远低于传导式性能。该 benchmark 为进一步研究如何在半归纳式链接预测模型中整合上下文和文本信息提供了一个测试床。》
Analyze Mass Spectrometry data with Artificial Intelligence to assist the understanding of past habitability of Mars and provide insights for future missions
results: 研究表明EGA-MS和GC-MS数据可以用于描述外星物质的化学成分,并且提供了一种可靠的方法来分析这些数据。Abstract
This paper presents an application of artificial intelligence on mass spectrometry data for detecting the habitability potential of ancient Mars. Although the data was collected for planet Mars, the same approach can be replicated for any terrestrial object of our solar system. Furthermore, the proposed methodology can be adapted to any domain that uses mass spectrometry. This research is focused on the data analysis of two mass spectrometry techniques, evolved gas analysis (EGA-MS) and gas chromatography (GC-MS), which are used to identify specific chemical compounds in geological material samples. The study demonstrates the applicability of EGA-MS and GC-MS data to extra-terrestrial material analysis. The most important features of the proposed methodology include square root transformation of mass spectrometry values, conversion of raw data to 2D spectrograms, and utilization of specific machine learning models and techniques to avoid overfitting on relatively small datasets. Both EGA-MS and GC-MS datasets come from NASA and two machine learning competitions that the author participated in and exploited. Complete running code for the GC-MS dataset/competition is available at GitHub.1 Raw training mass spectrometry data include [0, 1] labels of specific chemical compounds, selected to provide valuable insights and contribute to our understanding of the potential past habitability of Mars.
摘要
From Neural Activations to Concepts: A Survey on Explaining Concepts in Neural Networks
results: 本文通过对多种神经网络模型进行分析,发现了一些有用的概念解释方法,并且提出了一些可能的应用场景。这些结果可能为实现基于可解释概念的神经网络和符号学AI做出了重要贡献。Abstract
In this paper, we review recent approaches for explaining concepts in neural networks. Concepts can act as a natural link between learning and reasoning: once the concepts are identified that a neural learning system uses, one can integrate those concepts with a reasoning system for inference or use a reasoning system to act upon them to improve or enhance the learning system. On the other hand, knowledge can not only be extracted from neural networks but concept knowledge can also be inserted into neural network architectures. Since integrating learning and reasoning is at the core of neuro-symbolic AI, the insights gained from this survey can serve as an important step towards realizing neuro-symbolic AI based on explainable concepts.
摘要
在这篇论文中,我们对近期神经网络中的概念解释方法进行了综述。概念可以作为连接学习和推理的自然纽带:一旦确定了神经学习系统所使用的概念,就可以将这些概念与推理系统集成以进行推理,或者利用推理系统对其进行操作,以改进或增强学习系统。另一方面,不仅可以从神经网络中提取知识,也可以将概念知识插入到神经网络架构中。由于学习与推理的集成是神经符号(neuro-symbolic)AI 的核心,本综述所获得的启示可以作为实现基于可解释概念的神经符号 AI 的重要一步。
AI Nushu: An Exploration of Language Emergence in Sisterhood -Through the Lens of Computational Linguistics
results: 该研究提供了一种搭建在人工智能技术和中国文化遗产之上的艺术解读,以及一种将女性视角与计算语言学融合的新的视角。Abstract
This paper presents "AI Nushu," an emerging language system inspired by Nushu (women's scripts), the unique language created and used exclusively by ancient Chinese women who were thought to be illiterate under a patriarchal society. In this interactive installation, two artificial intelligence (AI) agents are trained in the Chinese dictionary and the Nushu corpus. By continually observing their environment and communicating, these agents collaborate towards creating a standard writing system to encode Chinese. It offers an artistic interpretation of the creation of a non-western script from a computational linguistics perspective, integrating AI technology with Chinese cultural heritage and a feminist viewpoint.
摘要
Enhancing Genetic Improvement Mutations Using Large Language Models
results: 研究发现,使用LLM生成的编辑可以提高单元测试通过率达到75%,但找到最佳改进的patch通常是通过标准插入编辑。此外,LLM增强的GI可以找到许多改进patch,但是最佳改进patch是通过标准GI找到的。Abstract
Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to generate edits for the JCodec tool. We randomly sample the space of edits using 5 different edit types. We find that the number of patches passing unit tests is up to 75% higher with LLM-based edits than with standard Insert edits. Further, we observe that the patches found with LLMs are generally less diverse compared to standard edits. We ran GI with local search to find runtime improvements. Although many improving patches are found by LLM-enhanced GI, the best improving patch was found by standard GI.
摘要
大型语言模型(LLM)已成功应用于软件工程任务,包括程序修复。然而,它们在基于搜索的技术(如遗传改进,GI)中的应用在很大程度上仍未被探索。在这篇论文中,我们评估了将 LLM 用作 GI 的变异算子以改进搜索过程。我们扩展了 Gin Java GI 工具包,通过调用 OpenAI 的 API 为 JCodec 工具生成编辑。我们使用 5 种不同的编辑类型对编辑空间进行随机采样。我们发现,使用基于 LLM 的编辑,通过单元测试的补丁数量比标准插入编辑高出多达 75%。此外,我们观察到 LLM 找到的补丁通常比标准编辑的多样性更低。我们还运行了带局部搜索的 GI 来寻找运行时改进。虽然 LLM 增强的 GI 找到了许多改进补丁,但最佳改进补丁仍是由标准 GI 找到的。
The Value-Sensitive Conversational Agent Co-Design Framework
results: 本研究提出了一个评估协议,以评估框架和设计工具组在设计工作室中的效果。Abstract
Conversational agents (CAs) are gaining traction in both industry and academia, especially with the advent of generative AI and large language models. As these agents are used more broadly by members of the general public and take on a number of critical use cases and social roles, it becomes important to consider the values embedded in these systems. This consideration includes answering questions such as 'whose values get embedded in these agents?' and 'how do those values manifest in the agents being designed?' Accordingly, the aim of this paper is to present the Value-Sensitive Conversational Agent (VSCA) Framework for enabling the collaborative design (co-design) of value-sensitive CAs with relevant stakeholders. Firstly, requirements for co-designing value-sensitive CAs which were identified in previous works are summarised here. Secondly, the practical framework is presented and discussed, including its operationalisation into a design toolkit. The framework facilitates the co-design of three artefacts that elicit stakeholder values and have a technical utility to CA teams to guide CA implementation, enabling the creation of value-embodied CA prototypes. Finally, an evaluation protocol for the framework is proposed where the effects of the framework and toolkit are explored in a design workshop setting to evaluate both the process followed and the outcomes produced.
摘要
对话代理(CA)在工业和学术界受到推广,特别是在生成AI和大型自然语言模型的出现后。这些代理在公众中更加广泛使用,扮演许多重要的使用案和社会角色,因此需要考虑这些系统中嵌入的价值。因此,本文的目的是提出价值敏感对话代理(VSCA)框架,帮助专业人员和重要参与者在一起设计价值敏感CA。首先,以前的研究中所识别出的实现值敏感CA的需求简述了一下。其次,实际的框架被提出来,并讨论了它的实现方式。这个框架包括三个展示实物,吸引参与者的价值,并对CA团队提供技术实用性,帮助创建具有价值的CA原型。最后,为这个框架和工具组提出评估协议,以评估这个框架和工具组在设计工作室中的影响,以及它们创造的结果。
Masked Pretraining for Multi-Agent Decision Making
results: 实验结果表明,使用MaskMA模型可以在11个训练地图上进行零情况下的赢利率达77.8%,并且在其他类型的下游任务中表现良好(如多策略协作和随机团队游戏)。Abstract
Building a single generalist agent with zero-shot capability has recently sparked significant advancements in decision-making. However, extending this capability to multi-agent scenarios presents challenges. Most current works struggle with zero-shot capabilities, due to two challenges particular to the multi-agent settings: a mismatch between centralized pretraining and decentralized execution, and varying agent numbers and action spaces, making it difficult to create generalizable representations across diverse downstream tasks. To overcome these challenges, we propose a \textbf{Mask}ed pretraining framework for \textbf{M}ulti-\textbf{a}gent decision making (MaskMA). This model, based on transformer architecture, employs a mask-based collaborative learning strategy suited for decentralized execution with partial observation. Moreover, MaskMA integrates a generalizable action representation by dividing the action space into actions toward self-information and actions related to other entities. This flexibility allows MaskMA to tackle tasks with varying agent numbers and thus different action spaces. Extensive experiments in SMAC reveal MaskMA, with a single model pretrained on 11 training maps, can achieve an impressive 77.8% zero-shot win rate on 60 unseen test maps by decentralized execution, while also performing effectively on other types of downstream tasks (\textit{e.g.,} varied policies collaboration and ad hoc team play).
摘要
Brain decoding: toward real-time reconstruction of visual perception
results: 1. MEG decoder 可以7倍提高图像检索率; 2. 脑响应图像具有高级视觉特征; 3. 图像检索和生成都表明MEG信号主要含有高级视觉特征, 而7T fMRI 则捕捉低级视觉特征。Abstract
In the past five years, the use of generative and foundational AI systems has greatly improved the decoding of brain activity. Visual perception, in particular, can now be decoded from functional Magnetic Resonance Imaging (fMRI) with remarkable fidelity. This neuroimaging technique, however, suffers from a limited temporal resolution ($\approx$0.5 Hz) and thus fundamentally constrains its real-time usage. Here, we propose an alternative approach based on magnetoencephalography (MEG), a neuroimaging device capable of measuring brain activity with high temporal resolution ($\approx$5,000 Hz). For this, we develop an MEG decoding model trained with both contrastive and regression objectives and consisting of three modules: i) pretrained embeddings obtained from the image, ii) an MEG module trained end-to-end and iii) a pretrained image generator. Our results are threefold: Firstly, our MEG decoder shows a 7X improvement of image-retrieval over classic linear decoders. Second, late brain responses to images are best decoded with DINOv2, a recent foundational image model. Third, image retrievals and generations both suggest that MEG signals primarily contain high-level visual features, whereas the same approach applied to 7T fMRI also recovers low-level features. Overall, these results provide an important step towards the decoding - in real time - of the visual processes continuously unfolding within the human brain.
摘要
在过去五年中,生成式和基础 AI 系统的使用已经大幅提升了对大脑活动的解码能力。尤其是视觉感知,现在可以通过功能性磁共振成像(fMRI)以相当高的保真度进行解码。然而,这种神经成像技术的时间分辨率有限(约 0.5 Hz),从根本上限制了其实时应用。为此,我们提出了一种基于脑磁图(MEG)的替代方案,MEG 是一种能够以高时间分辨率(约 5000 Hz)测量大脑活动的神经成像设备。我们开发了一个以对比和回归为联合训练目标的 MEG 解码模型,它由三个模块组成:i)从图像获得的预训练嵌入,ii)端到端训练的 MEG 模块,iii)预训练的图像生成器。我们的结果有三点:首先,我们的 MEG 解码器在图像检索上相比经典线性解码器提升了 7 倍;其次,对图像的晚期脑响应用 DINOv2(一种最新的基础图像模型)解码效果最佳;最后,图像检索和生成的结果都表明 MEG 信号主要包含高层视觉特征,而将同样的方法应用于 7T fMRI 则还能恢复低层特征。总之,这些结果为实时解码人脑中持续展开的视觉过程迈出了重要一步。
results: 论文显示,每个满足公民主权且独立性的分类聚合函数本质上都是独裁(dictatorship),这蕴含了 Maniquet 和 Mongin(2016)的结果。此外,论文还提出了一种新的证明方法,可以涵盖两个类别的情形(对象数量也为两个的情况除外)。最后,论文还列出了两个类别和两个对象情形下所有独立且一致的分类聚合函数。Abstract
A classification is a surjective mapping from a set of objects to a set of categories. A classification aggregation function aggregates every vector of classifications into a single one. We show that every citizen sovereign and independent classification aggregation function is essentially a dictatorship. This impossibility implies an earlier result of Maniquet and Mongin (2016), who show that every unanimous and independent classification aggregation function is a dictatorship. The relationship between the two impossibilities is reminiscent of the relationship between Wilson's and Arrow's impossibilities in preference aggregation. Moreover, while the Maniquet-Mongin impossibility rests on the existence of at least three categories, we propose an alternative proof technique that covers the case of two categories, except when the number of objects is also two. We also identify all independent and unanimous classification aggregation functions for the case of two categories and two objects.
摘要
分类是从对象集合到类别集合的满射。分类聚合函数将每一组分类向量聚合为单一的分类。我们证明,每个满足公民主权且独立性的分类聚合函数本质上都是独裁。这一不可能性蕴含了 Maniquet 和 Mongin(2016)早先的结果,他们证明每个满足一致同意且独立性的分类聚合函数都是独裁。这两个不可能性之间的关系,与偏好聚合中 Wilson 不可能性与 Arrow 不可能性之间的关系相似。此外,Maniquet-Mongin 不可能性依赖于至少存在三个类别,而我们提出了一种替代的证明技巧,可以覆盖只有两个类别的情形(对象数量也为两个的情况除外)。我们还给出了两个类别和两个对象情形下所有独立且一致同意的分类聚合函数。
IntentDial: An Intent Graph based Multi-Turn Dialogue System with Reasoning Path Visualization
results: 该研究通过实验证明了该系统的可行性和效果,并且可以帮助提高对话系统的实际应用。Abstract
Intent detection and identification from multi-turn dialogue has become a widely explored technique in conversational agents, for example, voice assistants and intelligent customer services. The conventional approaches typically cast the intent mining process as a classification task. Although neural classifiers have proven adept at such classification tasks, the issue of neural network models often impedes their practical deployment in real-world settings. We present a novel graph-based multi-turn dialogue system called IntentDial, which identifies a user's intent by identifying intent elements and a standard query from a dynamically constructed and extensible intent graph using reinforcement learning. In addition, we provide visualization components to monitor the immediate reasoning path for each turn of a dialogue, which greatly facilitates further improvement of the system.
摘要
从多轮对话中进行意图检测与识别,已成为对话智能体(例如语音助手和智能客服)中被广泛研究的技术。传统方法通常将意图挖掘过程视为分类任务。虽然神经网络分类器在这类任务中表现出色,但神经网络模型自身的问题往往阻碍其在实际场景中的部署。我们提出了一种新的基于图的多轮对话系统 IntentDial,它利用强化学习,从动态构建且可扩展的意图图中识别意图元素和标准查询,从而确定用户的意图。此外,我们还提供了可视化组件,用于监控对话中每一轮的即时推理路径,这对系统的进一步改进很有帮助。
methods: 使用传统的 MLP 和可微分决策树,在合成数据和真实金融市场数据上预测固定期限回报,并通过级联多个模型来降低数据噪声的影响
results: 我们的方法可以在降低风险水平的同时获得更高的总收益;此外,我们还提出了衡量每笔交易平均收益的实用指标,以及经下行风险调整后的回报指标。Abstract
Price movements in financial markets are well known to be very noisy. As a result, even if there are, on occasion, exploitable patterns that could be picked up by machine-learning algorithms, these are obscured by feature and label noise rendering the predictions less useful, and risky in practice. Traditional rule-learning techniques developed for noisy data, such as CN2, would seek only high precision rules and refrain from making predictions where their antecedents did not apply. We apply a similar approach, where a model abstains from making a prediction on data points that it is uncertain on. During training, a cascade of such models are learned in sequence, similar to rule lists, with each model being trained only on data on which the previous model(s) were uncertain. Similar pruning of data takes place at test-time, with (higher accuracy) predictions being made albeit only on a fraction (support) of test-time data. In a financial prediction setting, such an approach allows decisions to be taken only when the ensemble model is confident, thereby reducing risk. We present results using traditional MLPs as well as differentiable decision trees, on synthetic data as well as real financial market data, to predict fixed-term returns using commonly used features. We submit that our approach is likely to result in better overall returns at a lower level of risk. In this context we introduce an utility metric to measure the average gain per trade, as well as the return adjusted for downside risk, both of which are improved significantly by our approach.
摘要
众所周知,金融市场的价格变动噪声很大。因此,即使偶尔存在可以被机器学习算法捕捉的可利用模式,这些模式也会被特征噪声和标签噪声所掩盖,使预测的实用性下降,并在实践中带来风险。针对噪声数据开发的传统规则学习技术(如 CN2)只寻找高精度规则,并在规则前件不适用时放弃预测。我们采用类似的思路,让模型对其不确定的数据点放弃预测。在训练过程中,类似于规则列表,我们依次学习一系列级联模型,每个模型只在前面模型不确定的数据上训练。测试时也进行类似的数据筛选:只在一部分(支持度)测试数据上做出(更高精度的)预测。在金融预测场景下,这种方法只在集成模型有把握时才做出决策,从而降低风险。我们使用传统的 MLP 以及可微分决策树,在合成数据和真实金融市场数据上,基于常用特征预测固定期限回报。我们认为,这种方法有望在更低的风险水平下带来更好的总收益。在此背景下,我们引入了衡量每笔交易平均收益的实用指标,以及经下行风险调整后的回报指标,两者在我们的方法下均得到显著提升。
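A minimal sketch of the cascaded abstain-when-uncertain idea described above, using scikit-learn classifiers: each stage predicts only where its confidence clears a threshold, and the next stage trains on the points the previous stages abstained on. The class name, the confidence threshold, and the synthetic up/down labels are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

class AbstainingCascade:
    """Cascade of classifiers: each stage predicts only where it is confident;
    the next stage trains on (and is later queried for) the points the previous
    stages abstained on. A prediction of -1 means 'abstain / no trade'."""
    def __init__(self, n_stages=3, threshold=0.7):
        self.n_stages, self.threshold = n_stages, threshold
        self.stages = []

    def fit(self, X, y):
        idx = np.arange(len(X))
        for _ in range(self.n_stages):
            if len(idx) == 0:
                break
            clf = GradientBoostingClassifier().fit(X[idx], y[idx])
            conf = clf.predict_proba(X[idx]).max(axis=1)
            self.stages.append(clf)
            idx = idx[conf < self.threshold]   # next stage trains on the uncertain remainder
        return self

    def predict(self, X):
        out = np.full(len(X), -1)
        todo = np.arange(len(X))
        for clf in self.stages:
            if len(todo) == 0:
                break
            proba = clf.predict_proba(X[todo])
            conf, pred = proba.max(axis=1), clf.classes_[proba.argmax(axis=1)]
            sure = conf >= self.threshold
            out[todo[sure]] = pred[sure]
            todo = todo[~sure]                 # pass the uncertain points down the cascade
        return out

# Toy usage on synthetic "up/down" labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10)); y = (X[:, 0] + 0.5 * rng.normal(size=3000) > 0).astype(int)
model = AbstainingCascade().fit(X[:2000], y[:2000])
pred = model.predict(X[2000:])
covered = pred != -1
print(f"support: {covered.mean():.2f}  accuracy on covered: {(pred[covered] == y[2000:][covered]).mean():.2f}")
```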
Learning and Discovering Quantum Properties with Multi-Task Neural Networks
results: 发现一种模型可以通过多用途训练,不仅预测给定集合中的性质,还可以描述全局多体量子系统的性质从本地测量中。同时,模型还可以分类保护型态相对变化、发现不确定的界限。Abstract
Deep neural networks are a powerful tool for predicting properties of quantum states from limited measurement data. Here we develop a network model that can simultaneously predict multiple quantum properties, including not only expectation values of quantum observables, but also general nonlinear functions of the quantum state, like entanglement entropies and many-body topological invariants. Remarkably, we find that a model trained on a given set of properties can also discover new properties outside that set. Multi-purpose training also enables the model to infer global properties of many-body quantum systems from local measurements, to classify symmetry protected topological phases of matter, and to discover unknown boundaries between different phases.
摘要
深度神经网络是一种强大的工具,可以从有限的测量数据中预测量子态的性质。在这里,我们开发了一种网络模型,可以同时预测多种量子性质,不仅包括量子可观测量的期望值,还包括量子态的一般非线性函数,如纠缠熵和多体拓扑不变量。值得注意的是,我们发现,在给定性质集合上训练的模型还能够发现该集合之外的新性质。多用途训练还使模型能够从局部测量推断多体量子系统的全局性质,对对称性保护的拓扑物态进行分类,并发现不同相之间未知的相边界。
results: 这种方法可以在不同的环境下实现长期公平的决策,并且可以解决多个目标之间的冲突。在路径规划问题上,这种方法可以synthesize一对策略和它们的初始分配的预算,以及拍卖策略。Abstract
Many sequential decision-making tasks require satisfaction of multiple, partially contradictory objectives. Existing approaches are monolithic, namely all objectives are fulfilled using a single policy, which is a function that selects a sequence of actions. We present auction-based scheduling, a modular framework for multi-objective decision-making problems. Each objective is fulfilled using a separate policy, and the policies can be independently created, modified, and replaced. Understandably, different policies with conflicting goals may choose conflicting actions at a given time. In order to resolve conflicts, and compose policies, we employ a novel auction-based mechanism. We allocate a bounded budget to each policy, and at each step, the policies simultaneously bid from their available budgets for the privilege of being scheduled and choosing an action. Policies express their scheduling urgency using their bids and the bounded budgets ensure long-run scheduling fairness. We lay the foundations of auction-based scheduling using path planning problems on finite graphs with two temporal objectives. We present decentralized algorithms to synthesize a pair of policies, their initially allocated budgets, and bidding strategies. We consider three categories of decentralized synthesis problems, parameterized by the assumptions that the policies make on each other: (a) strong synthesis, with no assumptions and strongest guarantees, (b) assume-admissible synthesis, with weakest rationality assumptions, and (c) assume-guarantee synthesis, with explicit contract-based assumptions. For reachability objectives, we show that, surprisingly, decentralized assume-admissible synthesis is always possible when the out-degrees of all vertices are at most two.
摘要
许多顺序决策任务需满足多个、部分矛盾的目标。现有的方法都是单一的,即所有目标都是通过单一策略(一个函数选择一系列动作)来满足。我们介绍了拍卖机制来解决这类决策问题。在这种机制下,每个目标都是通过一个分离的策略来满足,这些策略可以独立创建、修改和替换。当不同的策略有冲突目标时,我们使用一种新的拍卖机制来解决冲突。我们为每个策略分配一个固定预算,并在每步中让各策略同时从其可用预算中竞拍为执行动作的权利。策略通过竞拍价格表达其排期优先级,并且固定预算确保长期排期公平。我们在路径规划问题上建立了拍卖机制的基础,并提出了三种分类的分解问题:强化合理化(strong synthesis)、弱合理化(assume-admissible synthesis)和合理合同(assume-guarantee synthesis)。对于可达性目标,我们发现了一个意外的结论:在所有顶点出度都不大于2时,分解问题总是可能的。
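A toy sketch of the auction step described above: each policy bids from a bounded budget, the highest bidder is scheduled and picks the action, and its payment is redistributed to the other policies. The payment and tie-breaking rules here are placeholder assumptions for illustration; the paper's mechanism, budget dynamics, and synthesis of bidding strategies are more involved.

```python
import random

def auction_schedule(policies, budgets, horizon, state, step_fn):
    """Toy run loop for auction-based policy composition: at every step each policy
    bids from its remaining (bounded) budget; the highest bidder is scheduled, picks
    the action, and its bid is redistributed to the other policies (our assumption)."""
    n = len(policies)
    for _ in range(horizon):
        bids = [min(p.bid(state), budgets[i]) for i, p in enumerate(policies)]
        winner = max(range(n), key=lambda i: (bids[i], random.random()))  # random tie-break
        budgets[winner] -= bids[winner]
        for i in range(n):
            if i != winner:
                budgets[i] += bids[winner] / (n - 1)
        state = step_fn(state, policies[winner].act(state))
    return state, budgets

class UrgencyPolicy:
    """Hypothetical policy: bids more the further it is from satisfying its objective."""
    def __init__(self, target, budget_share):
        self.target, self.share = target, budget_share
    def bid(self, state):
        return self.share * (1.0 if state != self.target else 0.1)
    def act(self, state):
        return +1 if state < self.target else -1

# Toy usage on a 1-D state: two policies pulling toward different targets.
p1, p2 = UrgencyPolicy(target=5, budget_share=0.3), UrgencyPolicy(target=-5, budget_share=0.3)
final_state, final_budgets = auction_schedule([p1, p2], budgets=[1.0, 1.0], horizon=20,
                                              state=0, step_fn=lambda s, a: s + a)
print(final_state, final_budgets)
```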
Solving the multiplication problem of a large language model system using a graph-based method
paper_authors: Turker Tuncer, Sengul Dogan, Mehmet Baygin, Prabal Datta Barua, Abdul Hafeez-Baig, Ru-San Tan, Subrata Chakraborty, U. Rajendra Acharya
for: 解决 chatGPT 模型中的乘法问题,提高其数学运算精度。
methods: 基于图表结构的乘法算法,通过增加 10k 操作符来模拟人类数学运算。
results: 对 1,000,000 个大数乘法任务,提出了 100% 的准确率,成功解决了 GPT 模型中的乘法挑战。Abstract
The generative pre-trained transformer (GPT)-based chatbot software ChatGPT possesses excellent natural language processing capabilities but is inadequate for solving arithmetic problems, especially multiplication. Its GPT structure uses a computational graph for multiplication, which has limited accuracy beyond simple multiplication operations. We developed a graph-based multiplication algorithm that emulated human-like numerical operations by incorporating a 10k operator, where k represents the maximum power to base 10 of the larger of two input numbers. Our proposed algorithm attained 100% accuracy for 1,000,000 large number multiplication tasks, effectively solving the multiplication challenge of GPT-based and other large language models. Our work highlights the importance of blending simple human insights into the design of artificial intelligence algorithms. Keywords: Graph-based multiplication; ChatGPT; Multiplication problem
摘要
《基于生成式预训练 Transformer(GPT)的聊天机器人软件 ChatGPT 具有出色的自然语言处理能力,但在解决算术问题(尤其是乘法)方面能力不足。其 GPT 结构在乘法中使用的计算图,在简单乘法运算之外的准确性有限。我们开发了一种基于图的乘法算法,通过引入 10k 算子(其中 k 表示两个输入数中较大者以 10 为底的最大幂次),模拟类似人类的数值运算。我们提出的算法在 100 万个大数乘法任务中达到了 100% 的准确率,有效解决了基于 GPT 的模型及其他大型语言模型的乘法难题。我们的工作强调了将朴素的人类洞见融入人工智能算法设计的重要性。关键词:基于图的乘法;ChatGPT;乘法问题》
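The abstract's graph-based algorithm is not spelled out, but the underlying idea of decomposing a large multiplication into single-digit products shifted by powers of ten can be sketched as ordinary long multiplication. The function below is only an illustration of that decomposition, verified against Python's exact big-integer arithmetic; it is not the authors' 10k-operator graph construction.

```python
def digit_multiply(a: int, b: int) -> int:
    """Long multiplication the way humans do it: multiply one digit at a time,
    shift by powers of ten, and add the partial products. Only single-digit
    products and additions are ever required."""
    sign = -1 if (a < 0) ^ (b < 0) else 1
    a, b = abs(a), abs(b)
    total = 0
    for i, db in enumerate([int(d) for d in str(b)][::-1]):     # least-significant first
        partial, carry = 0, 0
        for j, da in enumerate([int(d) for d in str(a)][::-1]):
            prod = da * db + carry                               # single-digit multiply + carry
            partial += (prod % 10) * 10 ** j
            carry = prod // 10
        partial += carry * 10 ** len(str(a))
        total += partial * 10 ** i                               # shift by the digit position
    return sign * total

# Sanity check against Python's exact big-integer arithmetic.
import random
for _ in range(1000):
    x, y = random.randint(10**20, 10**40), random.randint(10**20, 10**40)
    assert digit_multiply(x, y) == x * y
print("all checks passed")
```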
Telecom AI Native Systems in the Age of Generative AI – An Engineering Perspective
paper_authors: Ricardo Britto, Timothy Murphy, Massimo Iovene, Leif Jonsson, Melike Erol-Kantarci, Benedek Kovács
for: The paper explores the integration of foundational models (FMs) in the telecommunications industry, with a focus on the concept of “AI native telco” and the engineering considerations and challenges associated with implementing FMs in the software life cycle.
methods: The paper discusses the use of FMs in natural language processing tasks and content generation, and highlights the need for AI native-first approaches to fully leverage the potential of FMs in the telecom industry.
results: The paper emphasizes the enormous potential of FMs in revolutionizing how we interact with software products and services in the telecom industry, but also acknowledges the need for careful consideration of ethical, regulatory, and operational challenges to ensure the successful integration of FMs in mission-critical telecom contexts.Abstract
The rapid advancements in Artificial Intelligence (AI), particularly in generative AI and foundational models (FMs), have ushered in transformative changes across various industries. Large language models (LLMs), a type of FM, have demonstrated their prowess in natural language processing tasks and content generation, revolutionizing how we interact with software products and services. This article explores the integration of FMs in the telecommunications industry, shedding light on the concept of AI native telco, where AI is seamlessly woven into the fabric of telecom products. It delves into the engineering considerations and unique challenges associated with implementing FMs into the software life cycle, emphasizing the need for AI native-first approaches. Despite the enormous potential of FMs, ethical, regulatory, and operational challenges require careful consideration, especially in mission-critical telecom contexts. As the telecom industry seeks to harness the power of AI, a comprehensive understanding of these challenges is vital to thrive in a fiercely competitive market.
摘要
“人工智能(AI)的快速进步,特别是生成式 AI 和基础模型(FM),已经在各行各业带来了变革性的改变。大语言模型(LLM)作为一种基础模型,在自然语言处理任务和内容生成方面表现出色,改变了我们与软件产品和服务的交互方式。本文探讨了基础模型在电信行业中的整合,阐释了 AI native telco 这一概念,即将 AI 无缝地融入电信产品的结构之中。文章还讨论了在软件生命周期中引入基础模型的工程考量和特有挑战,强调了 AI native 优先的方法。虽然基础模型具有巨大的潜力,但伦理、监管和运营上的挑战仍需仔细考量,特别是在关键任务的电信场景中。电信行业要想利用 AI 的力量,就必须深入理解这些挑战,才能在竞争激烈的市场中蓬勃发展。”
Stranger Danger! Cross-Community Interactions with Fringe Users Increase the Growth of Fringe Communities on Reddit
paper_authors: Giuseppe Russo, Manoel Horta Ribeiro, Robert West
for: 这项研究旨在解释宣扬阴谋论和极端思想的边缘社区在主流平台上快速增长的机制。
methods: 这项研究使用基于文本的因果推断技术,研究边缘互动(边缘社区成员与非成员之间的评论交流)对 Reddit 上边缘社区增长的影响。
results: 研究发现,边缘互动可以吸引新成员加入这些社区。接受这些互动的用户比相似的匹配用户加入边缘社区的可能性最多高出 4.2 个百分点(pp)。这种效应受到社区特点(如左倾或右倾社区)和互动语言的影响:使用恶意语言的互动使加入边缘社区的可能性比非恶意互动高 5pp。对非边缘社区(如 r/climatechange、r/NBA、r/leagueoflegends)重复这一分析时,未发现这种增长机制。总的来说,我们的发现表明,减少边缘互动可以减少主流平台上边缘社区的增长。Abstract
Fringe communities promoting conspiracy theories and extremist ideologies have thrived on mainstream platforms, raising questions about the mechanisms driving their growth. Here, we hypothesize and study a possible mechanism: new members may be recruited through fringe-interactions: the exchange of comments between members and non-members of fringe communities. We apply text-based causal inference techniques to study the impact of fringe-interactions on the growth of three prominent fringe communities on Reddit: r/Incel, r/GenderCritical, and r/The_Donald. Our results indicate that fringe-interactions attract new members to fringe communities. Users who receive these interactions are up to 4.2 percentage points (pp) more likely to join fringe communities than similar, matched users who do not. This effect is influenced by 1) the characteristics of communities where the interaction happens (e.g., left vs. right-leaning communities) and 2) the language used in the interactions. Interactions using toxic language have a 5pp higher chance of attracting newcomers to fringe communities than non-toxic interactions. We find no effect when repeating this analysis by replacing fringe (r/Incel, r/GenderCritical, and r/The_Donald) with non-fringe communities (r/climatechange, r/NBA, r/leagueoflegends), suggesting this growth mechanism is specific to fringe communities. Overall, our findings suggest that curtailing fringe-interactions may reduce the growth of fringe communities on mainstream platforms.
摘要
宣扬阴谋论和极端思想的边缘社区在主流平台上蓬勃发展,这引发了关于其增长机制的疑问。我们在此提出并研究一种可能的机制:新成员可能通过边缘互动(即边缘社区成员与非成员之间的评论交流)被招募进来。我们使用基于文本的因果推断技术,研究边缘互动对 Reddit 上三个著名边缘社区(r/Incel、r/GenderCritical 和 r/The_Donald)增长的影响。我们的结果表明,边缘互动会吸引新成员加入边缘社区:收到这些互动的用户加入边缘社区的可能性比相似的匹配用户最多高出 4.2 个百分点(pp)。这种效应受到以下因素影响:1)互动发生的社区特点(如左倾与右倾社区);2)互动中使用的语言。使用有毒语言的互动吸引新成员加入边缘社区的可能性比非有毒互动高 5pp。当我们把边缘社区(r/Incel、r/GenderCritical 和 r/The_Donald)替换为非边缘社区(r/climatechange、r/NBA、r/leagueoflegends)重复这一分析时,没有发现这种效应,这表明该增长机制是边缘社区所特有的。总的来说,我们的发现表明,限制边缘互动可能会减缓主流平台上边缘社区的增长。
Estimating Material Properties of Interacting Objects Using Sum-GP-UCB
results: 能够有效地进行逐步学习,不需要重新评估已有观察数据的奖励值Abstract
Robots need to estimate the material and dynamic properties of objects from observations in order to simulate them accurately. We present a Bayesian optimization approach to identifying the material property parameters of objects based on a set of observations. Our focus is on estimating these properties based on observations of scenes with different sets of interacting objects. We propose an approach that exploits the structure of the reward function by modeling the reward for each observation separately and using only the parameters of the objects in that scene as inputs. The resulting lower-dimensional models generalize better over the parameter space, which in turn results in a faster optimization. To speed up the optimization process further, and reduce the number of simulation runs needed to find good parameter values, we also propose partial evaluations of the reward function, wherein the selected parameters are only evaluated on a subset of real world evaluations. The approach was successfully evaluated on a set of scenes with a wide range of object interactions, and we showed that our method can effectively perform incremental learning without resetting the rewards of the gathered observations.
摘要
Robots需要估算物体的物理和动态性质从观察数据中,以便准确模拟。我们提出了一种 bayesian优化方法,用于根据观察数据中的物体参数进行物体物理性质的估算。我们的注重点是基于不同交互对象的场景中的观察数据进行估算。我们提议利用奖励函数的结构,将奖励函数分割成每个观察中的奖励模型,并只使用场景中的对象参数作为输入。这将导致更好的维度减少,从而更快地优化。为了进一步加速优化过程,并减少需要进行实际评估的运行次数,我们还提议使用部分评估奖励函数,选择的参数只在一部分实际评估中进行评估。我们成功地应用了这种方法在一组具有多种对象交互的场景中,并证明了我们的方法可以进行逐步学习而不需要重置观察得到的奖励。
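A rough sketch of the per-observation reward modelling described above: fit one low-dimensional Gaussian-process reward model per scene (over only the objects appearing in that scene), sum the per-scene UCB scores over candidate material parameters, and evaluate the best candidate next. It uses scikit-learn GPs; the kernel, the UCB constant, and the handling of partial evaluations are simplifications and assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def scene_features(theta, objs):
    """Flatten only the parameters of the objects present in a scene.
    theta: (n_candidates, n_objects, n_params_per_object)."""
    return theta[:, objs, :].reshape(len(theta), -1)

def sum_gp_ucb_step(scene_history, candidates, kappa=2.0):
    """One acquisition step: fit a separate (lower-dimensional) GP reward model per
    observed scene, sum the per-scene UCB scores, and pick the next candidate."""
    total = np.zeros(len(candidates))
    for objs, X, y in scene_history:
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True).fit(X, y)
        mu, std = gp.predict(scene_features(candidates, objs), return_std=True)
        total += mu + kappa * std
    return np.argmax(total)

# Toy usage: 3 objects with 2 material parameters each; two scenes observe different pairs.
rng = np.random.default_rng(0)
candidates = rng.uniform(0, 1, size=(300, 3, 2))
hist = []
for objs in [(0, 1), (1, 2)]:
    past = rng.uniform(0, 1, size=(12, 3, 2))
    rewards = -np.linalg.norm(scene_features(past, list(objs)) - 0.5, axis=1)  # fake sim rewards
    hist.append((list(objs), scene_features(past, list(objs)), rewards))
best = sum_gp_ucb_step(hist, candidates)
print("next parameters to evaluate:", candidates[best])
```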
Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning
results: 该论文在多个离线RL方法,如IQL、CQL和BRAC等方法的基础上,提出了一种适应量化的策略,并在Robomimic环境中进行了详细的实验 validate。结果显示,与离线RL方法相比,该策略可以提高策略性能 by 2-3倍。Abstract
The offline reinforcement learning (RL) paradigm provides a general recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data. While policy constraints, conservatism, and other methods for mitigating distributional shifts have made offline reinforcement learning more effective, the continuous action setting often necessitates various approximations for applying these techniques. Many of these challenges are greatly alleviated in discrete action settings, where offline RL constraints and regularizers can often be computed more precisely or even exactly. In this paper, we propose an adaptive scheme for action quantization. We use a VQ-VAE to learn state-conditioned action quantization, avoiding the exponential blowup that comes with na\"ive discretization of the action space. We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme. We further validate our approach on a set of challenging long-horizon complex robotic manipulation tasks in the Robomimic environment, where our discretized offline RL algorithms are able to improve upon their continuous counterparts by 2-3x. Our project page is at https://saqrl.github.io/
摘要
“离线强化学习(offline RL)范式提供了一种通用的方法,可以将静态行为数据集转化为性能优于数据采集策略的策略。虽然策略约束、保守性估计以及其他缓解分布偏移的方法使离线强化学习更加有效,但连续动作设定往往需要对这些技术进行各种近似。在离散动作设定中,这些挑战在很大程度上得到缓解,因为离线 RL 的约束和正则项通常可以被更精确甚至精确地计算。在这篇论文中,我们提出了一种自适应的动作量化方案。我们使用 VQ-VAE 学习以状态为条件的动作量化,避免了对动作空间进行朴素离散化所带来的指数爆炸。我们证明,多种当前最先进的离线 RL 方法(如 IQL、CQL 和 BRAC)在与我们提出的离散化方案结合后,在基准任务上的性能都有所提升。我们进一步在 Robomimic 环境中一组具有挑战性的长时域复杂机器人操作任务上验证了我们的方法,其中离散化后的离线 RL 算法的性能比其连续版本高出 2-3 倍。我们的项目页面是 https://saqrl.github.io/”
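A simplified sketch of the action-quantization interface described above: learn a codebook over the dataset's continuous actions and map actions to and from discrete code indices so that discrete-action offline-RL machinery can be applied. Plain k-means replaces the paper's state-conditioned VQ-VAE here purely for illustration, and the codebook sizes are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

class ActionCodebook:
    """Quantize a continuous action space with a learned codebook so that discrete
    offline-RL machinery (exact constraints, categorical policies) can be applied.
    The paper learns a state-conditioned codebook with a VQ-VAE; plain k-means over
    the dataset actions is used here only to illustrate the interface."""
    def __init__(self, n_codes=64):
        self.km = KMeans(n_clusters=n_codes, n_init=10, random_state=0)

    def fit(self, actions):               # actions: (N, action_dim) from the offline dataset
        self.km.fit(actions)
        return self

    def encode(self, actions):            # continuous action -> discrete code index
        return self.km.predict(actions)

    def decode(self, codes):              # discrete code index -> executable continuous action
        return self.km.cluster_centers_[codes]

# Toy usage: relabel an offline dataset with discrete actions for a discrete-action learner.
rng = np.random.default_rng(0)
dataset_actions = rng.uniform(-1, 1, size=(10000, 7))    # e.g. 7-DoF arm commands
codebook = ActionCodebook(n_codes=128).fit(dataset_actions)
discrete = codebook.encode(dataset_actions)               # feed these to the discrete learner
reconstructed = codebook.decode(discrete)                  # execute nearest-code actions
print(discrete[:5], np.abs(reconstructed - dataset_actions).mean())
```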
Federated Heterogeneous Graph Neural Network for Privacy-preserving Recommendation
paper_authors: Bo Yan, Yang Cao, Haoyu Wang, Wenchuan Yang, Junping Du, Chuan Shi
For: This paper proposes a federated heterogeneous graph neural network (FedHGNN) based framework for recommendation, which can collaboratively train a recommendation model on distributed Heterogeneous Information Networks (HINs) without leaking user privacy.* Methods: The paper formalizes a privacy definition based on differential privacy for HIN-based federated recommendation, and elaborately designs a semantic-preserving user interactions publishing method to recover the broken meta-path based semantics caused by distributed data storage.* Results: The proposed FedHGNN model outperforms existing methods by a large margin (up to 34% in HR@10 and 42% in NDCG@10) under an acceptable privacy budget, as demonstrated through extensive experiments on three datasets.Abstract
Heterogeneous information network (HIN), which contains rich semantics depicted by meta-paths, has become a powerful tool to alleviate data sparsity in recommender systems. Existing HIN-based recommendations hold the data centralized storage assumption and conduct centralized model training. However, the real-world data is often stored in a distributed manner for privacy concerns, resulting in the failure of centralized HIN-based recommendations. In this paper, we suggest the HIN is partitioned into private HINs stored in the client side and shared HINs in the server. Following this setting, we propose a federated heterogeneous graph neural network (FedHGNN) based framework, which can collaboratively train a recommendation model on distributed HINs without leaking user privacy. Specifically, we first formalize the privacy definition in the light of differential privacy for HIN-based federated recommendation, which aims to protect user-item interactions of private HIN as well as user's high-order patterns from shared HINs. To recover the broken meta-path based semantics caused by distributed data storage and satisfy the proposed privacy, we elaborately design a semantic-preserving user interactions publishing method, which locally perturbs user's high-order patterns as well as related user-item interactions for publishing. After that, we propose a HGNN model for recommendation, which conducts node- and semantic-level aggregations to capture recovered semantics. Extensive experiments on three datasets demonstrate our model outperforms existing methods by a large margin (up to 34% in HR@10 and 42% in NDCG@10) under an acceptable privacy budget.
摘要
异质信息网络(HIN)通过meta-path刻画丰富的语义,已成为推荐系统中缓解数据稀疏问题的有力工具。现有的基于HIN的推荐方法假设数据集中存储并进行集中式模型训练。然而,出于隐私考虑,现实中的数据往往以分布式方式存储,这导致集中式的HIN推荐方法失效。在这篇论文中,我们提议将HIN划分为存储在客户端的私有HIN和存储在服务器端的共享HIN。基于这种设定,我们提出了一种联邦异质图神经网络(FedHGNN)框架,可以在分布式HIN上协同训练推荐模型而不泄露用户隐私。具体来说,我们首先基于差分隐私为基于HIN的联邦推荐形式化了隐私定义,旨在保护私有HIN中的用户-物品交互以及来自共享HIN的用户高阶模式。为了恢复因分布式存储而破坏的基于meta-path的语义并满足所提出的隐私要求,我们精心设计了一种保留语义的用户交互发布方法,在本地对用户的高阶模式及相关的用户-物品交互进行扰动后再发布。随后,我们提出了一种用于推荐的HGNN模型,通过节点级和语义级聚合来捕捉恢复后的语义。在三个数据集上的大量实验表明,在可接受的隐私预算下,我们的模型大幅超越现有方法(HR@10最高提升34%,NDCG@10最高提升42%)。
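The semantic-preserving publishing method in FedHGNN is more elaborate than what fits here, but the underlying local-differential-privacy idea of perturbing a user's interactions before they leave the client can be sketched with plain randomized response. This toy function, including the flip probability and the hypothetical interaction vector, is an assumption for illustration only and ignores the meta-path-aware parts of the paper's mechanism.

```python
import numpy as np

def randomized_response(interactions: np.ndarray, epsilon: float) -> np.ndarray:
    """Flip each 0/1 user-item interaction bit with probability 1/(1+e^eps).
    A toy local-DP perturbation; FedHGNN's actual publishing method additionally
    preserves meta-path semantics, which this sketch ignores."""
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    keep = np.random.rand(*interactions.shape) < p_keep
    return np.where(keep, interactions, 1 - interactions)

user_vector = np.array([0, 1, 0, 0, 1, 1, 0, 0], dtype=np.int64)  # hypothetical interactions
print(randomized_response(user_vector, epsilon=2.0))
```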
Uncertainty in Automated Ontology Matching: Lessons Learned from an Empirical Experimentation
results: 实验结果表明,自动对照过程中存在较大的不确定性,而 semi-supervised 方法则显示出了更好的可靠性。Abstract
Data integration is considered a classic research field and a pressing need within the information science community. Ontologies play a critical role in such a process by providing well-consolidated support to link and semantically integrate datasets via interoperability. This paper approaches data integration from an application perspective, looking at techniques based on ontology matching. An ontology-based process may only be considered adequate by assuming manual matching of different sources of information. However, since the approach becomes unrealistic once the system scales up, automation of the matching process becomes a compelling need. Therefore, we have conducted experiments on actual data with the support of existing tools for automatic ontology matching from the scientific community. Even considering a relatively simple case study (i.e., the spatio-temporal alignment of global indicators), outcomes clearly show significant uncertainty resulting from errors and inaccuracies along the automated matching process. More concretely, this paper aims to test on real-world data a bottom-up knowledge-building approach, discuss the lessons learned from the experimental results of the case study, and draw conclusions about uncertainty and uncertainty management in an automated ontology matching process. While the most common evaluation metrics clearly demonstrate the unreliability of fully automated matching solutions, properly designed semi-supervised approaches seem to be mature for a more generalized application.
摘要
数据集成被视为信息科学领域的经典研究方向和紧迫需求。ontology 在这一过程中扮演了关键的支持角色,用于链接并语义集成数据集。本文从应用角度出发,研究基于 ontology 匹配技术的数据集成方法。然而,由于系统规模增加,人工匹配过程变得不现实,匹配过程的自动化因此成为迫切需求。为此,我们借助学术界现有的自动 ontology 匹配工具,在真实数据上进行了实验。即使在一个相对简单的 case study(即全球指标的时空对齐)中,实验结果也显示了自动匹配过程中由错误和不准确带来的显著不确定性。本文的目标是在真实数据上测试一种自底向上的知识构建方法,讨论 case study 实验结果中的经验教训,并对自动匹配过程中的不确定性及其管理得出结论。尽管常见的评价指标表明全自动匹配方案并不可靠,但设计得当的半监督方法似乎已经成熟,可用于更广泛的应用。
Quantify Health-Related Atomic Knowledge in Chinese Medical Large Language Models: A Computational Analysis
paper_authors: Yaxin Fan, Feng Jiang, Peifeng Li, Haizhou Li
for: This paper aims to evaluate the ability of large language models (LLMs) to provide accurate and factual suggestions for user self-diagnosis queries.
methods: The authors constructed a benchmark of common atomic knowledge in user self-diagnosis queries and evaluated both generic and specialized LLMs on this benchmark. They also performed error analysis and explored different types of data for fine-tuning specialized LLMs.
results: The results showed that generic LLMs perform better than specialized LLMs in terms of atomic knowledge and instruction-following ability, and that distilled data can benefit LLMs most. Additionally, the authors found that both generic and specialized LLMs are sycophantic, meaning they tend to cater to users’ claims when it comes to unknown knowledge.Abstract
Large Language Models (LLMs) have the potential to revolutionize the way users self-diagnose through search engines by offering direct and efficient suggestions. Recent studies primarily focused on the quality of LLMs evaluated by GPT-4 or their ability to pass medical exams, no studies have quantified the extent of health-related atomic knowledge stored in LLMs' memory, which is the basis of LLMs to provide more factual suggestions. In this paper, we first constructed a benchmark, including the most common types of atomic knowledge in user self-diagnosis queries, with 17 atomic types and a total of 14, 048 pieces of atomic knowledge. Then, we evaluated both generic and specialized LLMs on the benchmark. The experimental results showcased that generic LLMs perform better than specialized LLMs in terms of atomic knowledge and instruction-following ability. Error analysis revealed that both generic and specialized LLMs are sycophantic, e.g., always catering to users' claims when it comes to unknown knowledge. Besides, generic LLMs showed stronger safety, which can be learned by specialized LLMs through distilled data. We further explored different types of data commonly adopted for fine-tuning specialized LLMs, i.e., real-world, semi-distilled, and distilled data, and found that distilled data can benefit LLMs most.
摘要
大型语言模型(LLM)有望革新用户通过搜索引擎进行自我诊断的方式,提供直接且高效的建议。现有研究主要集中在用GPT-4评估LLM的质量,或考察LLM能否通过医学考试,但尚无研究量化LLM记忆中储存的健康相关原子知识的规模,而这正是LLM提供更符合事实的建议的基础。在这篇论文中,我们首先构建了一个基准,涵盖用户自我诊断查询中最常见的原子知识类型,共17种原子类型、14,048条原子知识。然后,我们在该基准上评估了通用LLM和医疗专用LLM。实验结果显示,通用LLM在原子知识和指令遵循能力方面均优于专用LLM。错误分析表明,通用与专用LLM都存在谄媚(sycophantic)倾向,即在涉及未知知识时总是迎合用户的说法。此外,通用LLM表现出更强的安全性,专用LLM可以通过蒸馏数据学习这种安全性。我们进一步考察了用于微调专用LLM的几类常见数据,即真实世界数据、半蒸馏数据和蒸馏数据,发现蒸馏数据对LLM帮助最大。
Enhancing Low-resource Fine-grained Named Entity Recognition by Leveraging Coarse-grained Datasets
results: 实验结果表明,我们的方法在只有少量细化标注时比$K$-shot学习和监督学习方法表现更好。Abstract
Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although $K$-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However, it cannot directly utilize the relationships between fine-grained and coarse-grained entities, although a fine-grained entity type is likely to be a subcategory of a coarse-grained entity type. We propose a fine-grained NER model with a Fine-to-Coarse(F2C) mapping matrix to leverage the hierarchical structure explicitly. In addition, we present an inconsistency filtering method to eliminate coarse-grained entities that are inconsistent with fine-grained entity types to avoid performance degradation. Our experimental results show that our method outperforms both $K$-shot learning and supervised learning methods when dealing with a small number of fine-grained annotations.
摘要
命名实体识别(NER)经常面临标注数据不足的问题,在细粒度NER场景中尤为突出。虽然可以使用$K$-shot学习技术,但当标注数量超过几十条后,其性能往往趋于饱和。为解决这一问题,我们利用提供大量标注的现有粗粒度数据集。一种直接的做法是预微调(pre-finetuning),即利用粗粒度数据进行表示学习;但这种做法无法直接利用细粒度实体与粗粒度实体之间的关系,尽管细粒度实体类型往往是某个粗粒度实体类型的子类别。我们提出了一种带有细到粗(F2C)映射矩阵的细粒度NER模型,以显式利用这种层级结构。此外,我们提出了一种不一致过滤方法,剔除与细粒度实体类型不一致的粗粒度实体,以避免性能下降。实验结果表明,在细粒度标注数量较少时,我们的方法优于$K$-shot学习和监督学习方法。
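A rough sketch of how a Fine-to-Coarse (F2C) mapping matrix can tie the two label granularities together is given below: coarse probabilities are obtained by summing the fine-grained probability mass of each coarse type's subtypes, so coarse annotations can supervise a fine-grained model. The label sets and tensor shapes are hypothetical; the paper's actual model and inconsistency filter are more involved.

```python
import torch

# Hypothetical label sets: fine-grained NER types grouped under coarse parents.
fine_types = ["athlete", "politician", "city", "river"]
coarse_of = {"athlete": "PER", "politician": "PER", "city": "LOC", "river": "LOC"}
coarse_types = ["PER", "LOC"]

# F2C mapping matrix: M[i, j] = 1 if fine type i belongs to coarse type j.
M = torch.zeros(len(fine_types), len(coarse_types))
for i, f in enumerate(fine_types):
    M[i, coarse_types.index(coarse_of[f])] = 1.0

# Per-token fine-grained probabilities -> coarse probabilities by summing subtype mass.
fine_probs = torch.softmax(torch.randn(5, len(fine_types)), dim=-1)  # (tokens, fine)
coarse_probs = fine_probs @ M                                        # (tokens, coarse)

# Coarse-grained annotations can then supervise `coarse_probs`, while fine-grained
# annotations supervise `fine_probs` directly; coarse labels inconsistent with the
# predicted fine type's parent would be filtered out in the paper's method.
```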
Learning Co-Speech Gesture for Multimodal Aphasia Type Detection
results: 对比 exist 方法,该研究实现了状态机器人的Result(F1 84.2%),并显示了手势特征的优越性, highlighting the significance of gesture expression in detecting aphasia types。Abstract
Aphasia, a language disorder resulting from brain damage, requires accurate identification of specific aphasia types, such as Broca's and Wernicke's aphasia, for effective treatment. However, little attention has been paid to developing methods to detect different types of aphasia. Recognizing the importance of analyzing co-speech gestures for distinguish aphasia types, we propose a multimodal graph neural network for aphasia type detection using speech and corresponding gesture patterns. By learning the correlation between the speech and gesture modalities for each aphasia type, our model can generate textual representations sensitive to gesture information, leading to accurate aphasia type detection. Extensive experiments demonstrate the superiority of our approach over existing methods, achieving state-of-the-art results (F1 84.2\%). We also show that gesture features outperform acoustic features, highlighting the significance of gesture expression in detecting aphasia types. We provide the codes for reproducibility purposes.
摘要
失语症是一种由脑损伤引起的语言障碍,有效治疗需要准确识别具体的失语症类型,例如布洛卡失语症(Broca's aphasia)和韦尼克失语症(Wernicke's aphasia)。然而,针对不同类型失语症检测方法的研究一直较少。鉴于分析伴随言语的手势对区分失语症类型的重要性,我们提出了一种多模态图神经网络,利用语音及对应的手势模式进行失语症类型检测。通过学习每种失语症类型下语音模态与手势模态之间的相关性,我们的模型能够生成对手势信息敏感的文本表示,从而实现准确的失语症类型检测。大量实验表明,我们的方法优于现有方法,取得了最先进的结果(F1 84.2%)。我们还发现手势特征优于声学特征,凸显了手势表达在检测失语症类型中的重要意义。我们公开了代码以便复现。
Live Graph Lab: Towards Open, Dynamic and Real Transaction Graphs with NFT
paper_authors: Zhen Zhang, Bingqiao Luo, Shengliang Lu, Bingsheng He
for: This paper is written for investigating the properties of the Non-fungible tokens (NFTs) ecosystem from a temporal graph analysis perspective.
methods: The paper uses a live graph with NFT transaction network, which is obtained by downloading and parsing the NFT transaction activities. The authors also use a series of measurements to understand the properties of the NFT ecosystem and compare it with social, citation, and web networks.
results: The paper provides new observations and insights into the characteristics of the emerging NFT ecosystem, including its dynamics and properties. The authors also study machine learning models in this live graph to enrich the current datasets and provide new opportunities for the graph community.Abstract
Numerous studies have been conducted to investigate the properties of large-scale temporal graphs. Despite the ubiquity of these graphs in real-world scenarios, it's usually impractical for us to obtain the whole real-time graphs due to privacy concerns and technical limitations. In this paper, we introduce the concept of {\it Live Graph Lab} for temporal graphs, which enables open, dynamic and real transaction graphs from blockchains. Among them, Non-fungible tokens (NFTs) have become one of the most prominent parts of blockchain over the past several years. With more than \$40 billion market capitalization, this decentralized ecosystem produces massive, anonymous and real transaction activities, which naturally forms a complicated transaction network. However, there is limited understanding about the characteristics of this emerging NFT ecosystem from a temporal graph analysis perspective. To mitigate this gap, we instantiate a live graph with NFT transaction network and investigate its dynamics to provide new observations and insights. Specifically, through downloading and parsing the NFT transaction activities, we obtain a temporal graph with more than 4.5 million nodes and 124 million edges. Then, a series of measurements are presented to understand the properties of the NFT ecosystem. Through comparisons with social, citation, and web networks, our analyses give intriguing findings and point out potential directions for future exploration. Finally, we also study machine learning models in this live graph to enrich the current datasets and provide new opportunities for the graph community. The source codes and dataset are available at https://livegraphlab.github.io.
摘要
许多研究已经对大规模时间图的性质进行了调查。尽管这类图在真实场景中无处不在,但由于隐私问题和技术限制,我们通常无法获得完整的实时图。在这篇论文中,我们提出了面向时间图的{\it Live Graph Lab}概念,它能够基于区块链提供开放、动态且真实的交易图。其中,非同质化代币(NFT)在过去几年已成为区块链最突出的组成部分之一。NFT的市场规模超过400亿美元,这个去中心化生态系统产生了海量、匿名且真实的交易活动,自然形成了复杂的交易网络。然而,从时间图分析的角度来看,人们对这一新兴NFT生态系统的特性仍了解有限。为缩小这一差距,我们用NFT交易网络实例化了一个live graph,并研究其动态特性,以提供新的观察与见解。具体来说,通过下载并解析NFT交易活动,我们获得了一个包含超过450万个节点和1.24亿条边的时间图。随后,我们给出了一系列度量来理解NFT生态系统的特性。通过与社交、引用和网页网络的比较,我们的分析得到了一些有趣的发现,并指出了未来值得探索的方向。最后,我们还在这个live graph上研究了机器学习模型,以丰富现有数据集并为图研究社区提供新的机会。源代码和数据集可在 https://livegraphlab.github.io 获取。
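For intuition, the snippet below builds a small temporal multigraph from hypothetical parsed NFT transfer records and runs the kind of windowed measurement such an analysis relies on. The field names and addresses are made up; the actual Live Graph Lab pipeline parses on-chain data at a vastly larger scale.

```python
import networkx as nx

# Hypothetical parsed NFT transfer records: (from_address, to_address, unix timestamp, token id).
transfers = [
    ("0xAlice", "0xBob",   1650000000, "BAYC#1"),
    ("0xBob",   "0xCarol", 1650003600, "BAYC#1"),
    ("0xAlice", "0xCarol", 1650007200, "Punk#7"),
]

# A temporal multigraph: nodes are addresses, each edge keeps its timestamp and token.
G = nx.MultiDiGraph()
for src, dst, ts, token in transfers:
    G.add_edge(src, dst, time=ts, token=token)

# Typical temporal-graph measurements: basic counts, and edges within a time window.
print(G.number_of_nodes(), G.number_of_edges())
window_edges = [(u, v) for u, v, d in G.edges(data=True)
                if 1650000000 <= d["time"] < 1650006000]
print(window_edges)
```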
A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
results: 本文提供了现有Vector数据库的挑战和大语言模型的组合,以及它们在新的可能性领域中的应用。Abstract
A vector database is used to store high-dimensional data that cannot be characterized by traditional DBMS. Although there are not many articles describing existing or introducing new vector database architectures, the approximate nearest neighbor search problem behind vector databases has been studied for a long time, and considerable related algorithmic articles can be found in the literature. This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area. The basis of our framework categorises these studies by the approach of solving ANNS problem, respectively hash-based, tree-based, graph-based and quantization-based approaches. Then we present an overview of existing challenges for vector databases. Lastly, we sketch how vector databases can be combined with large language models and provide new possibilities.
摘要
向量数据库用于存储传统DBMS难以刻画的高维数据。尽管介绍现有或新的向量数据库架构的文章并不多,但向量数据库背后的近似最近邻搜索(ANNS)问题已被研究多年,文献中可以找到大量相关的算法文章。本文尝试对相关算法进行较为全面的综述,以帮助读者对这一蓬勃发展的研究领域建立整体认识。我们的框架按照求解ANNS问题的方法,将这些研究分为基于哈希、基于树、基于图和基于量化四类。随后,我们概述了向量数据库当前面临的挑战。最后,我们探讨了向量数据库如何与大型语言模型相结合,并带来新的可能性。
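As a concrete (library-level, not survey-specific) illustration of the quantization-based approach to approximate nearest neighbor search, the sketch below contrasts exact brute-force search with an IVF-PQ index in Faiss. The index parameters are arbitrary choices for the example; real deployments tune the number of cells, sub-quantizers and nprobe for their recall/latency target.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nb, nq = 128, 10000, 5
xb = np.random.rand(nb, d).astype("float32")   # database vectors
xq = np.random.rand(nq, d).astype("float32")   # query vectors

# Exact baseline: brute-force L2 search.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 10)

# Quantization-based ANN: inverted file with product quantization (IVF-PQ).
coarse = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(coarse, d, 100, 16, 8)  # 100 cells, 16 sub-quantizers, 8 bits each
ivfpq.train(xb)
ivfpq.add(xb)
ivfpq.nprobe = 8                                 # cells visited per query: recall/speed knob
D_ann, I_ann = ivfpq.search(xq, 10)
```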
Runner re-identification from single-view video in the open-world setting
methods: 该论文使用了预训练的YOLOv8、微调的EfficientNet,以及基于门控循环单元(GRU)自编码器的无监督模型来自动处理单视角视频,并利用跑步动态特征来提高重识别精度。
results: 该论文在一个跑步训练视频数据集上进行了测试,结果显示,与一种最先进的无监督重识别模型相比,该方法能够以更高的准确率重识别跑者;此外,论文还验证了无监督跑步动态特征提取器的有效性。该跑者重识别系统可用于跑步视频的自动分析。Abstract
In many sports, player re-identification is crucial for automatic video processing and analysis. However, most of the current studies on player re-identification in multi- or single-view sports videos focus on re-identification in the closed-world setting using labeled image dataset, and player re-identification in the open-world setting for automatic video analysis is not well developed. In this paper, we propose a runner re-identification system that directly processes single-view video to address the open-world setting. In the open-world setting, we cannot use labeled dataset and have to process video directly. The proposed system automatically processes raw video as input to identify runners, and it can identify runners even when they are framed out multiple times. For the automatic processing, we first detect the runners in the video using the pre-trained YOLOv8 and the fine-tuned EfficientNet. We then track the runners using ByteTrack and detect their shoes with the fine-tuned YOLOv8. Finally, we extract the image features of the runners using an unsupervised method using the gated recurrent unit autoencoder model. To improve the accuracy of runner re-identification, we use dynamic features of running sequence images. We evaluated the system on a running practice video dataset and showed that the proposed method identified runners with higher accuracy than one of the state-of-the-art models in unsupervised re-identification. We also showed that our unsupervised running dynamic feature extractor was effective for runner re-identification. Our runner re-identification system can be useful for the automatic analysis of running videos.
摘要
在许多运动项目中,运动员重识别对于视频的自动处理与分析至关重要。然而,目前关于多视角或单视角运动视频中运动员重识别的研究大多集中在封闭世界设定下、依赖带标注的图像数据集,而面向自动视频分析的开放世界运动员重识别仍缺乏充分研究。在这篇论文中,我们提出了一个直接处理单视角视频的跑者重识别系统,以应对开放世界设定。在开放世界设定下,我们无法使用带标注的数据集,必须直接处理视频。所提出的系统以原始视频为输入自动进行处理并识别跑者,即使跑者多次离开画面也能重新识别。在自动处理过程中,我们首先使用预训练的YOLOv8和微调的EfficientNet在视频中检测跑者,然后使用ByteTrack跟踪跑者,并使用微调的YOLOv8检测他们的跑鞋。最后,我们使用基于门控循环单元(GRU)自编码器的无监督方法提取跑者的图像特征。为了提高跑者重识别的准确率,我们还利用了跑步序列图像的动态特征。我们在一个跑步训练视频数据集上评估了该系统,结果表明所提方法在无监督重识别方面的准确率高于一种最先进的模型,并且我们的无监督跑步动态特征提取器对跑者重识别是有效的。我们的跑者重识别系统可用于跑步视频的自动分析。
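The unsupervised feature extractor described above can be approximated by a plain GRU autoencoder over per-tracklet frame features, as in the sketch below. Feature dimensions, sequence lengths and the teacher-forcing setup are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GRUAutoencoder(nn.Module):
    """Encode a runner's sequence of per-frame features into a single embedding,
    then reconstruct the sequence; the bottleneck serves as a re-identification feature."""
    def __init__(self, feat_dim=1280, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, seq):                       # seq: (B, T, feat_dim)
        _, h = self.encoder(seq)                  # h: (1, B, hidden) sequence embedding
        # Teacher forcing with the shifted input; decode conditioned on the bottleneck.
        shifted = torch.cat([torch.zeros_like(seq[:, :1]), seq[:, :-1]], dim=1)
        dec, _ = self.decoder(shifted, h)
        return self.out(dec), h.squeeze(0)

model = GRUAutoencoder()
frames = torch.randn(4, 30, 1280)                 # 4 tracklets, 30 frames of CNN features each
recon, embedding = model(frames)
loss = nn.functional.mse_loss(recon, frames)      # unsupervised reconstruction objective
# Rows of `embedding` can be compared (e.g. cosine similarity) to re-identify runners.
```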
Architectural Implications of GNN Aggregation Programming Abstractions
results: 研究发现,使用不同的抽象方法可以得到不同的性能和效率。同时,对于某些特定的图数据处理任务,某些抽象方法可以显著提高性能。Abstract
Graph neural networks (GNNs) have gained significant popularity due to the powerful capability to extract useful representations from graph data. As the need for efficient GNN computation intensifies, a variety of programming abstractions designed for optimizing GNN Aggregation have emerged to facilitate acceleration. However, there is no comprehensive evaluation and analysis upon existing abstractions, thus no clear consensus on which approach is better. In this letter, we classify existing programming abstractions for GNN Aggregation by the dimension of data organization and propagation method. By constructing these abstractions on a state-of-the-art GNN library, we perform a thorough and detailed characterization study to compare their performance and efficiency, and provide several insights on future GNN acceleration based on our analysis.
摘要
图神经网络(GNN)凭借从图数据中提取有用表示的强大能力,受到了广泛关注。随着对高效GNN计算需求的增长,各种旨在优化GNN聚合(Aggregation)的编程抽象相继出现,以便加速计算。然而,目前尚缺乏对现有抽象的全面评估与分析,因此对于哪种方法更优并无明确共识。在这封信中,我们按照数据组织维度和传播方式对现有的GNN聚合编程抽象进行了分类。通过在一个最先进的GNN库上实现这些抽象,我们进行了深入细致的性能与效率表征研究,以比较它们的表现,并基于分析结果为未来的GNN加速提供了若干见解。
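To make the classification concrete, the snippet below expresses the same neighbor-sum aggregation through two common programming abstractions: an edge-centric gather/scatter formulation and a sparse matrix-matrix multiplication (SpMM) over the adjacency matrix. The toy graph is arbitrary; the point is that identical semantics map to very different data layouts and hardware behavior.

```python
import torch

num_nodes, feat_dim = 5, 8
x = torch.randn(num_nodes, feat_dim)
edge_index = torch.tensor([[0, 1, 2, 3, 4, 1],     # source nodes
                           [1, 0, 1, 4, 3, 2]])    # destination nodes

# Abstraction 1: edge-centric gather/scatter (message passing).
src, dst = edge_index
messages = x[src]                                   # gather source features per edge
agg_scatter = torch.zeros(num_nodes, feat_dim).index_add_(0, dst, messages)

# Abstraction 2: sparse matrix-matrix multiplication (SpMM) with the adjacency matrix.
values = torch.ones(edge_index.size(1))
adj = torch.sparse_coo_tensor(edge_index[[1, 0]], values,
                              (num_nodes, num_nodes)).coalesce()
agg_spmm = torch.sparse.mm(adj, x)

print(torch.allclose(agg_scatter, agg_spmm, atol=1e-6))  # same result, different data layout
```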
Quantum Acceleration of Infinite Horizon Average-Reward Reinforcement Learning
results: 通过严格的理论分析,我们证明量子均值估计的优势使量子算法在无穷时域强化学习的遗憾界上取得指数级改进。具体来说,我们的量子算法可以达到 $\tilde{\mathcal{O}}(1)$ 的 regret bound,明显优于经典算法的 $\tilde{\mathcal{O}}(\sqrt{T})$ bound。Abstract
This paper investigates the potential of quantum acceleration in addressing infinite horizon Markov Decision Processes (MDPs) to enhance average reward outcomes. We introduce an innovative quantum framework for the agent's engagement with an unknown MDP, extending the conventional interaction paradigm. Our approach involves the design of an optimism-driven tabular Reinforcement Learning algorithm that harnesses quantum signals acquired by the agent through efficient quantum mean estimation techniques. Through thorough theoretical analysis, we demonstrate that the quantum advantage in mean estimation leads to exponential advancements in regret guarantees for infinite horizon Reinforcement Learning. Specifically, the proposed Quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$, a significant improvement over the $\tilde{\mathcal{O}}(\sqrt{T})$ bound exhibited by classical counterparts.
摘要
results: 该系统可以帮助研究人员轻松地获得高级知识和详细参考,并且可以交互地循序搜索到有关的信息。在COVID-19研究中,该系统得到了广泛的应用,如药物重用和文献筛选。Abstract
We present a novel system that automatically extracts and generates informative and descriptive sentences from the biomedical corpus and facilitates the efficient search for relational knowledge. Unlike previous search engines or exploration systems that retrieve unconnected passages, our system organizes descriptive sentences as a relational graph, enabling researchers to explore closely related biomedical entities (e.g., diseases treated by a chemical) or indirectly connected entities (e.g., potential drugs for treating a disease). Our system also uses ChatGPT and a fine-tuned relation synthesis model to generate concise and reliable descriptive sentences from retrieved information, reducing the need for extensive human reading effort. With our system, researchers can easily obtain both high-level knowledge and detailed references and interactively steer to the information of interest. We spotlight the application of our system in COVID-19 research, illustrating its utility in areas such as drug repurposing and literature curation.
摘要
我们提出了一种新系统,能够自动从生物医学语料中抽取并生成信息丰富、具有描述性的句子,从而支持对关系知识的高效检索。与以往只能检索彼此孤立段落的搜索引擎或探索系统不同,我们的系统将描述性句子组织成关系图,使研究人员能够探索紧密相关的生物医学实体(例如,某种化学物质可治疗的疾病),以及间接关联的实体(例如,可能用于治疗某种疾病的潜在药物)。我们的系统还利用ChatGPT和一个经过微调的关系合成模型,从检索到的信息中生成简洁且可靠的描述性句子,从而减少大量的人工阅读工作。借助我们的系统,研究人员可以轻松获得高层次知识和详细的参考文献,并以交互方式逐步定位到感兴趣的信息。我们重点展示了该系统在COVID-19研究中的应用,例如药物重定位和文献整理。
Using Experience Classification for Training Non-Markovian Tasks
results: 通过在多个 benchmark 问题中实践,证明了我们的方法的可行性和效果。Abstract
Unlike the standard Reinforcement Learning (RL) model, many real-world tasks are non-Markovian, whose rewards are predicated on state history rather than solely on the current state. Solving a non-Markovian task, frequently applied in practical applications such as autonomous driving, financial trading, and medical diagnosis, can be quite challenging. We propose a novel RL approach to achieve non-Markovian rewards expressed in temporal logic LTL$_f$ (Linear Temporal Logic over Finite Traces). To this end, an encoding of linear complexity from LTL$_f$ into MDPs (Markov Decision Processes) is introduced to take advantage of advanced RL algorithms. Then, a prioritized experience replay technique based on the automata structure (semantics equivalent to LTL$_f$ specification) is utilized to improve the training process. We empirically evaluate several benchmark problems augmented with non-Markovian tasks to demonstrate the feasibility and effectiveness of our approach.
摘要
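One standard way to make an LTL$_f$ reward Markovian, which matches the encoding idea described above at a very high level, is to run the formula's automaton alongside the environment and reward the agent on the product state. The wrapper below hand-writes a tiny DFA for "eventually A, then B" and assumes an old gym-style step() returning a 4-tuple; a real pipeline would compile the DFA from the LTL$_f$ formula and add the paper's automata-based prioritized replay on top.

```python
class LTLfProductEnv:
    """Wrap an environment so that reward depends on a DFA tracking the LTLf objective.
    The DFA below hand-encodes "eventually reach A, and afterwards reach B";
    real pipelines would compile it from an LTLf formula."""
    def __init__(self, env, labeler):
        self.env = env
        self.labeler = labeler              # maps an env state to a set of atomic propositions
        self.dfa_transitions = {            # (dfa_state, label set) -> next dfa_state
            (0, frozenset()): 0, (0, frozenset({"A"})): 1,
            (1, frozenset()): 1, (1, frozenset({"B"})): 2,
            (2, frozenset()): 2,
        }
        self.accepting = {2}

    def reset(self):
        self.q = 0
        s = self.env.reset()
        return (s, self.q)                   # augmented (env state, DFA state)

    def step(self, action):
        # Assumes an old gym-style 4-tuple step API.
        s, _, done, info = self.env.step(action)
        label = frozenset(self.labeler(s))
        self.q = self.dfa_transitions.get((self.q, label), self.q)
        reward = 1.0 if self.q in self.accepting else 0.0   # Markovian over (s, q)
        return (s, self.q), reward, done, info
```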
Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes
results: 该算法在一般参数化设定下可以达到 $\mathcal{O}(\epsilon^{-2})$ 的样本复杂度和 $\mathcal{O}(\epsilon^{-1})$ 的迭代复杂度,将现有最优样本复杂度改进了一个 $\log(\frac{1}{\epsilon})$ 因子。此外,该算法不需要重要性采样(IS)权重方差有上界这一无法验证的假设。在Hessian-free且IS-free的算法类中,ANPG将已知最优样本复杂度改进了 $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ 倍,同时达到了同类算法最优的迭代复杂度。Abstract
We consider the problem of designing sample efficient learning algorithms for infinite horizon discounted reward Markov Decision Process. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
摘要
我们考虑为无穷时域折扣奖励马尔可夫决策过程设计样本高效的学习算法。具体来说,我们提出了加速自然策略梯度(ANPG)算法,它利用加速随机梯度下降过程来获得自然策略梯度。在一般参数化设定下,ANPG 实现了 $\mathcal{O}(\epsilon^{-2})$ 的样本复杂度和 $\mathcal{O}(\epsilon^{-1})$ 的迭代复杂度,其中 $\epsilon$ 表示最优性误差。这将现有最优样本复杂度改进了一个 $\log(\frac{1}{\epsilon})$ 因子。ANPG 是一阶算法,并且与部分现有工作不同,它不需要重要性采样(IS)权重方差有上界这一无法验证的假设。在Hessian-free且IS-free的算法类中,ANPG 将已知最优样本复杂度改进了 $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ 倍,同时达到了这类算法最优的迭代复杂度。
PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly Detection
results: 经过对五种真实世界数据集的严格评估,PREM方法显示了robustness和效果。特别是在ACM数据集上,PREM方法与最高效的基线方法相比,提高了5%的AUC,提高了9倍的训练速度,并大幅降低内存使用量。Abstract
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in various domains such as medicine, social networks, and e-commerce. However, challenges have arisen due to the diversity of anomalies and the dearth of labeled data. Existing methodologies - reconstruction-based and contrastive learning - while effective, often suffer from efficiency issues, stemming from their complex objectives and elaborate modules. To improve the efficiency of GAD, we introduce a simple method termed PREprocessing and Matching (PREM for short). Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities. Comprising two modules - a pre-processing module and an ego-neighbor matching module - PREM eliminates the necessity for message-passing propagation during training, and employs a simple contrastive loss, leading to considerable reductions in training time and memory usage. Moreover, through rigorous evaluations of five real-world datasets, our method demonstrated robustness and effectiveness. Notably, when validated on the ACM dataset, PREM achieved a 5% improvement in AUC, a 9-fold increase in training speed, and sharply reduce memory usage compared to the most efficient baseline.
摘要
节点级图异常检测(GAD)在医学、社交网络和电子商务等诸多领域中扮演着重要角色,用于从图结构数据中识别异常节点。然而,异常形式的多样性以及标注数据的匮乏带来了诸多挑战。现有方法(基于重构的方法和对比学习方法)虽然有效,却往往因目标函数复杂、模块繁琐而存在效率问题。为提高GAD的效率,我们提出了一种简单的方法,称为PREprocessing and Matching(简称PREM)。我们的方法精简了GAD流程,在保持强大异常检测能力的同时降低了时间和内存开销。PREM由两个模块组成:预处理模块和ego-邻居匹配模块。它不需要在训练期间进行消息传递,并采用一个简单的对比损失,从而显著减少了训练时间和内存占用。此外,我们在五个真实世界数据集上进行了严格评估,验证了该方法的鲁棒性和有效性。特别是在ACM数据集上,与效率最高的基线相比,PREM将AUC提升了5%,训练速度提高了9倍,并大幅降低了内存使用量。
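A stripped-down reading of the two modules is sketched below: a one-off pre-processing step that averages neighbor features (so no message passing is needed during training), and an ego-neighbor agreement score that can be turned into a contrastive objective by pulling true (ego, neighborhood) pairs together and shuffled pairs apart. Encoder size and the scoring function are illustrative guesses, not PREM's exact design.

```python
import torch
import torch.nn.functional as F

def preprocess_neighbor_means(x, edge_index):
    """One-off pre-processing: average neighbor features per node (no propagation during training)."""
    src, dst = edge_index
    deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(dst.size(0))).clamp(min=1)
    nbr_sum = torch.zeros_like(x).index_add_(0, dst, x[src])
    return nbr_sum / deg.unsqueeze(-1)

def ego_neighbor_anomaly_scores(ego_emb, nbr_emb):
    """Anomaly score = low agreement between a node's embedding and its neighborhood embedding."""
    ego = F.normalize(ego_emb, dim=-1)
    nbr = F.normalize(nbr_emb, dim=-1)
    return 1.0 - (ego * nbr).sum(dim=-1)        # 0 = perfectly matched, 2 = opposite

# Toy usage with random features and a tiny shared encoder for ego and neighbor views.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
encoder = torch.nn.Linear(16, 32)
nbr_feats = preprocess_neighbor_means(x, edge_index)
scores = ego_neighbor_anomaly_scores(encoder(x), encoder(nbr_feats))
```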
Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning
results: 这篇论文的实验结果显示,PHA方法在多任务学习和几少例转移学习中比较其他强基eline方法表现更好,尤其是当资料量变少时。Abstract
Parameter-efficient fine-tuning (PEFT) has shown its effectiveness in adapting the pre-trained language models to downstream tasks while only updating a small number of parameters. Despite the success, most existing methods independently adapt to each task without considering knowledge transfer between tasks and are limited to low-data regimes. To overcome this issue, we propose Prototype-based HyperAdapter (PHA), a novel framework built on the adapter-tuning and hypernetwork. It introduces an instance-dense retriever and a prototypical hypernetwork to generate the conditional modules in a sample-efficient manner. This leads to comparable performance improvements against existing PEFT methods on multi-task learning and few-shot transfer learning. More importantly, when the available data size gets smaller, our method outperforms other strong baselines by a large margin. Based on our extensive empirical experiments across various datasets, we demonstrate that PHA strikes a better trade-off between trainable parameters, accuracy on stream tasks, and sample efficiency.
摘要
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
results: 发现GPT-4在SOTOPIA-hard subsets中表现较差,其社交常识理解和战略通信技能受限,而人类则在这些 subsets中表现出优异的社会智能能力。Abstract
Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.
摘要
人类是社交生物,我们在日常互动中追求社交目标,这是人工智能系统的能力领域中的一个关键方面。然而,人工智能系统在这个领域的能力仍然尚未得到解释。我们提出了SOTOPIA,一个开放式环境,用于模拟人工智能代理人在复杂社交交互中的表现。在我们的环境中,代理人扮演和互动,在多种情况下协同合作、交换和竞争以完成复杂社交目标。我们在这个任务空间中模拟了LLM基于代理人和人类之间的角色扮演互动,并使用SOTOPIA-Eval全面评价框架进行评估。与SOTOPIA的使用,我们发现了不同的人工智能模型在社交智能方面存在显著差异,并确定了一个通用难度集合(SOTOPIA-hard),该集合对所有模型都是挑战性的。我们发现在这个集合中,GPT-4的目标完成率远低于人类,并且它很难展现社交感知和战略通信技能。这些发现表明SOTOPIA的潜在价值,作为一个通用的人工智能社交评价和改进平台。
Hetero$^2$Net: Heterophily-aware Representation Learning on Heterogenerous Graphs
for: 本研究旨在 investigating the heterophily properties in heterogeneous graphs, and developing a heterophily-aware graph neural network (HGNN) to effectively handle more complex heterogeneous graphs.
methods: 我们使用 metapaths to identify the heterophily in heterogeneous graphs, and propose two practical metrics to quantitatively describe the levels of heterophily. We also introduce Hetero$^2$Net, a heterophily-aware HGNN that incorporates both masked metapath prediction and masked label prediction tasks to effectively handle both homophilic and heterophilic heterogeneous graphs.
results: 我们在 five real-world heterogeneous graph benchmarks with varying levels of heterophily 上 evaluate the performance of Hetero$^2$Net, and demonstrate that it outperforms strong baselines in the semi-supervised node classification task, providing valuable insights into effectively handling more complex heterogeneous graphs.Abstract
Real-world graphs are typically complex, exhibiting heterogeneity in the global structure, as well as strong heterophily within local neighborhoods. While a growing body of literature has revealed the limitations of common graph neural networks (GNNs) in handling homogeneous graphs with heterophily, little work has been conducted on investigating the heterophily properties in the context of heterogeneous graphs. To bridge this research gap, we identify the heterophily in heterogeneous graphs using metapaths and propose two practical metrics to quantitatively describe the levels of heterophily. Through in-depth investigations on several real-world heterogeneous graphs exhibiting varying levels of heterophily, we have observed that heterogeneous graph neural networks (HGNNs), which inherit many mechanisms from GNNs designed for homogeneous graphs, fail to generalize to heterogeneous graphs with heterophily or low level of homophily. To address the challenge, we present Hetero$^2$Net, a heterophily-aware HGNN that incorporates both masked metapath prediction and masked label prediction tasks to effectively and flexibly handle both homophilic and heterophilic heterogeneous graphs. We evaluate the performance of Hetero$^2$Net on five real-world heterogeneous graph benchmarks with varying levels of heterophily. The results demonstrate that Hetero$^2$Net outperforms strong baselines in the semi-supervised node classification task, providing valuable insights into effectively handling more complex heterogeneous graphs.
摘要
To address this gap, we identify the heterophily in heterogeneous graphs using metapaths and propose two practical metrics to quantitatively describe the levels of heterophily. Through in-depth investigations on several real-world heterogeneous graphs with varying levels of heterophily, we find that existing heterogeneous graph neural networks (HGNNs) fail to generalize to heterogeneous graphs with heterophily or low levels of homophily.To address this challenge, we present Hetero$^2$Net, a heterophily-aware HGNN that incorporates both masked metapath prediction and masked label prediction tasks to effectively and flexibly handle both homophilic and heterophilic heterogeneous graphs. We evaluate the performance of Hetero$^2$Net on five real-world heterogeneous graph benchmarks with varying levels of heterophily, and the results show that Hetero$^2$Net outperforms strong baselines in the semi-supervised node classification task, providing valuable insights into effectively handling more complex heterogeneous graphs.
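For intuition on metapath-based heterophily, the toy code below computes one plausible analogue of such a metric: the fraction of author pairs connected by an Author-Paper-Author metapath whose labels differ. The paper defines two specific metrics; this sketch only conveys the general idea on a hand-built graph.

```python
from itertools import product

# Toy heterogeneous graph: authors write papers; author labels are research areas.
writes = {"a1": ["p1", "p2"], "a2": ["p2"], "a3": ["p3"]}
author_label = {"a1": "ML", "a2": "DB", "a3": "ML"}

def apa_pairs(writes):
    """Author-Paper-Author metapath instances (ordered pairs of distinct authors)."""
    paper_to_authors = {}
    for a, papers in writes.items():
        for p in papers:
            paper_to_authors.setdefault(p, []).append(a)
    for authors in paper_to_authors.values():
        for u, v in product(authors, authors):
            if u != v:
                yield u, v

def metapath_heterophily(pairs, labels):
    pairs = list(pairs)
    if not pairs:
        return 0.0
    different = sum(labels[u] != labels[v] for u, v in pairs)
    return different / len(pairs)            # 0 = homophilic, 1 = fully heterophilic

print(metapath_heterophily(apa_pairs(writes), author_label))  # a1 and a2 share p2 but differ in label
```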
Cloud-Magnetic Resonance Imaging System: In the Era of 6G and Artificial Intelligence
paper_authors: Yirong Zhou, Yanhuang Wu, Yuhan Su, Jing Li, Jianyun Cai, Yongfu You, Di Guo, Xiaobo Qu
for: 解决医疗机构年度生成巨量数据问题,提高医疗诊断精度和工作效率。
methods: integrating 分布式云计算、6G频率、边缘计算、联合学习和区块链技术。
results: 提高数据存储安全性、传输速度、人工智能算法维护、硬件升级和交叉机构医疗协作。Abstract
Magnetic Resonance Imaging (MRI) plays an important role in medical diagnosis, generating petabytes of image data annually in large hospitals. This voluminous data stream requires a significant amount of network bandwidth and extensive storage infrastructure. Additionally, local data processing demands substantial manpower and hardware investments. Data isolation across different healthcare institutions hinders cross-institutional collaboration in clinics and research. In this work, we anticipate an innovative MRI system and its four generations that integrate emerging distributed cloud computing, 6G bandwidth, edge computing, federated learning, and blockchain technology. This system is called Cloud-MRI, aiming at solving the problems of MRI data storage security, transmission speed, AI algorithm maintenance, hardware upgrading, and collaborative work. The workflow commences with the transformation of k-space raw data into the standardized Imaging Society for Magnetic Resonance in Medicine Raw Data (ISMRMRD) format. Then, the data are uploaded to the cloud or edge nodes for fast image reconstruction, neural network training, and automatic analysis. Then, the outcomes are seamlessly transmitted to clinics or research institutes for diagnosis and other services. The Cloud-MRI system will save the raw imaging data, reduce the risk of data loss, facilitate inter-institutional medical collaboration, and finally improve diagnostic accuracy and work efficiency.
摘要
磁共振成像(MRI)在医学诊断中发挥着重要作用,大型医院每年产生PB级的影像数据。如此庞大的数据流需要大量的网络带宽和庞大的存储基础设施;同时,本地数据处理也需要可观的人力和硬件投入。不同医疗机构之间的数据隔离阻碍了临床与科研中的跨机构协作。在这项工作中,我们展望了一种创新的MRI系统及其四代演进,该系统融合了新兴的分布式云计算、6G带宽、边缘计算、联邦学习和区块链技术。这一系统被称为Cloud-MRI,旨在解决MRI数据存储安全、传输速度、AI算法维护、硬件升级以及协同工作等问题。其工作流程首先将k空间原始数据转换为标准化的ISMRMRD格式,然后将数据上传至云端或边缘节点,进行快速图像重建、神经网络训练和自动分析,最终将结果无缝传输到临床或科研机构,用于诊断及其他服务。Cloud-MRI系统将保存原始成像数据、降低数据丢失风险、促进跨机构医疗协作,并最终提高诊断准确率和工作效率。
A Symbolic Language for Interpreting Decision Trees
methods: 该论文使用了一种名为StratiFOILed的精心构造的 fragments of first-ordered logic,可以计算多种后期解释,包括本地解释(如推理和对比解释)和全局解释(如特征相关性)。
results: 该论文提出了ExplainDT,一种符号语言用于解释decision trees,可以根据用户需求来定制查询。StratiFOILed queries可以写作Boolean combination of NP-problems,可以在实践中使用常数数量的SAT解决器调用来评估。Abstract
The recent development of formal explainable AI has disputed the folklore claim that "decision trees are readily interpretable models", showing different interpretability queries that are computationally hard on decision trees, as well as proposing different methods to deal with them in practice. Nonetheless, no single explainability query or score works as a "silver bullet" that is appropriate for every context and end-user. This naturally suggests the possibility of "interpretability languages" in which a wide variety of queries can be expressed, giving control to the end-user to tailor queries to their particular needs. In this context, our work presents ExplainDT, a symbolic language for interpreting decision trees. ExplainDT is rooted in a carefully constructed fragment of first-ordered logic that we call StratiFOILed. StratiFOILed balances expressiveness and complexity of evaluation, allowing for the computation of many post-hoc explanations--both local (e.g., abductive and contrastive explanations) and global ones (e.g., feature relevancy)--while remaining in the Boolean Hierarchy over NP. Furthermore, StratiFOILed queries can be written as a Boolean combination of NP-problems, thus allowing us to evaluate them in practice with a constant number of calls to a SAT solver. On the theoretical side, our main contribution is an in-depth analysis of the expressiveness and complexity of StratiFOILed, while on the practical side, we provide an optimized implementation for encoding StratiFOILed queries as propositional formulas, together with an experimental study on its efficiency.
摘要
形式化可解释AI的最新进展对“决策树是天然可解释模型”这一流行说法提出了质疑,指出若干可解释性查询在决策树上的计算是困难的,并提出了多种在实践中应对这些查询的方法。然而,没有任何单一的可解释性查询或评分能够作为适用于所有场景和所有用户的“万能钥匙”。这自然引出了“可解释性语言”的想法:在这种语言中可以表达多种多样的查询,让最终用户能够根据自身需求定制查询。在这一背景下,我们提出了ExplainDT,一种用于解释决策树的符号语言。ExplainDT建立在我们精心构造的一阶逻辑片段StratiFOILed之上。StratiFOILed在表达能力和求值复杂度之间取得平衡,能够计算多种事后解释,既包括局部解释(例如归因式解释和对比式解释),也包括全局解释(例如特征相关性),同时其复杂度仍保持在NP之上的布尔层级(Boolean Hierarchy)内。此外,StratiFOILed查询可以写成若干NP问题的布尔组合,因此在实践中只需对SAT求解器进行常数次调用即可完成求值。在理论方面,我们的主要贡献是对StratiFOILed的表达能力与复杂度进行了深入分析;在实践方面,我们提供了将StratiFOILed查询编码为命题公式的优化实现,并对其效率进行了实验研究。
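One family of local explanation queries such a language can express is whether a partial instance is a sufficient (abductive) reason for a prediction. The brute-force check below is only meant to convey what the query asks on a tiny hand-written tree; ExplainDT itself answers such queries by encoding StratiFOILed formulas for a SAT solver rather than by enumeration.

```python
from itertools import product

# Tiny hand-written decision tree over 3 boolean features:
# node = ("leaf", class) or ("split", feature_index, left_subtree, right_subtree)
tree = ("split", 0,
        ("split", 1, ("leaf", 0), ("leaf", 1)),
        ("leaf", 1))

def predict(node, x):
    while node[0] == "split":
        _, f, left, right = node
        node = right if x[f] else left
    return node[1]

def is_sufficient_reason(partial, num_features, target_class):
    """Brute force: does every completion of the fixed features yield `target_class`?
    `partial` maps feature index -> fixed boolean value."""
    free = [f for f in range(num_features) if f not in partial]
    for bits in product([0, 1], repeat=len(free)):
        x = dict(partial)
        x.update(zip(free, bits))
        if predict(tree, [x[i] for i in range(num_features)]) != target_class:
            return False
    return True

print(is_sufficient_reason({0: 1}, 3, 1))          # fixing feature 0 = 1 forces class 1
print(is_sufficient_reason({1: 1}, 3, 1))          # feature 1 = 1 alone also suffices here
```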
results: 论文通过对多个未看过的数据集进行严格的训练,证明REMARK-LLM可以插入2倍多的签名比特数据入文本中,同时保持 semantic integrity,并且在各种水印检测和移除攻击下展现出更好的抗性。Abstract
We present REMARK-LLM, a novel efficient, and robust watermarking framework designed for texts generated by large language models (LLMs). Synthesizing human-like content using LLMs necessitates vast computational resources and extensive datasets, encapsulating critical intellectual property (IP). However, the generated content is prone to malicious exploitation, including spamming and plagiarism. To address the challenges, REMARK-LLM proposes three new components: (i) a learning-based message encoding module to infuse binary signatures into LLM-generated texts; (ii) a reparameterization module to transform the dense distributions from the message encoding to the sparse distribution of the watermarked textual tokens; (iii) a decoding module dedicated for signature extraction; Furthermore, we introduce an optimized beam search algorithm to guarantee the coherence and consistency of the generated content. REMARK-LLM is rigorously trained to encourage the preservation of semantic integrity in watermarked content, while ensuring effective watermark retrieval. Extensive evaluations on multiple unseen datasets highlight REMARK-LLM proficiency and transferability in inserting 2 times more signature bits into the same texts when compared to prior art, all while maintaining semantic integrity. Furthermore, REMARK-LLM exhibits better resilience against a spectrum of watermark detection and removal attacks.
摘要
我们提出REMARK-LLM,一种面向大型语言模型(LLM)生成文本的新型高效且鲁棒的水印框架。利用LLM合成类人内容需要庞大的计算资源和海量数据集,其中蕴含着重要的知识产权(IP)。然而,生成的内容容易被恶意利用,例如垃圾信息和抄袭。为应对这些挑战,REMARK-LLM提出了三个新组件:(i)一个基于学习的消息编码模块,将二进制签名注入LLM生成的文本中;(ii)一个重参数化模块,将消息编码得到的稠密分布转换为带水印文本token的稀疏分布;(iii)一个专门用于提取签名的解码模块。此外,我们引入了优化的束搜索(beam search)算法,以保证生成内容的连贯性和一致性。REMARK-LLM经过严格训练,在保证水印可被有效检出的同时,尽量保持带水印内容的语义完整性。在多个未见过的数据集上的大量评估表明,与先前工作相比,REMARK-LLM能够在同样的文本中嵌入多达2倍的签名比特,同时保持语义完整性,并且对多种水印检测与移除攻击表现出更强的鲁棒性。
GRI: Graph-based Relative Isomorphism of Word Embedding Spaces
paper_authors: Muhammad Asif Ali, Yan Hu, Jianbin Qin, Di Wang
for: automatic construction of bilingual dictionaries using monolingual embedding spaces
methods: combines distributional training objectives with attentive graph convolutions to consider the impact of semantically similar words
results: outperforms existing research by improving the average P@1 by up to 63.6%Abstract
Automated construction of bilingual dictionaries using monolingual embedding spaces is a core challenge in machine translation. The end performance of these dictionaries relies upon the geometric similarity of individual spaces, i.e., their degree of isomorphism. Existing attempts aimed at controlling the relative isomorphism of different spaces fail to incorporate the impact of semantically related words in the training objective. To address this, we propose GRI that combines the distributional training objectives with attentive graph convolutions to unanimously consider the impact of semantically similar words required to define/compute the relative isomorphism of multiple spaces. Experimental evaluation shows that GRI outperforms the existing research by improving the average P@1 by a relative score of up to 63.6%. We release the codes for GRI at https://github.com/asif6827/GRI.
摘要
自动化建立双语词典使用单语空间的嵌入是机器翻译的核心挑战。这些词典的性能取决于各个空间的几何相似性,即他们的相对几何同构性。现有的尝试都没有考虑semantic关联的影响,即在训练目标中考虑相似的单词。为解决这个问题,我们提出了GRI,它将分布式训练目标与注意力 Graph Convolutions 结合,同时考虑多个空间中相似的单词,以统一评估多个空间的相对几何同构性。实验表明,GRI可以提高平均P@1的表现,相比现有研究提高63.6%。我们在github上分享了GRI代码,可以在https://github.com/asif6827/GRI中下载。
results: 实现了一个高效的 kNN-MT 框架,可以快速构建大规模的 datastore,并在 WMT’19 German-to-English 翻译任务中实现了相当的提升。Abstract
k-nearest-neighbor machine translation (kNN-MT) boosts the translation quality of a pre-trained neural machine translation (NMT) model by utilizing translation examples during decoding. Translation examples are stored in a vector database, called a datastore, which contains one entry for each target token from the parallel data it is made from. Due to its size, it is computationally expensive both to construct and to retrieve examples from the datastore. In this paper, we present an efficient and extensible kNN-MT framework, knn-seq, for researchers and developers that is carefully designed to run efficiently, even with a billion-scale large datastore. knn-seq is developed as a plug-in on fairseq and easy to switch models and kNN indexes. Experimental results show that our implemented kNN-MT achieves a comparable gain to the original kNN-MT, and the billion-scale datastore construction took 2.21 hours in the WMT'19 German-to-English translation task. We publish our knn-seq as an MIT-licensed open-source project and the code is available on https://github.com/naist-nlp/knn-seq . The demo video is available on https://youtu.be/zTDzEOq80m0 .
摘要
k近邻机器翻译(kNN-MT)通过在解码时利用翻译示例,提升预训练神经机器翻译(NMT)模型的翻译质量。翻译示例存储在一个称为datastore的向量数据库中,平行数据中的每个目标token对应其中一个条目。由于规模庞大,构建datastore以及从中检索示例的计算开销都很大。在这篇论文中,我们为研究人员和开发者提出了一个高效且可扩展的kNN-MT框架knn-seq,它经过精心设计,即使面对十亿级规模的datastore也能高效运行。knn-seq以fairseq插件的形式开发,可以方便地切换模型和kNN索引。实验结果表明,我们实现的kNN-MT取得了与原始kNN-MT相当的提升,并且在WMT'19德语到英语翻译任务中,构建十亿级datastore仅耗时2.21小时。我们以MIT许可证开源发布knn-seq,代码可在 https://github.com/naist-nlp/knn-seq 获取,演示视频见 https://youtu.be/zTDzEOq80m0 。
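The core kNN-MT decoding step that the datastore serves can be summarized as interpolating the NMT distribution with a distance-weighted distribution over retrieved target tokens. The sketch below follows that standard formulation with brute-force L2 search and made-up hyperparameters; knn-seq itself plugs efficient kNN indexes into fairseq rather than scanning the datastore.

```python
import numpy as np

def knn_mt_distribution(nmt_probs, query, datastore_keys, datastore_values,
                        k=4, temperature=10.0, lam=0.5):
    """Interpolate the NMT distribution with a kNN distribution from the datastore.
    datastore_keys: (N, d) decoder hidden states; datastore_values: (N,) target token ids."""
    dists = np.linalg.norm(datastore_keys - query, axis=1)          # L2 distance to every entry
    nn = np.argsort(dists)[:k]                                      # k nearest neighbors
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    knn_probs = np.zeros_like(nmt_probs)
    np.add.at(knn_probs, datastore_values[nn], weights)             # aggregate by target token
    return lam * knn_probs + (1.0 - lam) * nmt_probs                # final next-token distribution

vocab, d = 8, 4
nmt_probs = np.full(vocab, 1.0 / vocab)
keys = np.random.rand(100, d).astype("float32")
values = np.random.randint(0, vocab, size=100)
print(knn_mt_distribution(nmt_probs, keys[0], keys, values))
```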
LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following
paper_authors: Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, Kai-Wei Chang
for: 这种 paper 的目的是提高 Embodied Instruction Following 中的泛化能力,使 agents 能够在未看过的环境中更好地执行任务。
methods: 这种 paper 使用了 contrastive learning 和 meta-actions 来解决 Embodied Instruction Following 中的泛化问题。
results: compared to a strong multi-modal Transformer baseline, 这种方法 achieved a significant 4.5% absolute gain in success rate in unseen environments of ALFRED Embodied Instruction Following.I hope that helps! Let me know if you have any other questions.Abstract
End-to-end Transformers have demonstrated an impressive success rate for Embodied Instruction Following when the environment has been seen in training. However, they tend to struggle when deployed in an unseen environment. This lack of generalizability is due to the agent's insensitivity to subtle changes in natural language instructions. To mitigate this issue, we propose explicitly aligning the agent's hidden states with the instructions via contrastive learning. Nevertheless, the semantic gap between high-level language instructions and the agent's low-level action space remains an obstacle. Therefore, we further introduce a novel concept of meta-actions to bridge the gap. Meta-actions are ubiquitous action patterns that can be parsed from the original action sequence. These patterns represent higher-level semantics that are intuitively aligned closer to the instructions. When meta-actions are applied as additional training signals, the agent generalizes better to unseen environments. Compared to a strong multi-modal Transformer baseline, we achieve a significant 4.5% absolute gain in success rate in unseen environments of ALFRED Embodied Instruction Following. Additional analysis shows that the contrastive objective and meta-actions are complementary in achieving the best results, and the resulting agent better aligns its states with corresponding instructions, making it more suitable for real-world embodied agents. The code is available at: https://github.com/joeyy5588/LACMA.
摘要
END-TO-END 转换器在训练中见过环境下的Embodied Instruction Following任务中表现出色,但在未经训练的环境下却表现不佳,这导致了模型的普适性受到限制。这种问题的原因在于模型对自然语言指令的敏感性不够,这使得模型在不同环境下无法适应。为了解决这个问题,我们提议通过对模型隐藏状态与指令进行对齐来提高模型的敏感性。然而,高级语言指令和模型的低级动作空间之间的差距仍然存在,这使得模型困难地将高级语言指令翻译成低级动作。为了解决这个问题,我们提出了一种新的概念——元动作。元动作是在原始动作序列中提取出的普适的动作模式,它们可以帮助模型更好地理解高级语言指令的含义。当元动作作为训练信号时,模型在未经训练的环境下的总成功率得到了显著的提高。相比于一个强大的多Modal Transformer参考点,我们在未经训练的ALFRED Embodied Instruction Following任务中实现了4.5%的绝对提升。更进一步的分析表明,对比于对照学习和元动作的融合,我们的方法更好地实现了模型与指令之间的对齐,使得模型更适合实际的具体体现agent。代码可以在以下链接获取:https://github.com/joeyy5588/LACMA。
Measuring Pointwise $\mathcal{V}$-Usable Information In-Context-ly
results: 我们进行了一项广泛的实验分析,以评估in-context PVI 的可靠性。我们的发现表明,in-context PVI 估计值具有类似的特性于原始 PVI。具体地说,在听 Context 设置下,in-context PVI 估计值具有稳定的特性,不受不同的示例选择和射击数的影响。此外,我们还示了如何使用 in-context PVI 来标识困难的实例。这篇论文强调了 in-context PVI 的潜在价值和ICL的可能性。Abstract
In-context learning (ICL) is a new learning paradigm that has gained popularity along with the development of large language models. In this work, we adapt a recently proposed hardness metric, pointwise $\mathcal{V}$-usable information (PVI), to an in-context version (in-context PVI). Compared to the original PVI, in-context PVI is more efficient in that it requires only a few exemplars and does not require fine-tuning. We conducted a comprehensive empirical analysis to evaluate the reliability of in-context PVI. Our findings indicate that in-context PVI estimates exhibit similar characteristics to the original PVI. Specific to the in-context setting, we show that in-context PVI estimates remain consistent across different exemplar selections and numbers of shots. The variance of in-context PVI estimates across different exemplar selections is insignificant, which suggests that in-context PVI are stable. Furthermore, we demonstrate how in-context PVI can be employed to identify challenging instances. Our work highlights the potential of in-context PVI and provides new insights into the capabilities of ICL.
摘要
上下文内学习(ICL)是随着大型语言模型发展而兴起的一种新学习范式。在这项工作中,我们将最近提出的难度度量,即点态V可用信息(PVI),适配为上下文内版本(in-context PVI)。与原始PVI相比,in-context PVI更高效:它只需要少量示例,且无需微调。我们进行了全面的实验分析,以评估in-context PVI的可靠性。我们的结果表明,in-context PVI的估计值表现出与原始PVI相似的特征。针对上下文内设定,我们进一步证明,in-context PVI的估计值在不同的示例选择和不同的shot数下保持一致;不同示例选择之间的估计方差可以忽略不计,说明in-context PVI是稳定的。此外,我们展示了如何利用in-context PVI识别困难实例。我们的工作凸显了in-context PVI的潜力,并为理解ICL的能力提供了新的见解。
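Once the with-input and without-input label probabilities are obtained from a prompted (not fine-tuned) model, the in-context PVI of an instance reduces to a log-probability difference, as sketched below. The specific log-probability values are hypothetical; in practice they come from scoring the gold label under two few-shot prompts.

```python
import math

def pointwise_v_usable_information(logprob_with_input: float,
                                   logprob_without_input: float) -> float:
    """PVI(x -> y) = -log2 p(y | null prompt) + log2 p(y | x-conditioned prompt).
    In the in-context variant, both probabilities come from the same frozen LLM
    prompted with a few exemplars; no fine-tuning is involved."""
    return (logprob_with_input - logprob_without_input) / math.log(2)

# Hypothetical natural-log probabilities of the gold label "positive" returned by an LLM:
lp_with_x = -0.3     # prompt contains the exemplars AND the input text
lp_without_x = -1.1  # prompt contains the exemplars but a null / empty input
print(pointwise_v_usable_information(lp_with_x, lp_without_x))  # higher = easier instance
```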
Direct Neural Machine Translation with Task-level Mixture of Experts models
methods: 论文提出了多种方法来解决直接NMT系统的限制,包括多语言NMT和中间语言NMT(通过英语翻译)。它们还提出了Task-level Mixture of expert models(任务级混合专家模型),一种基于Transformer模型的推理效率优化方法。
results: 论文表明,Task-level MoE-based direct NMT系统在大量低资源和高资源irect对的翻译任务上表现出色,并且在7种语言对上超过了双语和中间语言NMT模型。Abstract
Direct neural machine translation (direct NMT) is a type of NMT system that translates text between two non-English languages. Direct NMT systems often face limitations due to the scarcity of parallel data between non-English language pairs. Several approaches have been proposed to address this limitation, such as multilingual NMT and pivot NMT (translation between two languages via English). Task-level Mixture of expert models (Task-level MoE), an inference-efficient variation of Transformer-based models, has shown promising NMT performance for a large number of language pairs. In Task-level MoE, different language groups can use different routing strategies to optimize cross-lingual learning and inference speed. In this work, we examine Task-level MoE's applicability in direct NMT and propose a series of high-performing training and evaluation configurations, through which Task-level MoE-based direct NMT systems outperform bilingual and pivot-based models for a large number of low and high-resource direct pairs, and translation directions. Our Task-level MoE with 16 experts outperforms bilingual NMT, Pivot NMT models for 7 language pairs, while pivot-based models still performed better in 9 pairs and directions.
摘要
直接神经机器翻译(direct NMT)是在两种非英语语言之间进行翻译的NMT系统。由于非英语语言对之间的平行数据稀缺,直接NMT系统往往受到限制。为解决这一问题,已有多种方案被提出,例如多语言NMT和以英语为枢轴的中转NMT(pivot NMT)。任务级专家混合模型(Task-level MoE)是基于Transformer模型的一种推理高效变体,已在大量语言对上展现出可观的NMT性能。在Task-level MoE中,不同的语言组可以采用不同的路由策略,以优化跨语言学习和推理速度。在这项工作中,我们考察了Task-level MoE在直接NMT中的适用性,并提出了一系列高性能的训练与评估配置。借助这些配置,基于Task-level MoE的直接NMT系统在大量低资源和高资源的直接语言对及翻译方向上超越了双语模型和基于枢轴的模型。我们具有16个专家的Task-level MoE在7个语言对上优于双语NMT和枢轴NMT模型,而基于枢轴的模型在另外9个语言对和方向上仍然表现更好。
Understanding Retrieval Augmentation for Long-Form Question Answering
results: 研究发现,使用不同检索文档集可以影响LM的回答生成质量。此外,研究还发现了长文本生成中的归因模式,以及LM的归因错误的主要原因。Abstract
We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs, by comparing answers generated from models while using the same evidence documents, and how differing quality of retrieval document set impacts the answers generated from the same LM. We study various attributes of generated answers (e.g., fluency, length, variance) with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. Our study provides new insights on how retrieval augmentation impacts long, knowledge-rich text generation of LMs. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provide directions for future work.
摘要
我们提出了一项研究,探讨 Retrieval-augmented 语言模型(LM)在长问答中的表现。我们分析了不同LM在使用同一份证据文档时的响应,以及不同证据文档集的质量如何影响LM生成的答案。我们研究了各种答案特征(如流畅度、长度、变化程度),强调在上下文文档中归因生成的长文答案。我们收集了人类标注答案归因的数据,并评估了自动判断归因的方法。我们的研究提供了新的认知,揭示了 Retrieval-augmented 语言模型在长知识含量文本生成中的影响,以及长文生成中的归因模式和错误的主要原因。这些分析结果为未来工作提供了方向。
Simple Mechanisms for Representing, Indexing and Manipulating Concepts
results: 该方法可以在不同的概念之间找到共同主题,并可以用来建立一个概念字典,以便将输入数据正确地归类到相关的概念中。Abstract
Deep networks typically learn concepts via classifiers, which involves setting up a model and training it via gradient descent to fit the concept-labeled data. We will argue instead that learning a concept could be done by looking at its moment statistics matrix to generate a concrete representation or signature of that concept. These signatures can be used to discover structure across the set of concepts and could recursively produce higher-level concepts by learning this structure from those signatures. When the concepts are `intersected', signatures of the concepts can be used to find a common theme across a number of related `intersected' concepts. This process could be used to keep a dictionary of concepts so that inputs could correctly identify and be routed to the set of concepts involved in the (latent) generation of the input.
摘要
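A literal, minimal reading of a moment-statistics signature is sketched below: each concept is summarized by the mean and second-moment matrix of its example embeddings, and signatures can then be compared (or combined across 'intersected' concepts) to look for shared structure. The similarity function and toy embeddings are assumptions for illustration.

```python
import numpy as np

def concept_signature(embeddings: np.ndarray):
    """Return (mean, second-moment matrix) of a concept's example embeddings.
    A concrete, fixed-size signature that can be stored in a concept dictionary."""
    mu = embeddings.mean(axis=0)
    second_moment = embeddings.T @ embeddings / len(embeddings)   # d x d moment matrix
    return mu, second_moment

def signature_similarity(sig_a, sig_b):
    """Compare two concepts by the overlap of their dominant moment directions."""
    _, ma = sig_a
    _, mb = sig_b
    return float(np.trace(ma @ mb) / (np.linalg.norm(ma) * np.linalg.norm(mb)))

d = 16
cats = np.random.randn(200, d) + 1.0      # hypothetical embeddings of "cat" examples
dogs = np.random.randn(200, d) + 0.8      # hypothetical embeddings of "dog" examples
print(signature_similarity(concept_signature(cats), concept_signature(dogs)))
```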
Pseudointelligence: A Unifying Framework for Language Model Evaluation
results: 可以用来评估语言模型的两个案例研究以及现有评估方法的分析Abstract
With large language models surpassing human performance on an increasing number of benchmarks, we must take a principled approach for targeted evaluation of model capabilities. Inspired by pseudorandomness, we propose pseudointelligence, which captures the maxim that "(perceived) intelligence lies in the eye of the beholder". That is, that claims of intelligence are meaningful only when their evaluator is taken into account. Concretely, we propose a complexity-theoretic framework of model evaluation cast as a dynamic interaction between a model and a learned evaluator. We demonstrate that this framework can be used to reason about two case studies in language model evaluation, as well as analyze existing evaluation methods.
摘要
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation
results: 研究发现,IFT 模型默认将 male-inflected 翻译作为结果,甚至会忽略女性职业 gender 标签。此外,研究还发现模型在错误翻译中忽略 masculine 和 feminine pronoun 的问题。基于这些发现,研究提出了一种简单、有效的偏见 Mitigation 解决方案,通过 few-shot learning 实现了更加公平的翻译结果。Abstract
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
摘要
近期经过指令微调的模型在收到相应提示时可以完成多种NLP任务,机器翻译(MT)是其中一个突出的应用场景。然而,当前研究往往只关注标准的性能基准,而将重要的公平性与伦理考量置于次要位置。在机器翻译中,这可能导致性别误用的译文,进而造成刻板印象和偏见的延续等危害。在这项工作中,我们填补了这一空白,研究此类模型在机器翻译中是否存在性别偏见、偏见程度如何,以及如何缓解。具体来说,我们在WinoMT语料上计算了英语到德语和西班牙语翻译中既定的性别偏见指标。我们发现,指令微调(IFT)模型默认生成男性化(male-inflected)的译文,甚至无视带有女性职业刻板印象的语境。随后,借助可解释性方法,我们揭示了模型在产生性别误用译文时,会系统性地忽略指明目标职业性别的代词。最后,基于这一发现,我们提出了一种易于实现且有效的基于少样本学习的偏见缓解方案,显著提升了译文的公平性。
Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers
results: 实现10%的提高精度在CFQ和COGS数据集上,无需hyperparameter tuningAbstract
Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harnessing the power of dataset cartography (Swayamdipta et al., 2020). By strategically identifying a subset of compositional generalization data using this approach, we achieve a remarkable improvement in model accuracy, yielding enhancements of up to 10% on CFQ and COGS datasets. Notably, our technique incorporates dataset cartography as a curriculum learning criterion, eliminating the need for hyperparameter tuning while consistently achieving superior performance. Our findings highlight the untapped potential of dataset cartography in unleashing the full capabilities of compositional generalization within Transformer models. Our code is available at https://github.com/cyberiada/cartography-for-compositionality.
摘要
神经网络已经彻底改变了语言建模,并在各种下游任务中表现出色。然而,这些模型能否达到与人类认知能力相当的组合泛化(compositional generalization)水平仍存在争议。现有方法主要集中在新的架构和替代学习范式上,而我们提出了一种利用 dataset cartography(Swayamdipta et al., 2020)的新方法。通过有策略地选取组合泛化数据的子集,我们显著提升了模型精度,在 CFQ 和 COGS 数据集上最高提升达 10%。值得注意的是,我们的技术将 dataset cartography 作为课程学习(curriculum learning)的准则,无需进行超参数调整即可持续取得更优性能。我们的发现表明,dataset cartography 在释放 Transformer 模型的组合泛化能力方面具有尚未充分挖掘的潜力。代码见:https://github.com/cyberiada/cartography-for-compositionality。
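A hedged sketch of the dataset-cartography statistics the approach builds on: per-example confidence and variability computed from the gold-label probability across training epochs, here on synthetic numbers. The specific cutoffs and the two-stage curriculum below are illustrative assumptions, not the paper's exact recipe.

```python
# Toy dataset cartography: rank examples by training dynamics, then build a curriculum.
import numpy as np

rng = np.random.default_rng(0)
# gold_probs[e, i] = model probability of the gold label of example i at epoch e (synthetic here)
gold_probs = rng.uniform(size=(5, 1000))       # 5 epochs, 1000 examples

confidence = gold_probs.mean(axis=0)           # high = easy-to-learn
variability = gold_probs.std(axis=0)           # high = ambiguous

# Illustrative curriculum: start with easy-to-learn examples, then add ambiguous ones.
easy = np.argsort(-confidence)[:300]
ambiguous = np.argsort(-variability)[:300]
curriculum = [easy, np.concatenate([easy, ambiguous])]
for stage, idx in enumerate(curriculum):
    print(f"stage {stage}: training on {len(np.unique(idx))} examples")
```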
On the Benefit of Generative Foundation Models for Human Activity Recognition
paper_authors: Zikang Leng, Hyeokhyen Kwon, Thomas Plötz
for: solves the problem of limited annotated data in human activity recognition (HAR) by using generative AI to autonomously generate virtual IMU data from text descriptions.
methods: uses Large Language Models (LLMs) and motion synthesis models to generate virtual IMU data.
results: identifies several promising research pathways that could benefit from generative AI in HAR, including generating benchmark datasets, developing foundational models specific to HAR, exploring hierarchical structures within HAR, breaking down complex activities, and applications in health sensing and activity summarization.
for: 解决人体活动识别(HAR)中数据稀缺问题,使用生成AI自动生成文本描述IMU数据。
methods: 使用大型语言模型(LLMs)和运动合成模型生成IMU数据。
results: 找到了生成AI在HAR中的许多有优势的研究方向,包括生成数据集、开发特有于HAR的基础模型、阶段分解复杂活动、应用于健康感知和活动概要。Abstract
In human activity recognition (HAR), the limited availability of annotated data presents a significant challenge. Drawing inspiration from the latest advancements in generative AI, including Large Language Models (LLMs) and motion synthesis models, we believe that generative AI can address this data scarcity by autonomously generating virtual IMU data from text descriptions. Beyond this, we spotlight several promising research pathways that could benefit from generative AI for the community, including the generating benchmark datasets, the development of foundational models specific to HAR, the exploration of hierarchical structures within HAR, breaking down complex activities, and applications in health sensing and activity summarization.
摘要
人类活动识别(HAR)中,标注数据的匮乏是一大挑战。受生成式 AI 最新进展(包括大型语言模型(LLM)和运动合成模型)的启发,我们认为生成式 AI 可以通过从文本描述自动生成虚拟 IMU 数据来缓解这种数据稀缺问题。此外,我们还指出了若干有望受益于生成式 AI 的研究方向,包括生成基准数据集、开发面向 HAR 的基础模型、探索 HAR 的层次结构、分解复杂活动,以及在健康感知和活动摘要方面的应用。
Towards Safer Operations: An Expert-involved Dataset of High-Pressure Gas Incidents for Preventing Future Failures
results: 初步的结果表明,NLP技术可以有效地分析事故报告,以预防未来的失败。 dataset 可以促进未来的研究在 NLP 和事故管理领域。 dataset 的访问也提供(IncidentAI dataset 可以在:https://github.com/Cinnamon/incident-ai-dataset 中找到)。Abstract
This paper introduces a new IncidentAI dataset for safety prevention. Different from prior corpora that usually contain a single task, our dataset comprises three tasks: named entity recognition, cause-effect extraction, and information retrieval. The dataset is annotated by domain experts who have at least six years of practical experience as high-pressure gas conservation managers. We validate the contribution of the dataset in the scenario of safety prevention. Preliminary results on the three tasks show that NLP techniques are beneficial for analyzing incident reports to prevent future failures. The dataset facilitates future research in NLP and incident management communities. The access to the dataset is also provided (the IncidentAI dataset is available at: https://github.com/Cinnamon/incident-ai-dataset).
摘要
这篇论文介绍了一个新的 IncidentAI 数据集,用于安全预防。与以往通常只包含单一任务的语料不同,我们的数据集包含三个任务:命名实体识别、因果关系抽取和信息检索。数据集由具有至少六年高压气体保安管理实践经验的领域专家标注。我们在安全预防场景中验证了该数据集的贡献。三个任务上的初步结果显示,NLP 技术有助于分析事故报告以预防未来的失败。该数据集将促进 NLP 和事故管理领域的后续研究。数据集的获取方式也已提供(IncidentAI 数据集可在 https://github.com/Cinnamon/incident-ai-dataset 获取)。
SPEED: Speculative Pipelined Execution for Efficient Decoding
results: 在采用参数共享的 Transformer 解码器中,通过摊销内存操作的开销来提升生成式 LLM 的推理效率,并在模型准确率与延迟之间取得平衡。Abstract
Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
摘要
基于 Transformer 架构的生成式大语言模型(LLM)最近已成为众多自然语言处理任务的主导基础模型。然而,由于推理延迟显著,它们在实时场景中的应用仍然受到很大限制。这一点在生成式 LLM 的自回归推理中尤为突出:由于每个 token 都依赖于之前的所有输出 token,token 只能顺序生成,因此难以实现 token 级别的并行,推理也因而受内存带宽严重制约。在这项工作中,我们提出了 SPEED,它利用基于早期层隐藏状态预测的值,在处理当前 token 的同时推测性地并行执行多个未来 token,从而提升推理效率。对于采用参数共享的 Transformer 解码器,并行执行的这些 token 的内存操作可以被摊销,这使我们能够加速生成式 LLM 的推理。我们以模型精度与延迟的权衡来展示该方法的效率,并说明推测执行如何使得在参数共享下训练更深的解码器只带来极小的运行时开销。
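The control flow behind speculative execution can be sketched as follows: draft several future tokens cheaply, verify them with one parallel pass of the full model, and keep the longest verified prefix. This is only an illustration of the general speculative idea with toy stand-ins for the models; it does not reproduce SPEED's early-layer hidden-state predictors or its parameter-sharing memory amortization.

```python
# Toy speculative step: draft k tokens with a cheap predictor, verify in one parallel pass.
import torch

def early_exit_guess(prefix, k, cheap_lm):
    """Draft k tokens greedily with a cheap predictor (an assumption in this sketch)."""
    draft, ids = [], prefix.clone()
    for _ in range(k):
        nxt = cheap_lm(ids).argmax(-1)[..., -1:]
        draft.append(nxt)
        ids = torch.cat([ids, nxt], dim=-1)
    return torch.cat(draft, dim=-1)

def speculative_step(prefix, k, cheap_lm, full_lm):
    draft = early_exit_guess(prefix, k, cheap_lm)
    candidate = torch.cat([prefix, draft], dim=-1)
    # One parallel pass of the full model scores every draft position at once.
    full_pred = full_lm(candidate).argmax(-1)
    accepted = 0
    for i in range(k):
        if full_pred[..., prefix.size(-1) - 1 + i].item() == draft[..., i].item():
            accepted += 1
        else:
            break
    # Keep the verified tokens plus the full model's correction at the first mismatch.
    correction = full_pred[..., prefix.size(-1) - 1 + accepted : prefix.size(-1) + accepted]
    return torch.cat([prefix, draft[..., :accepted], correction], dim=-1)

# Toy "LMs": next-token logits over a vocabulary of 10, implemented as table lookups.
torch.manual_seed(0)
vocab = 10
emb = torch.randn(vocab, vocab)
cheap = lambda ids: emb[ids]                                   # (..., seq, vocab)
full = lambda ids: emb[ids] + 0.01 * torch.randn(ids.size(-1), vocab)
prefix = torch.tensor([[1, 2, 3]])
print(speculative_step(prefix, k=4, cheap_lm=cheap, full_lm=full))
```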
Code Book for the Annotation of Diverse Cross-Document Coreference of Entities in News Articles
results: 这篇论文的主要贡献是提供了一种多层次注释方法,可以应用于媒体偏见分析中的词汇选择和标签。Abstract
This paper presents a scheme for annotating coreference across news articles, extending beyond traditional identity relations by also considering near-identity and bridging relations. It includes a precise description of how to set up Inception, a respective annotation tool, how to annotate entities in news articles, connect them with diverse coreferential relations, and link them across documents to Wikidata's global knowledge graph. This multi-layered annotation approach is discussed in the context of the problem of media bias. Our main contribution lies in providing a methodology for creating a diverse cross-document coreference corpus which can be applied to the analysis of media bias by word-choice and labelling.
摘要
Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education
results: 研究发现,这些LLM模型在越南语MCQA任务中具有扎实的MCSB能力,特别是在零shot和一shot设置下。Abstract
In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, with fewer challenging MCQA datasets than in English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has focused on the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly focused on how ChatGPT solves the VNHSGE step by step. We aim to create a novel and high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict LaTeX style. We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
摘要
在这篇论文中,我们评估了大语言模型(LLM)在零批、一批和几批设置下的多选符号绑定(MCSB)能力,用于多选问答(MCQA)任务。我们关注越南语言,因为越南语言MCQA数据集比英语更少。我们研究的两个现有数据集是 ViMMRC 1.0 和 ViMMRC 2.0,它们都是文学类。近期的越南语言自然语言处理(NLP)研究主要集中在评估 ChatGPT,但是这些研究主要集中在 ChatGPT 如何解决越南语言高中毕业考试(VNHSGE)。我们希望创建一个新的高质量数据集,提供了 LaTeX 格式的结构化指南,以便用于评估 LLM 和更小的语言模型(LM)的 MCSB 能力。我们的评估结果显示,六种著名的 LLM 在 ViMMRC 1.0 和 ViMMRC 2.0 标准和我们提议的数据集上表现出了预期的 MCSB 能力。数据集仅用于研究目的。
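A minimal sketch, under assumptions, of how multiple-choice symbol binding is usually scored: the question and options go into the prompt, and the model's next-token probabilities for "A", "B", "C" and "D" are compared. The gpt2 checkpoint and the toy question are placeholders; the paper evaluates much larger models on Vietnamese exam questions.

```python
# Score the answer letters A-D by next-token log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Question: 2 + 2 = ?\n"
          "A. 3\nB. 4\nC. 5\nD. 22\n"
          "Answer:")
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # next-token logits

choice_ids = [tok(" " + c, add_special_tokens=False)["input_ids"][0] for c in "ABCD"]
scores = logits[choice_ids]
print("predicted:", "ABCD"[scores.argmax().item()])
```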
Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models
results: 这篇论文使用CGCoT方法和大语言模型(LLM)对 Twitter 上的情感语言进行了扩展,并证明了该方法可以生成与人类评价相符的投票结果,而且不需要大量标注数据。Abstract
Existing text scaling methods often require a large corpus, struggle with short texts, or require labeled data. We develop a text scaling method that leverages the pattern recognition capabilities of generative large language models (LLMs). Specifically, we propose concept-guided chain-of-thought (CGCoT), which uses prompts designed to summarize ideas and identify target parties in texts to generate concept-specific breakdowns, in many ways similar to guidance for human coder content analysis. CGCoT effectively shifts pairwise text comparisons from a reasoning problem to a pattern recognition problem. We then pairwise compare concept-specific breakdowns using an LLM. We use the results of these pairwise comparisons to estimate a scale using the Bradley-Terry model. We use this approach to scale affective speech on Twitter. Our measures correlate more strongly with human judgments than alternative approaches like Wordfish. Besides a small set of pilot data to develop the CGCoT prompts, our measures require no additional labeled data and produce binary predictions comparable to a RoBERTa-Large model fine-tuned on thousands of human-labeled tweets. We demonstrate how combining substantive knowledge with LLMs can create state-of-the-art measures of abstract concepts.
摘要
现有的文本定标(text scaling)方法通常需要大规模语料、难以处理短文本,或依赖标注数据。我们开发了一种利用生成式大语言模型(LLM)模式识别能力的文本定标方法。具体来说,我们提出了概念引导的思维链(CGCoT),它使用旨在概括观点并识别文本中目标对象的提示,生成特定概念的拆解,这在许多方面类似于人工编码内容分析的指南。CGCoT 有效地将文本两两比较从推理问题转化为模式识别问题。随后,我们使用 LLM 对特定概念的拆解进行两两比较,并利用这些比较结果,通过 Bradley-Terry 模型估计一个量表。我们用该方法对 Twitter 上的情感性言论进行定标。与 Wordfish 等替代方法相比,我们的度量与人类判断的相关性更强。除了用于设计 CGCoT 提示的少量试点数据外,我们的度量不需要额外的标注数据,其二元预测可与在数千条人工标注推文上微调的 RoBERTa-Large 模型相媲美。我们展示了如何将领域知识与 LLM 相结合,构建针对抽象概念的最先进度量。
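The scaling step after the pairwise comparisons can be illustrated with a Bradley-Terry fit. The sketch below generates synthetic comparison outcomes and recovers item scores with the standard minorization-maximization update; the CGCoT prompting that produces the comparisons is not shown, and all data here are synthetic.

```python
# Fit Bradley-Terry strengths from pairwise "which text is more affective?" outcomes.
import numpy as np

n_items = 5
rng = np.random.default_rng(1)
true_scores = np.linspace(-1.0, 1.0, n_items)      # hidden levels used to simulate judgements
wins = np.zeros((n_items, n_items))                # wins[i, j] = times item i beat item j
for _ in range(2000):
    i, j = rng.choice(n_items, size=2, replace=False)
    p_i = 1.0 / (1.0 + np.exp(-(true_scores[i] - true_scores[j])))
    if rng.random() < p_i:
        wins[i, j] += 1
    else:
        wins[j, i] += 1

# Standard minorization-maximization updates for Bradley-Terry strengths.
gamma = np.ones(n_items)
n_comp = wins + wins.T
total_wins = wins.sum(axis=1)
for _ in range(100):
    denom = (n_comp / (gamma[:, None] + gamma[None, :])).sum(axis=1)
    gamma = total_wins / denom
    gamma /= gamma.sum()
theta = np.log(gamma)
print(np.round(theta - theta.mean(), 2))           # recovered scale, up to an additive shift
```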
CORE: A Few-Shot Company Relation Classification Dataset for Robust Domain Adaptation
results: 实验结果表明,当前的 RC 模型在 CORE dataset 上 exhibits substantial performance gaps, 并且模型在不同的领域上适应性不高。 however, 模型在 CORE 上训练显示出了改善的 out-of-domain 性能, 这表明高质量数据的重要性 для robust domain adaptation。Abstract
We introduce CORE, a dataset for few-shot relation classification (RC) focused on company relations and business entities. CORE includes 4,708 instances of 12 relation types with corresponding textual evidence extracted from company Wikipedia pages. Company names and business entities pose a challenge for few-shot RC models due to the rich and diverse information associated with them. For example, a company name may represent the legal entity, products, people, or business divisions depending on the context. Therefore, deriving the relation type between entities is highly dependent on textual context. To evaluate the performance of state-of-the-art RC models on the CORE dataset, we conduct experiments in the few-shot domain adaptation setting. Our results reveal substantial performance gaps, confirming that models trained on different domains struggle to adapt to CORE. Interestingly, we find that models trained on CORE showcase improved out-of-domain performance, which highlights the importance of high-quality data for robust domain adaptation. Specifically, the information richness embedded in business entities allows models to focus on contextual nuances, reducing their reliance on superficial clues such as relation-specific verbs. In addition to the dataset, we provide relevant code snippets to facilitate reproducibility and encourage further research in the field.
摘要
我们介绍了 CORE,这是一个聚焦公司关系与商业实体的少样本关系分类(RC)数据集。CORE 包含 12 种关系类型的 4,708 个实例,以及从公司 Wikipedia 页面中提取的相应文本证据。公司名称和商业实体所关联的信息丰富而多样,这给少样本 RC 模型带来了挑战:例如,公司名称在不同上下文中可能指代法律实体、产品、人员或业务部门,因此实体之间关系类型的判断高度依赖文本上下文。为了评估最新 RC 模型在 CORE 数据集上的性能,我们在少样本领域适应设置下进行了实验。结果显示存在明显的性能差距,证实在其他领域训练的模型难以适应 CORE。有趣的是,我们发现在 CORE 上训练的模型在域外表现反而有所提升,这凸显了高质量数据对稳健领域适应的重要性。具体而言,商业实体所蕴含的丰富信息使模型更关注上下文细节,减少了对关系特定动词等表面线索的依赖。除数据集外,我们还提供了相关代码片段,以便复现并鼓励该领域的进一步研究。
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation
results: 实验结果显示,现有的两种方法在一些任务上都显示出问题,这表明长期 TABLETOP 推理任务仍然是现代具有问题。Abstract
The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, \textit{LoHoRavens}, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. Furthermore, there is a key modality bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate the observation feedback during robot execution for the LLM's closed-loop planning, which is however less studied by prior work. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating long-horizon manipulation tasks are still challenging for current popular models. We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.
摘要
Gold: A Global and Local-aware Denoising Framework for Commonsense Knowledge Graph Noise Detection
methods: incorporates entity semantic information, global rules, and local structural information from the CSKG
results: outperforms all baseline methods in noise detection tasks on synthetic noisy CSKG benchmarks, and benefits the downstream zero-shot commonsense question-answering task on a real-world CSKGAbstract
Commonsense Knowledge Graphs (CSKGs) are crucial for commonsense reasoning, yet constructing them through human annotations can be costly. As a result, various automatic methods have been proposed to construct CSKG with larger semantic coverage. However, these unsupervised approaches introduce spurious noise that can lower the quality of the resulting CSKG, which cannot be tackled easily by existing denoising algorithms due to the unique characteristics of nodes and structures in CSKGs. To address this issue, we propose Gold (Global and Local-aware Denoising), a denoising framework for CSKGs that incorporates entity semantic information, global rules, and local structural information from the CSKG. Experiment results demonstrate that Gold outperforms all baseline methods in noise detection tasks on synthetic noisy CSKG benchmarks. Furthermore, we show that denoising a real-world CSKG is effective and even benefits the downstream zero-shot commonsense question-answering task.
摘要
常识知识图(CSKG)对常识推理至关重要,但通过人工标注构建 CSKG 成本高昂。因此,人们提出了多种自动方法来构建语义覆盖更广的 CSKG。然而,这些无监督方法会引入虚假噪声,降低所得 CSKG 的质量;由于 CSKG 中节点和结构的特殊性,现有的去噪算法难以直接处理这一问题。为此,我们提出了 Gold(Global and Local-aware Denoising),一种面向 CSKG 的去噪框架,它综合利用实体语义信息、全局规则以及 CSKG 的局部结构信息。实验结果表明,在合成噪声 CSKG 基准上的噪声检测任务中,Gold 优于所有基线方法。此外,我们还证明对真实世界的 CSKG 进行去噪是有效的,甚至对下游的零样本常识问答任务有益。
From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers
for: investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and multiplication.
methods: through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. we show that transformer models are able to generalize to long lengths with the help of targeted attention biasing.
results: we demonstrate that using ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.Abstract
Since its introduction, the transformer model has demonstrated outstanding performance across various tasks. However, there are still unresolved issues regarding length generalization, particularly in algorithmic tasks. In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and multiplication. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we link to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.
摘要
自提出以来,Transformer 模型在各类任务中表现出色。然而,长度泛化问题仍未解决,在算法类任务中尤为明显。在这篇论文中,我们研究了 Transformer 模型学习加法和乘法等算术算法的内在能力。通过实验和注意力分析,我们确定了实现最佳长度泛化的若干关键因素,并表明在有针对性的注意力偏置帮助下,Transformer 模型能够泛化到更长的长度。随后,我们引入了注意力偏置校准(Attention Bias Calibration,ABC),这是一个让模型自动学习合适注意力偏置的校准阶段,我们将其与相对位置编码的机制联系起来。我们展示了使用 ABC 后,Transformer 模型在某些算术任务上可以实现前所未有的完美长度泛化。
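A toy illustration of what targeted attention biasing can look like: an additive bias that depends only on relative distance is injected into the attention scores, so an alignment pattern learned on short sequences applies unchanged at longer lengths. The bias function below is an invented example in the spirit of attention bias calibration, not the paper's actual procedure.

```python
# Attention with an additive, relative-distance-dependent bias.
import torch
import torch.nn.functional as F

def biased_attention(q, k, v, rel_bias):
    # q, k, v: (seq, d); rel_bias maps a matrix of (i - j) offsets to additive biases
    seq, d = q.shape
    scores = q @ k.T / d ** 0.5
    idx = torch.arange(seq)
    bias = rel_bias(idx[:, None] - idx[None, :])
    return F.softmax(scores + bias, dim=-1) @ v

# Example bias: strongly favour the token exactly `offset` positions away,
# the kind of alignment useful for digit-wise arithmetic.
def aligned_bias(offset, strength=8.0):
    return lambda rel: torch.where(rel == offset, torch.tensor(strength), torch.tensor(0.0))

seq, d = 12, 16
q, k, v = torch.randn(seq, d), torch.randn(seq, d), torch.randn(seq, d)
out = biased_attention(q, k, v, aligned_bias(offset=4))
print(out.shape)  # the same bias function works unchanged for any sequence length
```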
Filling in the Gaps: Efficient Event Coreference Resolution using Graph Autoencoder Networks
results: 在大规模荷兰语事件共指语料库上,本方法在总分、效率和训练速度方面显著优于经典的提及对(mention-pair)方法。此外,我们的模型能够更好地识别较难的共指链接,并在低数据设置下表现出更高的稳健性。Abstract
We introduce a novel and efficient method for Event Coreference Resolution (ECR) applied to a lower-resourced language domain. By framing ECR as a graph reconstruction task, we are able to combine deep semantic embeddings with structural coreference chain knowledge to create a parameter-efficient family of Graph Autoencoder models (GAE). Our method significantly outperforms classical mention-pair methods on a large Dutch event coreference corpus in terms of overall score, efficiency and training speed. Additionally, we show that our models are consistently able to classify more difficult coreference links and are far more robust in low-data settings when compared to transformer-based mention-pair coreference algorithms.
摘要
我们提出了一种新颖且高效的事件共指消解(ECR)方法,并将其应用于低资源语言领域。通过将 ECR 构建为图重建任务,我们能够把深层语义嵌入与共指链的结构知识相结合,构建出一族参数高效的图自编码器(GAE)模型。在一个大规模荷兰语事件共指语料库上,我们的方法在总体得分、效率和训练速度方面都显著优于经典的提及对方法。此外,我们还表明,与基于 Transformer 的提及对共指算法相比,我们的模型能够更稳定地识别较难的共指链接,并且在低数据场景下更加稳健。
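A compact sketch of the graph-autoencoder idea behind this kind of model: mention embeddings are propagated over a partially observed coreference graph and links are reconstructed with an inner-product decoder. Dimensions, the single linear layer, and the toy data are illustrative assumptions; the paper's architecture and training setup differ.

```python
# Tiny graph autoencoder for link (coreference) reconstruction.
import torch
import torch.nn as nn

class TinyGAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.lin = nn.Linear(d_in, d_hidden)

    def encode(self, x, adj):
        # One propagation step over the (partially observed) coreference graph.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

    def decode(self, z):
        return torch.sigmoid(z @ z.T)          # predicted link probabilities

n_mentions, d = 8, 32
x = torch.randn(n_mentions, d)                  # semantic embeddings of event mentions
adj = torch.eye(n_mentions)                     # self-loops only ...
adj[0, 1] = adj[1, 0] = 1.0                     # ... plus one known coreference link

model = TinyGAE(d, 16)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    pred = model.decode(model.encode(x, adj))
    loss = nn.functional.binary_cross_entropy(pred, adj)
    opt.zero_grad(); loss.backward(); opt.step()
print("reconstruction loss:", loss.item())
```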
AMR Parsing with Causal Hierarchical Attention and Pointers
results: 实验表明,我们的模型在无额外数据的情况下,在四个benchmark中比基线模型表现出色,提高了性能。Abstract
Translation-based AMR parsers have recently gained popularity due to their simplicity and effectiveness. They predict linearized graphs as free texts, avoiding explicit structure modeling. However, this simplicity neglects structural locality in AMR graphs and introduces unnecessary tokens to represent coreferences. In this paper, we introduce new target forms of AMR parsing and a novel model, CHAP, which is equipped with causal hierarchical attention and the pointer mechanism, enabling the integration of structures into the Transformer decoder. We empirically explore various alternative modeling options. Experiments show that our model outperforms baseline models on four out of five benchmarks in the setting of no additional data.
摘要
基于翻译的 AMR 解析器因其简单和有效而在近年受到欢迎。它们将线性化后的图作为自由文本进行预测,避免了显式的结构建模。然而,这种简单性忽略了 AMR 图中的结构局部性,并引入了不必要的 token 来表示共指。在这篇论文中,我们提出了 AMR 解析的新目标形式以及一个新模型 CHAP,它配备了因果层次注意力和指针机制,使结构信息能够融入 Transformer 解码器。我们在实验中探索了多种替代建模方案。实验表明,在不使用额外数据的设置下,我们的模型在五个基准中的四个上优于基线模型。
Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences
results: 在中等规模数据集上优于其他高效注意力变体,在内存占用和准确率方面均表现更好。Abstract
Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $O(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}( \log n)$ levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$, depending on whether the queries are down-sampled or not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The Fast Multipole Attention mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
摘要
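The divide-and-conquer idea can be illustrated with a two-level simplification: each token attends exactly to its own block and only to pooled summaries of the other blocks. The real method uses O(log n) resolution levels with learned grouping weights; fixing one coarse level and mean pooling here is purely for illustration.

```python
# Two-level toy version of hierarchical (multipole-style) attention.
import torch
import torch.nn.functional as F

def two_level_attention(q, k, v, block=16):
    seq, d = q.shape
    nb = seq // block                              # assumes seq is divisible by block
    qb, kb, vb = (t.view(nb, block, d) for t in (q, k, v))
    # Coarse keys/values: one summary vector per block.
    k_coarse, v_coarse = kb.mean(dim=1), vb.mean(dim=1)
    outputs = []
    for b in range(nb):
        # Fine keys: this block's own tokens. Coarse keys: summaries of the other blocks.
        other_k = torch.cat([k_coarse[:b], k_coarse[b + 1:]], dim=0)
        other_v = torch.cat([v_coarse[:b], v_coarse[b + 1:]], dim=0)
        keys = torch.cat([kb[b], other_k], dim=0)
        vals = torch.cat([vb[b], other_v], dim=0)
        attn = F.softmax(qb[b] @ keys.T / d ** 0.5, dim=-1)
        outputs.append(attn @ vals)
    return torch.cat(outputs, dim=0)

seq, d = 64, 32
q, k, v = (torch.randn(seq, d) for _ in range(3))
print(two_level_attention(q, k, v).shape)          # torch.Size([64, 32])
```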
Emptying the Ocean with a Spoon: Should We Edit Models?
results: 研究发现,直接修改LLM模型不能被视为一种系统性的解决方案,而且可能会增加风险。在某些情况下,直接修改LLM模型可能会增加风险,而不是减少风险。Abstract
We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations. We contrast model editing with three similar but distinct approaches that pursue better defined objectives: (1) retrieval-based architectures, which decouple factual memory from inference and linguistic capabilities embodied in LLMs; (2) concept erasure methods, which aim at preventing systemic bias in generated text; and (3) attribution methods, which aim at grounding generations into identified textual sources. We argue that direct model editing cannot be trusted as a systematic remedy for the disadvantages inherent to LLMs, and while it has proven potential in improving model explainability, it opens risks by reinforcing the notion that models can be trusted for factuality. We call for cautious promotion and application of model editing as part of the LLM deployment process, and for responsibly limiting the use cases of LLMs to those not relying on editing as a critical component.
摘要
我们对近来流行的直接模型编辑方法提出质疑,即将其作为纠正 LLM 生成中事实性错误的手段。我们将模型编辑与三种相似但目标更明确的方法进行对比:(1)基于检索的架构,它将事实记忆与 LLM 中的推理及语言能力解耦;(2)概念擦除方法,旨在防止生成文本中的系统性偏见;(3)归因方法,旨在将生成内容落实到可识别的文本来源。我们认为,直接模型编辑不能被视为对 LLM 固有缺陷的系统性补救;尽管它在提升模型可解释性方面已展现潜力,但它强化了"模型在事实性上可以被信任"这一观念,从而带来风险。我们呼吁在 LLM 部署流程中审慎地推广和应用模型编辑,并负责任地将 LLM 的使用场景限制在不以编辑作为关键组件的情形。
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
results: 提供了一个自动化的音乐处理系统,让用户可以快速地找到适合他们需求的音乐工具,减少了用户对音乐处理技术的学习压力,让用户更能专注于音乐创作。Abstract
AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data and the model applicability across platforms among various tasks. Consequently, it is necessary to build a system to organize and integrate these tasks, and thus help practitioners to automatically analyze their demand and call suitable tools as solutions to fulfill their requirements. Inspired by the recent success of large language models (LLMs) in task automation, we develop a system, named MusicAgent, which integrates numerous music-related tools and an autonomous workflow to address user requirements. More specifically, we build 1) toolset that collects tools from diverse sources, including Hugging Face, GitHub, and Web API, etc. 2) an autonomous workflow empowered by LLMs (e.g., ChatGPT) to organize these tools and automatically decompose user requests into multiple sub-tasks and invoke corresponding music tools. The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect. By granting users the freedom to effortlessly combine tools, the system offers a seamless and enriching music experience.
摘要
人工智能 empowered 音乐处理是一个多样化的领域,包括多种任务,例如生成任务(如 timbre 合成)和理解任务(如音乐分类)。为开发者和爱好者而言,抓住这些任务的要求非常困难,尤其是在音乐数据表示和模型在不同平台之间的差异非常大。因此,需要建立一个系统来组织和集成这些任务,以帮助实践者自动分析他们的需求,并选择适合的工具来满足他们的要求。受大语言模型(LLM)的成功启发,我们开发了一个名为 MusicAgent 的系统,它集成了多种音乐相关的工具和一个自动化的工作流程,以解决用户的需求。更具体来说,我们建立了以下两个部分:1. 工具集,收集了来自多种源,包括 Hugging Face、GitHub 和 Web API 等等的工具。2. 由 LLM(如 ChatGPT) empowered 的自动化工作流程,用于组织这些工具,并自动将用户的请求分解成多个子任务,并对应的邀请合适的音乐工具。MusicAgent 系统的Primary Goal 是免除用户对 AI-音乐工具的繁琐,让他们可以专注于创作。通过让用户轻松地组合工具,系统提供了一个无缝和丰富的音乐体验。
Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding
results: 实验结果表明,提供不同的输入模式可以导致模型的不同行为,包括跨模式背景、跨语言背景和未grounded模型的不同行为。这些行为的差异可以在全数据集水平和特定词表示水平上被衡量。Abstract
Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying what the effects are - if any - of providing models with richer input sources than text-only. The crux of it lies in the construction of comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performances. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level as well as for specific word representations, depending on how concrete their semantics is.
摘要
基础接地(grounding)被认为是构建更完整、真正具备语义能力的人工智能系统的关键组成部分。文献对此分为两派:一些人认为接地能够带来性质上不同的泛化,另一些人则认为它可以被单模态数据的数量所弥补。支持或反对任一立场的实证证据都很有限,我们认为这源于研究接地及其对 NLP 系统影响时面临的方法论挑战。在这篇论文中,我们建立了一个方法论框架,用于研究为模型提供比纯文本更丰富的输入来源究竟会产生哪些影响(如果有的话)。其关键在于构建在不同输入模态上训练的、可相互比较的模型群体样本,从而把不同输入来源带来的性质差异与可量化的模型性能区分开来。基于该框架的实验表明,跨模态接地、跨语言接地与未接地的模型在行为上存在性质差异;我们既在整个数据集层面,也针对具体词表示(依其语义的具体程度)对这些差异进行了度量。
Investigating semantic subspaces of Transformer sentence embeddings through linear structural probing
results: 发现不同模型家族和模型大小具有不同的表现和层次动态,但表现相对较具同样性Abstract
The question of what kinds of linguistic information are encoded in different layers of Transformer-based language models is of considerable interest for the NLP community. Existing work, however, has overwhelmingly focused on word-level representations and encoder-only language models with the masked-token training objective. In this paper, we present experiments with semantic structural probing, a method for studying sentence-level representations via finding a subspace of the embedding space that provides suitable task-specific pairwise distances between data-points. We apply our method to language models from different families (encoder-only, decoder-only, encoder-decoder) and of different sizes in the context of two tasks, semantic textual similarity and natural-language inference. We find that model families differ substantially in their performance and layer dynamics, but that the results are largely model-size invariant.
摘要
基于 Transformer 的语言模型在不同层编码了哪些语言信息,是 NLP 社区非常关注的问题。然而,现有工作绝大多数集中在词级表示以及采用掩码 token 训练目标的纯编码器语言模型上。在这篇论文中,我们开展了语义结构探测实验,这是一种通过在嵌入空间中寻找能为数据点提供合适的任务特定两两距离的子空间来研究句子级表示的方法。我们将该方法应用于不同家族(纯编码器、纯解码器、编码器-解码器)和不同规模的语言模型,并在语义文本相似度和自然语言推理两个任务的背景下进行评估。我们发现,不同模型家族在性能和层间动态上差异显著,但结果在很大程度上与模型规模无关。
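A hedged sketch of linear structural probing: a linear map of frozen sentence embeddings is learned so that squared distances in the projected subspace match task-specific pairwise targets. The embeddings and target scores below are random stand-ins for real model representations and STS-style annotations.

```python
# Linear structural probe: fit a subspace whose distances match pairwise targets.
import torch

n_pairs, d, d_sub = 256, 768, 64
emb_a = torch.randn(n_pairs, d)                  # frozen embeddings of sentence A (synthetic)
emb_b = torch.randn(n_pairs, d)                  # frozen embeddings of sentence B (synthetic)
target = torch.rand(n_pairs)                     # gold pairwise dissimilarity in [0, 1] (synthetic)

W = torch.nn.Parameter(0.01 * torch.randn(d, d_sub))
opt = torch.optim.Adam([W], lr=1e-3)
for step in range(500):
    dist = ((emb_a @ W - emb_b @ W) ** 2).sum(-1)
    loss = ((dist - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("probe fit (MSE):", loss.item())
```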
Rather a Nurse than a Physician – Contrastive Explanations under Investigation
paper_authors: Oliver Eberle, Ilias Chalkidis, Laura Cabello, Stephanie Brandl
For: The paper aims to investigate the claim that contrastive explanations are closer to human explanations than non-contrastive explanations.* Methods: The paper uses four English text-classification datasets and fine-tunes three different models (RoBERTa, GPT-2, and T5) in three different sizes. It also applies three post-hoc explainability methods (LRP, GradientxInput, and GradNorm) to extract explanations.* Results: The paper finds that there is a high agreement between model-based rationales and human annotations, both in contrastive and non-contrastive settings. Additionally, model-based explanations computed in both settings align equally well with human rationales, indicating that humans do not necessarily explain in a contrastive manner.Abstract
Contrastive explanations, where one decision is explained in contrast to another, are supposed to be closer to how humans explain a decision than non-contrastive explanations, where the decision is not necessarily referenced to an alternative. This claim has never been empirically validated. We analyze four English text-classification datasets (SST2, DynaSent, BIOS and DBpedia-Animals). We fine-tune and extract explanations from three different models (RoBERTa, GPT-2, and T5), each in three different sizes and apply three post-hoc explainability methods (LRP, GradientxInput, GradNorm). We furthermore collect and release human rationale annotations for a subset of 100 samples from the BIOS dataset for contrastive and non-contrastive settings. A cross-comparison between model-based rationales and human annotations, both in contrastive and non-contrastive settings, yields a high agreement between the two settings for models as well as for humans. Moreover, model-based explanations computed in both settings align equally well with human rationales. Thus, we empirically find that humans do not necessarily explain in a contrastive manner. 9 pages, long paper at ACL 2022 proceedings.
摘要
对比性解释(contrastive explanation)指将一个决策与另一个备选决策相对照来进行解释,被认为比不涉及备选项的非对比性解释更接近人类的解释方式。然而,这一主张从未得到实证检验。我们分析了四个英文文本分类数据集(SST2、DynaSent、BIOS 和 DBpedia-Animals),微调了三种不同的模型(RoBERTa、GPT-2 和 T5),每种模型各有三个不同规模,并应用三种事后可解释性方法(LRP、GradientxInput、GradNorm)来提取解释。此外,我们还收集并发布了 BIOS 数据集中 100 个样本在对比与非对比设置下的人工理由标注。在对比与非对比两种设置下,将基于模型的理由与人工标注进行交叉比较,结果显示无论对模型还是对人类,两种设置之间的一致性都很高;而且在两种设置下计算出的模型解释与人工理由的吻合程度相当。因此,我们的实证结果表明,人类并不一定以对比的方式进行解释。
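The contrastive variant of an attribution method such as Gradient x Input can be sketched by attributing the difference between the target logit and a foil logit instead of the target logit alone. The toy classifier below is not one of the paper's models; it only illustrates the contrastive-versus-non-contrastive computation.

```python
# Contrastive vs non-contrastive Gradient x Input on a toy bag-of-embeddings classifier.
import torch

torch.manual_seed(0)
vocab, d, n_classes = 100, 16, 3
emb = torch.nn.Embedding(vocab, d)
clf = torch.nn.Linear(d, n_classes)

tokens = torch.tensor([5, 17, 42, 7])
x = emb(tokens)                                   # (seq, d)
x.retain_grad()
logits = clf(x.mean(0))

def token_attributions(score):
    if x.grad is not None:
        x.grad = None
    score.backward(retain_graph=True)
    return (x.grad * x).sum(-1)                   # Gradient x Input, one value per token

target, foil = 0, 1
plain = token_attributions(logits[target])                     # explain "why target?"
contrastive = token_attributions(logits[target] - logits[foil])  # explain "why target rather than foil?"
print("non-contrastive:", plain.detach())
print("contrastive:    ", contrastive.detach())
```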
From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification
methods: 我们采集了一个新的数据集RAVE:Rationale Variation in ECHR1,该数据集由两名国际人权法律领域专家标注而成,我们观察到了这两个专家之间的弱一致。我们研究了他们的不一致,并构建了两级独立任务的分类法,补充了COC特有的亚分类。这是法律自然语言处理领域中第一次关于人类标注变化的研究。
results: 我们量测不同分类类别的数据,发现主要的不一致来自于法律上下文的不充分规定,这种情况通常具有有限的精度和噪音。我们进一步评估了当今最佳COC模型在RAVE上的解释性,发现模型和专家之间的一致度有限。总之,我们的案例研究暴露了在法律自然语言处理领域创建标准数据集的复杂性,这些复杂性包括确定案例中 факт的重要性。Abstract
In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RAVE: Rationale Variation in ECHR1, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in the legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainablility of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.
摘要
在法律自然语言处理(NLP)中,案件结果分类(COC)不仅要准确,还必须可信且可解释。现有的可解释 COC 工作仅限于由单个专家进行的标注。然而,众所周知,律师在评估案件事实时可能存在分歧。因此,我们收集了一个新的数据集 RAVE(Rationale Variation in ECHR),它由国际人权法领域的两位专家标注,我们观察到两人之间的一致性较弱。我们研究了他们的分歧,并构建了一个两级的、与具体任务无关的分类体系,辅以 COC 特有的子类别。据我们所知,这是法律 NLP 中第一项关注人工标注差异的工作。我们对不同分类体系类别进行了定量评估,发现分歧主要源于法律语境的欠明确,而 COC 元数据通常粒度有限且含噪声,这使问题更具挑战性。我们还评估了最先进 COC 模型在 RAVE 上的可解释性,发现模型与专家之间的一致性有限。总体而言,我们的案例研究揭示了在法律 NLP 中构建基准数据集时此前被低估的复杂性,其核心在于识别案件事实中被认为与其结果相关的方面。
The Curious Case of Hallucinatory Unanswerability: Finding Truths in the Hidden States of Over-Confident Large Language Models
for: investigate the behavior of LLMs when presented with unanswerable queries
methods: use a combination of human evaluation and automated metrics to study the representation of answerability in LLMs’ latent spaces
results: find strong indications that LLMs encode the answerability of input queries, with the representation of the first decoded token often being a strong indicator, which can be used to develop improved decoding techniques for factual generation.
results: 发现 LLMs 在其内部表示中编码了输入问题的可回答性,其中首个解码 token 的表示往往是一个强指示信号;这一发现可用于开发在事实性生成方面更优的解码技术。Abstract
Large language models (LLMs) have been shown to possess impressive capabilities, while also raising crucial concerns about the faithfulness of their responses. A primary issue arising in this context is the management of unanswerable queries by LLMs, which often results in hallucinatory behavior, due to overconfidence. In this paper, we explore the behavior of LLMs when presented with unanswerable queries. We ask: do models \textbf{represent} the fact that the question is unanswerable when generating a hallucinatory answer? Our results show strong indications that such models encode the answerability of an input query, with the representation of the first decoded token often being a strong indicator. These findings shed new light on the spatial organization within the latent representations of LLMs, unveiling previously unexplored facets of these models. Moreover, they pave the way for the development of improved decoding techniques with better adherence to factual generation, particularly in scenarios where query unanswerability is a concern.
摘要
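One way to make the reported finding concrete is a linear probe trained on the hidden state that produced the first decoded token, predicting whether the query was answerable. The hidden states below are synthetic stand-ins; in practice they would be extracted from the LLM (for example via output_hidden_states), and the probing details are assumptions rather than the paper's exact protocol.

```python
# Linear probe for answerability on (synthetic) first-token hidden states.
import torch

n, d = 2000, 1024
torch.manual_seed(0)
direction = torch.randn(d)
answerable = torch.randint(0, 2, (n,)).float()
# Synthetic hidden states that weakly encode answerability along one direction.
hidden = torch.randn(n, d) + answerable[:, None] * direction * 0.1

probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(300):
    loss = loss_fn(probe(hidden).squeeze(-1), answerable)
    opt.zero_grad(); loss.backward(); opt.step()

acc = ((probe(hidden).squeeze(-1) > 0).float() == answerable).float().mean()
print("probe accuracy on synthetic data:", acc.item())
```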
Text Annotation Handbook: A Practical Guide for Machine Learning Projects
paper_authors: Felix Stollenwerk, Joey Öhman, Danila Petrelli, Emma Wallerö, Fredrik Olsson, Camilla Bengtsson, Andreas Horndahl, Gabriela Zarzar Gandler
for: 这份手册是一本关于文本标注任务的实用指南,用于介绍基本概念和实践技巧。
methods: 本文涉及了主要的技术方面,同时也触及了商业、伦理和法规问题。
results: 文件的重点是在于可读性和简洁性,而不是完整性和科学准确性。该手册可能会用于各种职业,如团队领导、项目经理、IT архитек、软件开发者和机器学习工程师。Abstract
This handbook is a hands-on guide on how to approach text annotation tasks. It provides a gentle introduction to the topic, an overview of theoretical concepts as well as practical advice. The topics covered are mostly technical, but business, ethical and regulatory issues are also touched upon. The focus lies on readability and conciseness rather than completeness and scientific rigor. Experience with annotation and knowledge of machine learning are useful but not required. The document may serve as a primer or reference book for a wide range of professions such as team leaders, project managers, IT architects, software developers and machine learning engineers.
摘要
这本手册是一本关于如何着手文本标注任务的实用指南。它提供了简明的入门介绍、理论概念概览以及实践建议。所涵盖的主题以技术性内容为主,但也涉及商业、伦理和法规问题。文本注重可读性和简洁性,而非完整性和科学严谨性。具备标注经验和机器学习知识会有所帮助,但并非必需。这份文档可以作为入门读物或参考书,适用于团队负责人、项目经理、IT 架构师、软件开发者和机器学习工程师等多种职业。
Language Agents for Detecting Implicit Stereotypes in Text-to-image Models at Scale
results: 研究发现,一些商业产品和开源文本到图像模型中的模型经常在某些提示中表现出严重的刻板印象,这些刻板印象与人类的性别、种族和宗教等社会维度有关。Abstract
The recent surge in the research of diffusion models has accelerated the adoption of text-to-image models in various Artificial Intelligence Generated Content (AIGC) commercial products. While these exceptional AIGC products are gaining increasing recognition and sparking enthusiasm among consumers, the questions regarding whether, when, and how these models might unintentionally reinforce existing societal stereotypes remain largely unaddressed. Motivated by recent advancements in language agents, here we introduce a novel agent architecture tailored for stereotype detection in text-to-image models. This versatile agent architecture is capable of accommodating free-form detection tasks and can autonomously invoke various tools to facilitate the entire process, from generating corresponding instructions and images, to detecting stereotypes. We build the stereotype-relevant benchmark based on multiple open-text datasets, and apply this architecture to commercial products and popular open source text-to-image models. We find that these models often display serious stereotypes when it comes to certain prompts about personal characteristics, social cultural context and crime-related aspects. In summary, these empirical findings underscore the pervasive existence of stereotypes across social dimensions, including gender, race, and religion, which not only validate the effectiveness of our proposed approach, but also emphasize the critical necessity of addressing potential ethical risks in the burgeoning realm of AIGC. As AIGC continues its rapid expansion trajectory, with new models and plugins emerging daily in staggering numbers, the challenge lies in the timely detection and mitigation of potential biases within these models.
摘要
近来扩散模型研究的兴起,加速了文本到图像模型在各类人工智能生成内容(AIGC)商业产品中的应用。尽管这些出色的 AIGC 产品正获得越来越多的认可并激发了消费者的热情,但这些模型是否、何时以及如何在无意中强化既有的社会刻板印象,这些问题在很大程度上仍未得到解答。受语言代理(language agents)最新进展的启发,我们提出了一种专为文本到图像模型刻板印象检测而设计的新型代理架构。这种通用的代理架构能够适应自由形式的检测任务,并可自主调用各种工具来完成整个流程:从生成相应的指令和图像,到检测刻板印象。我们基于多个开放文本数据集构建了与刻板印象相关的基准,并将该架构应用于商业产品和流行的开源文本到图像模型。我们发现,在涉及个人特征、社会文化背景以及犯罪相关方面的某些提示下,这些模型往往表现出严重的刻板印象。总之,这些实证结果凸显了刻板印象在性别、种族和宗教等社会维度上的普遍存在,这不仅验证了我们方法的有效性,也强调了在迅速发展的 AIGC 领域应对潜在伦理风险的紧迫性。随着 AIGC 持续快速扩张、新模型和插件每天大量涌现,及时检测并缓解这些模型中的潜在偏见是一项重要挑战。
Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling
results: 与旧的 SOTA 方法相比,采用我们方法的 Longformer 在 WIKI-727K 上将 $F_1$ 显著提高(73.74 -> 77.16)并将 $P_k$ 从 15.0 降至 13.89,同时在 WikiSection 上实现了 $P_k$ 平均 4.3% 的相对降低。Abstract
Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Due to its ability of automatically exploring clues of topic shift from abundant labeled data, recent supervised neural models have greatly promoted the development of long document topic segmentation, but leaving the deeper relationship between coherence and topic segmentation underexplored. Therefore, this paper enhances the ability of supervised models to capture coherence from both logical structure and semantic similarity perspectives to further improve the topic segmentation performance, proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). Specifically, the TSSP task is proposed to force the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at topic and sentence levels. Moreover, we utilize inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that the sentences representations in the same topic have higher similarity, while those in different topics are less similar. Extensive experiments show that the Longformer with our approach significantly outperforms old state-of-the-art (SOTA) methods. Our approach improve $F_1$ of old SOTA by 3.42 (73.74 -> 77.16) and reduces $P_k$ by 1.11 points (15.0 -> 13.89) on WIKI-727K and achieves an average relative reduction of 4.3% on $P_k$ on WikiSection. The average relative $P_k$ drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.
摘要
话题分割(topic segmentation)是获得结构化文档的关键,并能提升信息检索等下游任务的性能。得益于能够从大量标注数据中自动挖掘话题转移的线索,近来的有监督神经模型极大地推动了长文档话题分割的发展,但连贯性与话题分割之间更深层的关系仍未得到充分探索。因此,本文从逻辑结构和语义相似两个角度增强有监督模型捕捉连贯性的能力,以进一步提升话题分割性能,提出了话题感知句子结构预测(TSSP)和对比语义相似学习(CSSL)。具体来说,TSSP 任务通过在话题和句子两个层面同时扰乱原文档来构造乱序文档,并要求模型学习其中相邻句子的原有关系,从而迫使模型理解结构信息。此外,我们利用话题间和话题内的信息构造对比样本,并设计 CSSL 目标,使同一话题内句子表示的相似度更高,而不同话题的句子表示相似度更低。大量实验表明,采用我们方法的 Longformer 显著优于此前的最先进(SOTA)方法:在 WIKI-727K 上将旧 SOTA 的 $F_1$ 提高 3.42(73.74 -> 77.16),并将 $P_k$ 降低 1.11 个百分点(15.0 -> 13.89);在 WikiSection 上 $P_k$ 平均相对降低 4.3%。在两个域外数据集上 $P_k$ 平均相对降低 8.38%,也证明了我们方法的稳健性。
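The contrastive part (CSSL) can be sketched with an InfoNCE-style objective that pulls same-topic sentence representations together and pushes different-topic ones apart. The representations, topic labels, and temperature below are synthetic assumptions, not the paper's configuration.

```python
# InfoNCE-style contrastive loss over sentence representations grouped by topic.
import torch
import torch.nn.functional as F

def cssl_loss(reps, topic_ids, temperature=0.1):
    reps = F.normalize(reps, dim=-1)
    sim = reps @ reps.T / temperature                      # (n, n) scaled cosine similarities
    n = reps.size(0)
    same = (topic_ids[:, None] == topic_ids[None, :]).float()
    mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))             # ignore self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)
    pos = same.masked_fill(mask, 0)
    # Average log-probability of the positive (same-topic) sentences for each anchor.
    return -(log_prob * pos).sum(-1) / pos.sum(-1).clamp(min=1)

reps = torch.randn(8, 64, requires_grad=True)
topics = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
loss = cssl_loss(reps, topics).mean()
loss.backward()
print("CSSL loss:", loss.item())
```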
results: 我们对模型的性能进行了报告,并证明了模型的可靠性和高效性。Abstract
We have trained a named entity recognition (NER) model that screens Swedish job ads for different kinds of useful information (e.g. skills required from a job seeker). It was obtained by fine-tuning KB-BERT. The biggest challenge we faced was the creation of a labelled dataset, which required manual annotation. This paper gives an overview of the methods we employed to make the annotation process more efficient and to ensure high quality data. We also report on the performance of the resulting model.
摘要
我们训练了一个命名实体识别(NER)模型,用于从瑞典语招聘广告中筛选各类有用信息(例如对求职者的技能要求)。该模型通过微调 KB-BERT 获得。我们面临的最大挑战是构建标注数据集,这需要人工标注。本文概述了我们为提高标注效率并确保数据质量所采用的方法,并报告了所得模型的性能。
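A minimal sketch, under assumptions, of the kind of token-classification fine-tuning described: a BERT-style checkpoint (assumed here to be the public KB/bert-base-swedish-cased model) with a small label set and word-to-subword label alignment. The label set and the toy example are invented; the annotated job-ad data are not shown.

```python
# One fine-tuning step of a BERT-style model for NER (token classification).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-SKILL", "I-SKILL"]                     # illustrative label set
checkpoint = "KB/bert-base-swedish-cased"                # assumed KB-BERT checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

words = ["Vi", "söker", "erfarenhet", "av", "Python"]
word_labels = [0, 0, 0, 0, 1]                            # toy word-level BIO annotation

enc = tok(words, is_split_into_words=True, return_tensors="pt")
# Align word-level labels to subword tokens; special tokens get -100 (ignored by the loss).
aligned = [-100 if w is None else word_labels[w] for w in enc.word_ids(0)]
labels_t = torch.tensor([aligned])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
out = model(**enc, labels=labels_t)
out.loss.backward()
optimizer.step()
print("one training step done, loss =", out.loss.item())
```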
A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction
results: 研究表明,在提供相似案例和多选选项的情况下,大语言模型可以更好地回忆领域知识;但当 IR 系统足够强大时,LLM 的作用会变得多余。Abstract
Large language models (LLMs) have demonstrated great potential for domain-specific applications, such as the law domain. However, recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks. To systematically investigate their competency in the law, we design practical baseline solutions based on LLMs and test on the task of legal judgment prediction. In our solutions, LLMs can work alone to answer open questions or coordinate with an information retrieval (IR) system to learn from similar cases or solve simplified multi-choice questions. We show that similar cases and multi-choice options, namely label candidates, included in prompts can help LLMs recall domain knowledge that is critical for expertise legal reasoning. We additionally present an intriguing paradox wherein an IR system surpasses the performance of LLM+IR due to limited gains acquired by weaker LLMs from powerful IR systems. In such cases, the role of LLMs becomes redundant. Our evaluation pipeline can be easily extended into other tasks to facilitate evaluations in other domains. Code is available at https://github.com/srhthu/LM-CompEval-Legal
摘要
大型语言模型(LLM)在法律等专业领域应用中展现出巨大潜力。然而,近来围绕 GPT-4 法律能力评估的争议,使人们对其在真实法律任务中的表现产生疑问。为系统地考察 LLM 在法律领域的能力,我们基于 LLM 设计了实用的基线方案,并在法律判决预测任务上进行测试。在我们的方案中,LLM 既可以单独回答开放式问题,也可以与信息检索(IR)系统协同,从相似案例中学习,或求解简化后的多选题。我们发现,在提示中加入相似案例和多选选项(即候选标签)能够帮助 LLM 回忆对专业法律推理至关重要的领域知识。我们还观察到一个有趣的悖论:当较弱的 LLM 难以从强大的 IR 系统中获得额外增益时,单独的 IR 系统反而会超过 LLM+IR 的表现,此时 LLM 的作用变得多余。我们的评估流程可以方便地扩展到其他任务,以促进其他领域的评估。代码见 https://github.com/srhthu/LM-CompEval-Legal。
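A hedged sketch of the overall shape of the LLM+IR baseline: retrieve the most similar precedent cases with a simple lexical retriever and prepend them, together with candidate labels, to the prompt handed to the LLM. The cases, charges, and prompt format below are invented for illustration.

```python
# Build an LLM prompt from retrieved similar cases plus label candidates.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

case_db = [
    ("Defendant took goods from a store without paying.", "theft"),
    ("Defendant struck the victim during an argument.", "assault"),
    ("Defendant drove far above the speed limit.", "traffic offence"),
]
query = "The accused removed a bicycle from a locked shed at night."

vec = TfidfVectorizer().fit([c for c, _ in case_db] + [query])
sims = cosine_similarity(vec.transform([query]), vec.transform([c for c, _ in case_db]))[0]
top = sims.argsort()[::-1][:2]

label_candidates = sorted({label for _, label in case_db})
prompt = "Similar cases:\n"
for i in top:
    prompt += f"- {case_db[i][0]} -> {case_db[i][1]}\n"
prompt += f"\nCandidate charges: {', '.join(label_candidates)}\n"
prompt += f"New case: {query}\nMost likely charge:"
print(prompt)   # this prompt would then be sent to the LLM
```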
results: 研究发现,通过微调可以提高 ChatGPT 的情绪识别性能,但不同的情绪标签和数据集会影响其识别表现,表明存在内在的不稳定性和可能的偏差。Abstract
This technical report explores the ability of ChatGPT in recognizing emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition is not yet explored. Here, we conducted experiments to evaluate its performance of emotion recognition across different datasets and emotion labels. Our findings indicate a reasonable level of reproducibility in its performance, with noticeable improvement through fine-tuning. However, the performance varies with different emotion labels and datasets, highlighting an inherent instability and possible bias. The choice of dataset and emotion labels significantly impacts ChatGPT's emotion recognition performance. This paper sheds light on the importance of dataset and label selection, and the potential of fine-tuning in enhancing ChatGPT's emotion recognition capabilities, providing a groundwork for better integration of emotion analysis in applications using ChatGPT.
摘要
这份技术报告探讨了 ChatGPT 从文本中识别情绪的能力,这可以作为互动聊天机器人、数据标注和心理健康分析等多种应用的基础。尽管已有研究展示了 ChatGPT 在情感极性分析方面的基本能力,但其在更细粒度的情绪识别上的表现尚未得到探讨。在这里,我们通过实验评估了它在不同数据集和情绪标签下的情绪识别表现。我们的结果显示其表现具有一定的可复现性,并且通过微调可以获得明显提升。然而,其表现随情绪标签和数据集的不同而变化,凸显了内在的不稳定性和可能的偏差。数据集和情绪标签的选择对 ChatGPT 的情绪识别表现有显著影响。本文强调了数据集与标签选择的重要性以及微调在增强 ChatGPT 情绪识别能力方面的潜力,为在基于 ChatGPT 的应用中更好地整合情绪分析奠定了基础。
Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting
results: 实验结果表明,适应LMs存在两种不同的uncertainty,负责答案决策和格式偏好。此外,研究者还发现了这两种uncertainty对适应LMs的准确性的影响,并提出了一种简单的Synthetic alignment scheme来缓解这种情况。Abstract
Despite the significant progress made in practical applications of aligned language models (LMs), they tend to be overconfident in output answers compared to the corresponding pre-trained LMs. In this work, we systematically evaluate the impact of the alignment process on logit-based uncertainty calibration of LMs under the multiple-choice setting. We first conduct a thoughtful empirical study on how aligned LMs differ in calibration from their pre-trained counterparts. Experimental results reveal that there are two distinct uncertainties in LMs under the multiple-choice setting, which are responsible for the answer decision and the format preference of the LMs, respectively. Then, we investigate the role of these two uncertainties on aligned LM's calibration through fine-tuning in simple synthetic alignment schemes and conclude that one reason for aligned LMs' overconfidence is the conflation of these two types of uncertainty. Furthermore, we examine the utility of common post-hoc calibration methods for aligned LMs and propose an easy-to-implement and sample-efficient method to calibrate aligned LMs. We hope our findings could provide insights into the design of more reliable alignment processes for LMs.
摘要
尽管已经在实际应用中采用了对齐语言模型(LMs),但它们往往会比预训练LMs更加自信。在这项工作中,我们系统地评估了对齐过程对logit基于不确定性调整的LMs的影响。我们首先通过思ful的实验研究了对齐LMs与其预训练对应者的不确定性差异。实验结果表明,LMs在多选设定下存在两种不同的不确定性,一是答案决定不确定性,二是格式偏好不确定性。然后,我们通过简单的合成对齐方案进行了调整,并发现对齐LMs的一种原因是这两种不确定性的混淆。此外,我们还检查了常见后处calibration方法对对齐LMs的效果,并提出了一种容易实现和效率高的方法来调整对齐LMs。我们希望我们的发现能为LMs的设计提供指导。
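One of the standard post-hoc calibration baselines such a study considers is temperature scaling of the choice logits. The following self-contained sketch fits a temperature on synthetic data; it is not the paper's proposed method, only a common reference point.

```python
# Post-hoc temperature scaling for multiple-choice logits (synthetic data).
import torch

torch.manual_seed(0)
n, k = 1000, 4
labels = torch.randint(0, k, (n,))
# Overconfident synthetic logits: the correct class gets a large margin, then everything is scaled up.
logits = torch.randn(n, k)
logits[torch.arange(n), labels] += 3.0
logits *= 4.0

temperature = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.LBFGS([temperature], lr=0.1, max_iter=50)

def closure():
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(logits / temperature, labels)
    loss.backward()
    return loss

opt.step(closure)
print("fitted temperature:", temperature.item())
```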
Chain-of-Thought Tuning: Masked Language Models can also Think Step By Step in Natural Language Understanding
for: 这 paper aims to improve the performance of Large Language Models (LLMs) on Natural Language Understanding (NLU) tasks by extending the success of Chain-of-Thought (CoT) technique to MLMs.
methods: The proposed method, Chain-of-Thought Tuning (CoTT), is a two-step reasoning framework based on prompt tuning that enables MLMs to implement step-by-step thinking for NLU tasks.
results: The experiments on two NLU tasks, hierarchical classification and relation extraction, show that CoTT outperforms baselines and achieves state-of-the-art performance.Abstract
Chain-of-Thought (CoT) is a technique that guides Large Language Models (LLMs) to decompose complex tasks into multi-step reasoning through intermediate steps in natural language form. Briefly, CoT enables LLMs to think step by step. However, although many Natural Language Understanding (NLU) tasks also require thinking step by step, LLMs perform less well than small-scale Masked Language Models (MLMs). To migrate CoT from LLMs to MLMs, we propose Chain-of-Thought Tuning (CoTT), a two-step reasoning framework based on prompt tuning, to implement step-by-step thinking for MLMs on NLU tasks. From the perspective of CoT, CoTT's two-step framework enables MLMs to implement task decomposition; CoTT's prompt tuning allows intermediate steps to be used in natural language form. Thereby, the success of CoT can be extended to NLU tasks through MLMs. To verify the effectiveness of CoTT, we conduct experiments on two NLU tasks: hierarchical classification and relation extraction, and the results show that CoTT outperforms baselines and achieves state-of-the-art performance.
摘要
思维链(Chain-of-Thought,CoT)是一种引导大型语言模型(LLM)以自然语言形式的中间步骤将复杂任务分解为多步推理的技术。简而言之,CoT 使 LLM 能够一步一步地思考。然而,尽管许多自然语言理解(NLU)任务同样需要逐步思考,LLM 在这些任务上的表现却不如小规模的掩码语言模型(MLM)。为了将 CoT 从 LLM 迁移到 MLM,我们提出了思维链微调(Chain-of-Thought Tuning,CoTT),这是一个基于提示微调的两步推理框架,用于让 MLM 在 NLU 任务上实现逐步思考。从 CoT 的角度看,CoTT 的两步框架使 MLM 能够实现任务分解;CoTT 的提示微调使中间步骤能够以自然语言形式被使用。由此,CoT 的成功得以通过 MLM 扩展到 NLU 任务。为验证 CoTT 的有效性,我们在层次分类和关系抽取两个 NLU 任务上进行了实验,结果表明 CoTT 优于基线并取得了最先进的性能。
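A much-simplified illustration of the two-step idea with an off-the-shelf masked LM: the first prompt elicits an intermediate step in natural language, which is then inserted into a second prompt for the final prediction. The prompts and verbalizers are invented, and the prompt tuning that CoTT actually uses is omitted.

```python
# Two-step mask filling: intermediate step first, then the final prediction conditioned on it.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The patient was prescribed ibuprofen for knee pain."

# Step 1: produce an intermediate step in natural language form.
step1 = f"{sentence} The topic of this sentence is [MASK]."
intermediate = fill(step1, top_k=1)[0]["token_str"]

# Step 2: condition the final prediction on the intermediate step.
step2 = f"{sentence} The topic is {intermediate}, so the domain label is [MASK]."
final = fill(step2, top_k=1)[0]["token_str"]
print("intermediate:", intermediate, "| final:", final)
```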
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
results: LLM经过“反思调教”后,在多种评价指标上表现出色,超过了传统的数据集训练方法。Abstract
Recent advancements in Large Language Models (LLMs) have expanded the horizons of natural language understanding and generation. Notably, the output control and alignment with the input of LLMs can be refined through instruction tuning. However, as highlighted in several studies, low-quality data in the training set are usually detrimental to instruction tuning, resulting in inconsistent or even misleading LLM outputs. We propose a novel method, termed "reflection-tuning," which addresses the problem by self-improvement and judging capabilities of LLMs. This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data. Extensive experiments on widely used evaluation benchmarks show that LLMs trained with our recycled data outperform those trained with existing datasets in various benchmarks.
摘要
MISAR: A Multimodal Instructional System with Augmented Reality
results: 研究表明,通过使用大语言模型(LLMs)可以更好地估计任务性能,从而为AR系统提供更加适应性。I hope this helps! Let me know if you have any other questions.Abstract
Augmented reality (AR) requires the seamless integration of visual, auditory, and linguistic channels for optimized human-computer interaction. While auditory and visual inputs facilitate real-time and contextual user guidance, the potential of large language models (LLMs) in this landscape remains largely untapped. Our study introduces an innovative method harnessing LLMs to assimilate information from visual, auditory, and contextual modalities. Focusing on the unique challenge of task performance quantification in AR, we utilize egocentric video, speech, and context analysis. The integration of LLMs facilitates enhanced state estimation, marking a step towards more adaptive AR systems. Code, dataset, and demo will be available at https://github.com/nguyennm1024/misar.
摘要
增强现实(AR)需要将视觉、听觉和语言通道无缝整合,以实现最优的人机交互。虽然听觉和视觉输入能够提供实时且符合情境的用户引导,但大型语言模型(LLM)在这一领域的潜力在很大程度上仍未被挖掘。我们的研究提出了一种创新方法,利用 LLM 融合来自视觉、听觉和上下文模态的信息。针对 AR 中任务执行表现量化这一独特挑战,我们采用第一人称视频、语音和上下文分析。LLM 的整合提升了状态估计能力,朝着更具适应性的 AR 系统迈出了一步。代码、数据集和演示将发布于 https://github.com/nguyennm1024/misar。
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
paper_authors: Jiefeng Chen, Jinsung Yoon, Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, Somesh Jha
For: 提高大语言模型在高风险决策场景的可靠性。* Methods: 基于自我评估的参数有效性调整方法,以适应特定任务而进行适应。* Results: 在多个问答(QA)数据集上进行评估,比靡前状态艺的选择预测方法表现更好,例如在CoQA标准测试集上,AUACC从91.23%提高到92.63%,AUROC从74.61%提高到80.25%.Abstract
Large language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. Selective prediction is a technique that can be used to improve the reliability of the LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and improves the AUROC from 74.61% to 80.25%.
摘要
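Below is a toy numerical sketch of the selective-prediction idea discussed above: the model abstains whenever its self-assessed confidence falls below a threshold, and coverage is traded off against selective accuracy. The confidence scores are synthetic placeholders, not outputs of the paper's tuned LLM, and the AUACC/AUROC numbers from the abstract are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
correct = rng.random(n) < 0.8                      # whether each answer is right (synthetic)
# Synthetic self-evaluation scores: slightly higher when the answer happens to be correct.
confidence = np.clip(rng.normal(0.7, 0.15, n) + 0.15 * correct, 0, 1)

for tau in [0.0, 0.6, 0.7, 0.8, 0.9]:
    answered = confidence >= tau                   # abstain below the threshold
    coverage = answered.mean()
    accuracy = correct[answered].mean() if answered.any() else float("nan")
    print(f"threshold={tau:.1f}  coverage={coverage:.2f}  selective accuracy={accuracy:.2f}")
```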
Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention
results: 研究发现,softmax注意力在大多数场景下具有更高的性能,而线性注意力的计算复杂度更低、效率更高。Abstract
Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons for why softmax attention outperforms linear attention in most scenarios.
摘要
大型转换器模型在多种自然语言处理任务中取得了最先进的结果。在转换器架构中,注意机制通过softmax函数捕捉序列中Token之间的交互,起着关键作用。相比之下,线性注意以线性复杂度近似softmax操作,提供了计算效率更高的替代方案,但与传统的softmax注意机制相比存在明显的性能下降。在这篇论文中,我们尝试填补对softmax注意与线性注意之间实际性能差距的理论理解空白。通过对这两种注意机制进行全面的比较分析,我们阐明了softmax注意机制在大多数场景下优于线性注意机制的根本原因。
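The contrast between the two mechanisms can be made concrete in a few lines of NumPy: softmax attention materializes an n-by-n weight matrix (quadratic in sequence length), while a kernelized linear attention reorders the multiplication so that only d-by-d quantities are formed. The feature map below (ELU + 1, a common choice in the linear-attention literature) is illustrative and not necessarily the variant analyzed in the paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # O(n^2 d): materializes the full n x n attention matrix
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    # O(n d^2): computes (phi(K)^T V) first, never forming an n x n matrix
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                   # (d, d_v)
    normalizer = Qp @ Kp.sum(axis=0)                # (n,)
    return (Qp @ kv) / normalizer[:, None]

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```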
Open-ended Commonsense Reasoning with Unrestricted Answer Scope
methods: 利用预训练语言模型在外部知识库上迭代检索推理路径,不需要任务特定的监督。
results: 对两个常识 benchmarck 数据集进行实验,与其他方法相比,提出的方法表现更好, both quantitatively and qualitatively。Abstract
Open-ended Commonsense Reasoning is defined as solving a commonsense question without providing 1) a short list of answer candidates and 2) a pre-defined answer scope. Conventional ways of formulating the commonsense question into a question-answering form or utilizing external knowledge to learn retrieval-based methods are less applicable in the open-ended setting due to an inherent challenge. Without pre-defining an answer scope or a few candidates, open-ended commonsense reasoning entails predicting answers by searching over an extremely large searching space. Moreover, most questions require implicit multi-hop reasoning, which presents even more challenges to our problem. In this work, we leverage pre-trained language models to iteratively retrieve reasoning paths on the external knowledge base, which does not require task-specific supervision. The reasoning paths can help to identify the most precise answer to the commonsense question. We conduct experiments on two commonsense benchmark datasets. Compared to other approaches, our proposed method achieves better performance both quantitatively and qualitatively.
摘要
MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction
results: 实验结果表明,一个具有高亲和度和适当多样性的数据扩充策略可以更好地提高 GEC 模型的性能。此外,提出的 MixEdit 数据扩充方法可以在不需要额外单语语料的情况下,有策略地、动态地扩充真实数据,从而提高 GEC 模型的性能。Abstract
Data Augmentation through generating pseudo data has been proven effective in mitigating the challenge of data scarcity in the field of Grammatical Error Correction (GEC). Various augmentation strategies have been widely explored, most of which are motivated by two heuristics, i.e., increasing the distribution similarity and diversity of pseudo data. However, the underlying mechanism responsible for the effectiveness of these strategies remains poorly understood. In this paper, we aim to clarify how data augmentation improves GEC models. To this end, we introduce two interpretable and computationally efficient measures: Affinity and Diversity. Our findings indicate that an excellent GEC data augmentation strategy characterized by high Affinity and appropriate Diversity can better improve the performance of GEC models. Based on this observation, we propose MixEdit, a data augmentation approach that strategically and dynamically augments realistic data, without requiring extra monolingual corpora. To verify the correctness of our findings and the effectiveness of the proposed MixEdit, we conduct experiments on mainstream English and Chinese GEC datasets. The results show that MixEdit substantially improves GEC models and is complementary to traditional data augmentation methods.
摘要
数据扩充通过生成伪数据,已被证明能有效缓解语法错误纠正(GEC)领域的数据稀缺问题。各种扩充策略已被广泛探索,其中大多数基于两个启发,即提高伪数据的分布相似性和多样性。然而,这些策略为何有效,其背后机制仍然不够清楚。在这篇论文中,我们的目的是阐明数据扩充如何改进GEC模型。为此,我们引入了两种可解释且计算高效的度量:亲和度(Affinity)和多样性(Diversity)。我们的发现表明,一个具有高亲和度和适当多样性的GEC数据扩充策略可以更好地提高GEC模型的性能。基于这一观察,我们提出了 MixEdit,一种有策略地、动态地扩充真实数据的数据扩充方法,且不需要额外的单语语料。为了验证我们的发现和 MixEdit 的有效性,我们在主流的英语和中文GEC数据集上进行了实验。结果表明,MixEdit 能显著提高 GEC 模型的性能,并与传统的数据扩充方法互补。
Field-testing items using artificial intelligence: Natural language processing with transformers
results: 研究发现,RoBERTa模型可以准确地回答英语文本理解测试中的29个多选题。数据还用于计算测试题的心理特性,与人类考生数据显示一定的一致性。Abstract
Five thousand variations of the RoBERTa model, an artificially intelligent "transformer" that can understand text language, completed an English literacy exam with 29 multiple-choice questions. Data were used to calculate the psychometric properties of the items, which showed some degree of agreement to those obtained from human examinee data.
摘要
五千个RoBERTa模型的变体(一种能够理解文本语言的人工智能"变换器")完成了一个包含29道多选题的英语读写能力测验。利用这些数据计算了各测验项目的心理测量特性,结果与人类考生数据显示出一定程度的一致。
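For readers unfamiliar with the psychometric side, the snippet below computes two classical item statistics, difficulty (proportion correct) and point-biserial discrimination, from a synthetic 0/1 response matrix with the same shape as the study's (5000 respondents by 29 items). The responses are random placeholders, not the actual RoBERTa outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 5000, 29
responses = (rng.random((n_models, n_items)) < rng.uniform(0.3, 0.9, n_items)).astype(float)

difficulty = responses.mean(axis=0)                 # proportion of correct responses per item
total = responses.sum(axis=1)
# Point-biserial discrimination: correlation of each item with the rest-of-test score.
discrimination = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(n_items)
])
print(difficulty.round(2))
print(discrimination.round(2))
```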
Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model
results: 实验表明,FFLM 在不同任务上可以与 ChatGPT 持平甚至更优,而参数量少24倍。FFLM 还超过了其他强基线。Abstract
Despite tremendous improvements in natural language generation, summarization models still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using models trained on the other tasks or in-domain synthetic data, or prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric FFLM, which is a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.
摘要
尽管自然语言生成技术已经做出了很大的进步,摘要模型仍然面临着不忠问题。前一任的工作通常使用其他任务训练的模型或者域内生成的数据来评估忠诚性,或者激活大型模型如ChatGPT。这篇论文提议使用一个 moderately-sized 基础语言模型进行零 shot 忠诚性评估。我们引入了一个新的度量FFLM,它是基于输出预测概率变化的 prefixing 语句的 intuition。实验表明,FFLM 与 ChatGPT 在不一致检测和忠诚评分中表现竞争,并且具有24倍少的参数。FFLM 还超过了其他强大基elines。
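A minimal sketch of the underlying intuition, using GPT-2 through Hugging Face: score a summary's average token log-probability with and without the source document prefixed, and treat a larger gain as evidence of faithfulness. The texts and the scoring function are placeholders; FFLM itself combines several probability-change terms and is more involved than this.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_logprob(target, prefix):
    """Average log p(target tokens | prefix) under the causal LM."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = torch.cat([torch.full_like(prefix_ids, -100), target_ids], dim=1)  # score only the target
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss   # mean NLL over the target tokens
    return -loss.item()

document = "The city council approved the new bike lane plan on Tuesday after a long debate."
faithful = "The council approved the bike lane plan."
unfaithful = "The council rejected the bike lane plan."

for summary in (faithful, unfaithful):
    gain = avg_logprob(summary, prefix=document + " Summary:") - avg_logprob(summary, prefix="Summary:")
    print(f"probability gain from prefixing the source: {gain:+.3f}   {summary}")
```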
Systematic Assessment of Factual Knowledge in Large Language Models
results: 实验表明,ChatGPT在各个领域中表现最佳,而LLM的表现受到 instrucion finetuning、领域、问题复杂度和Contextual Adversarial的影响。Abstract
Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate the state-of-the-art LLMs with KGs in generic and specific domains. The experiment shows that ChatGPT is consistently the top performer across all domains. We also find that LLMs performance depends on the instruction finetuning, domain and question complexity and is prone to adversarial context.
摘要
先前的研究通过现有的问答基准来评估大语言模型(LLM)中存储的知识,但这种方法在事实知识覆盖面上存在局限,因为它主要集中在通用领域,而这些领域可能与预训练数据重叠。本文提出了一个框架,通过利用知识图谱(KG)来系统地评估 LLM 的事实知识。我们的框架能自动从给定 KG 中的事实生成问题和预期答案,然后评估 LLM 回答这些问题的准确性。我们在通用和特定领域上系统地评估了当今最先进的 LLM。实验结果显示,ChatGPT 在所有领域中表现最佳。我们还发现,LLM 的性能取决于指令微调、领域和问题复杂度,并且容易受到对抗性上下文的影响。
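A toy version of the question-generation-and-scoring loop might look like the following, where knowledge-graph triples are turned into cloze-style questions and answers are checked by exact match. The triples are made up and `ask_llm` is a stub standing in for a call to an actual LLM API.

```python
# Toy triples; a real knowledge graph would supply many more facts.
triples = [
    ("Paris", "is the capital of", "France"),
    ("Insulin", "is used to treat", "diabetes"),
    ("Mount Everest", "is located in", "the Himalayas"),
]

def make_question(subj, rel, obj):
    return f"{subj} {rel} which entity?", obj

def ask_llm(question):
    # Placeholder for a call to an actual LLM; returns a canned answer here.
    canned = {"Paris is the capital of which entity?": "France"}
    return canned.get(question, "unknown")

correct = 0
for subj, rel, obj in triples:
    question, expected = make_question(subj, rel, obj)
    answer = ask_llm(question)
    correct += int(answer.strip().lower() == expected.strip().lower())
print(f"factual accuracy: {correct}/{len(triples)}")
```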
MAGNIFICo: Evaluating the In-Context Learning Ability of Large Language Models to Generalize to Novel Interpretations
results: 实验结果表明,LLMs在自然语言描述和长对话中能够很好地理解新的解释,但是我们的研究也发现,当面临不熟悉的词语或同时构建多个新解释时,LLMs的性能仍然有所不足。此外,我们的分析还揭示了LLMs中的semantic predispositions,以及长context中的recency bias的影响。Abstract
Humans possess a remarkable ability to assign novel interpretations to linguistic expressions, enabling them to learn new words and understand community-specific connotations. However, Large Language Models (LLMs) have a knowledge cutoff and are costly to finetune repeatedly. Therefore, it is crucial for LLMs to learn novel interpretations in-context. In this paper, we systematically analyse the ability of LLMs to acquire novel interpretations using in-context learning. To facilitate our study, we introduce MAGNIFICo, an evaluation suite implemented within a text-to-SQL semantic parsing framework that incorporates diverse tokens and prompt settings to simulate real-world complexity. Experimental results on MAGNIFICo demonstrate that LLMs exhibit a surprisingly robust capacity for comprehending novel interpretations from natural language descriptions as well as from discussions within long conversations. Nevertheless, our findings also highlight the need for further improvements, particularly when interpreting unfamiliar words or when composing multiple novel interpretations simultaneously in the same example. Additionally, our analysis uncovers the semantic predispositions in LLMs and reveals the impact of recency bias for information presented in long contexts.
摘要
人类具有强大的语言表达重新解释能力,可以学习新词和社区特有的含义。然而,大型自然语言模型(LLM)具有知识割辑和重新训练成本高的问题。因此,LLM需要在Context中学习新的解释。本文系统地分析了LLM在Context中学习新解释的能力。为了促进我们的研究,我们提出了MAGNIFICo评价集,该集包括多种Token和提示设置,以 simulate real-world complexity。实验结果表明,LLM在自然语言描述和长 conversations中的讨论中能够很好地理解新解释。然而,我们的发现也表明,当解释不熟悉的词语或者在同一个例子中同时构成多个新解释时,LLM的表现仍然需要进一步改进。此外,我们的分析还揭示了LLM中的含义偏好和长 context中的新信息偏好。
results: 在完全反馈模型中,learner 可以保证相对于事后最佳固定价格的 $\tilde{O}(\sqrt{T})$ regret,这在数量级上是最优的。在部分反馈模型中,提供了一个 regret 上界为 $\tilde{O}(T^{3/4})$ 的算法,并给出了一个几乎匹配的下界。Abstract
Bilateral trade revolves around the challenge of facilitating transactions between two strategic agents -- a seller and a buyer -- both of whom have a private valuations for the item. We study the online version of the problem, in which at each time step a new seller and buyer arrive. The learner's task is to set a price for each agent, without any knowledge about their valuations. The sequence of sellers and buyers is chosen by an oblivious adversary. In this setting, known negative results rule out the possibility of designing algorithms with sublinear regret when the learner has to guarantee budget balance for each iteration. In this paper, we introduce the notion of global budget balance, which requires the agent to be budget balance only over the entire time horizon. By requiring global budget balance, we provide the first no-regret algorithms for bilateral trade with adversarial inputs under various feedback models. First, we show that in the full-feedback model the learner can guarantee $\tilde{O}(\sqrt{T})$ regret against the best fixed prices in hindsight, which is order-wise optimal. Then, in the case of partial feedback models, we provide an algorithm guaranteeing a $\tilde{O}(T^{3/4})$ regret upper bound with one-bit feedback, which we complement with a nearly-matching lower bound. Finally, we investigate how these results vary when measuring regret using an alternative benchmark.
摘要
双边贸易(bilateral trade)围绕着促成两个策略代理人(一个卖家和一个买家)之间交易的挑战,这两个代理人对商品都有私人估值。我们研究该问题的在线版本:在每个时间步骤中,会有新的卖家和买家到来,学习者的任务是在不了解他们估值的情况下为双方设定价格。卖家和买家的序列由一个遗忘型对手选择。在这个设定下,已知的负结果排除了在每一轮都要求预算平衡时设计次线性 regret 算法的可能性。在这篇论文中,我们引入全局预算平衡的概念,它只要求学习者在整个时间范围内保持预算平衡。通过要求全局预算平衡,我们在不同的反馈模型下给出了首个对抗性输入下双边贸易的无憾算法。首先,在完全反馈模型中,我们证明学习者可以相对于事后最佳固定价格保证 $\tilde{O}(\sqrt{T})$ regret,这在数量级上是最优的。然后,在部分反馈模型中,我们提供了一个在单比特反馈下保证 $\tilde{O}(T^{3/4})$ regret 上界的算法,并补充了一个几乎匹配的下界。最后,我们研究了在使用另一种基准度量 regret 时这些结果如何变化。
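To make the protocol concrete, here is a toy simulation of the online bilateral-trade setting under full feedback: one price is posted per round, a trade happens only when it lies between the seller's and the buyer's valuations, and the learner's cumulative gain from trade is compared to the best fixed price on a grid. The discretized grid and the follow-the-leader update are purely illustrative and are not the paper's algorithm or its regret guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
prices = np.linspace(0.05, 0.95, 19)          # discretized price grid (illustrative)
cum_gft = np.zeros_like(prices)               # cumulative gain-from-trade per grid price

total = 0.0
for t in range(T):
    s, b = rng.beta(2, 5), rng.beta(5, 2)     # seller and buyer valuations (unknown to the learner)
    p = prices[np.argmax(cum_gft)]            # follow the best grid price so far
    if s <= p <= b:                           # both agents accept the posted price
        total += b - s                        # gain from trade
    # Full feedback: valuations are revealed afterwards, so every grid price can be updated.
    cum_gft += np.where((s <= prices) & (prices <= b), b - s, 0.0)

print(f"learner GFT: {total:.1f}   best fixed grid price GFT: {cum_gft.max():.1f}")
```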
MARVEL: Multi-Agent Reinforcement-Learning for Large-Scale Variable Speed Limits
results: 对于一段7英里的高速公路,MARL方法提高了交通安全性63.4%,并提高了交通流体性14.6%,相比于现有的实践算法。此外,文章还进行了解释性分析,以了解代理在不同交通条件下的决策过程。最后,文章测试了在实际数据上的策略,以证明该策略的可部署性。Abstract
Variable speed limit (VSL) control is a promising traffic management strategy for enhancing safety and mobility. This work introduces MARVEL, a multi-agent reinforcement learning (MARL) framework for implementing large-scale VSL control on freeway corridors using only commonly available data. The agents learn through a reward structure that incorporates adaptability to traffic conditions, safety, and mobility; enabling coordination among the agents. The proposed framework scales to cover corridors with many gantries thanks to a parameter sharing among all VSL agents. The agents are trained in a microsimulation environment based on a short freeway stretch with 8 gantries spanning 7 miles and tested with 34 gantries spanning 17 miles of I-24 near Nashville, TN. MARVEL improves traffic safety by 63.4% compared to the no control scenario and enhances traffic mobility by 14.6% compared to a state-of-the-practice algorithm that has been deployed on I-24. An explainability analysis is undertaken to explore the learned policy under different traffic conditions and the results provide insights into the decision-making process of agents. Finally, we test the policy learned from the simulation-based experiments on real input data from I-24 to illustrate the potential deployment capability of the learned policy.
摘要
Variable speed limit (VSL) 控制是一种有前途的交通管理策略,可以提高安全性和流动性。这项工作介绍了 MARVEL,一种多代理学习 (MARL) 框架,用于实现大规模 VSL 控制在高速公路段上,只使用常见的数据。代理学习的奖励结构包括适应交通条件、安全性和流动性,使代理之间协调。提出的框架可以涵盖覆盖许多斜塔,因为所有 VSL 代理的参数共享。代理在基于微观 simulate 环境中学习,该环境基于一段长7英里的高速公路,涵盖8个斜塔。代理在基于实际数据进行测试,并在I-24公路上进行了17英里的测试。 MARVEL 可以提高交通安全性63.4%,并提高交通流动性14.6%,相比之前的实践算法。 Explainability 分析用于探索不同交通条件下代理学习的策略,结果提供了决策过程中代理的启示。最后,我们将在实际数据上测试从 simulate 中学习的策略,以 illustrate 学习的可部署性。
Networkwide Traffic State Forecasting Using Exogenous Information: A Multi-Dimensional Graph Attention-Based Approach
results: 实验结果表明,M-STGAT在使用加利福尼亚交通部门(Caltrans)性能衡量系统(PeMS)提供的交通速度和路况数据,并与国家海洋和大气管理局(NOAA)自动Surface Observing Systems(ASOS)提供的天气数据进行比较,在30、45和60分钟预测时间 horizons 上表现出了较好的预测性能,其中error measures包括 Mean Absolute Error(MAE)、Root Mean Square Error(RMSE)和Mean Absolute Percentage Error(MAPE)。但是,模型的传送性可能需要进一步的调查。Abstract
Traffic state forecasting is crucial for traffic management and control strategies, as well as user- and system-level decision making in the transportation network. While traffic forecasting has been approached with a variety of techniques over the last couple of decades, most approaches simply rely on endogenous traffic variables for state prediction, despite the evidence that exogenous factors can significantly impact traffic conditions. This paper proposes a multi-dimensional spatio-temporal graph attention-based traffic prediction approach (M-STGAT), which predicts traffic based on past observations of speed, along with lane closure events, temperature, and visibility across the transportation network. The approach is based on a graph attention network architecture, which also learns based on the structure of the transportation network on which these variables are observed. Numerical experiments are performed using traffic speed and lane closure data from the California Department of Transportation (Caltrans) Performance Measurement System (PeMS). The corresponding weather data were downloaded from the National Oceanic and Atmospheric Administration (NOOA) Automated Surface Observing Systems (ASOS). For comparison, the numerical experiments implement three alternative models which do not allow for the multi-dimensional input. The M-STGAT is shown to outperform the three alternative models, when performing tests using our primary data set for prediction with a 30-, 45-, and 60-minute prediction horizon, in terms of three error measures: Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). However, the model's transferability can vary for different transfer data sets and this aspect may require further investigation.
摘要
交通状况预测是交通管理和控制策略以及用户和系统层次的决策中非常重要的一环。自过去几十年来,交通预测已经使用了多种技术,但大多数方法都仅仅基于内生的交通变量进行预测,尽管外生因素可能对交通条件产生重要影响。这篇论文提出了一种多维度空间时间图注意力基本交通预测方法(M-STGAT),该方法基于过去观测到的速度,以及路段 closure事件、温度和视程等外生因素进行预测。该方法基于图注意力网络架构,同时还学习了交通网络上这些变量的结构。我们使用了加利福尼亚交通部门(Caltrans)性能测量系统(PeMS)中的交通速度和路段 closure数据进行数值实验,并下载了国家海洋和大气管理局(NOAA)自动地面观测系统(ASOS)中的天气数据。为比较,我们实现了三种不允许多维度输入的数学模型。M-STGAT在使用我们的主要数据集进行预测时,在30-, 45-, 和 60-分钟预测距离时表现出了与三个错误度量( Mean Absolute Error,Root Mean Square Error 和 Mean Absolute Percentage Error)相对较高的性能。然而,模型的传输性可能会随着不同的传输数据集而异。这一点可能需要进一步的调查。
Equipping Federated Graph Neural Networks with Structure-aware Group Fairness
results: 该方法在许多基准方法上表现出色,在公平性和模型准确性两个方面均有显著提升。Abstract
Graph Neural Networks (GNNs) have been widely used for various types of graph data processing and analytical tasks in different domains. Training GNNs over centralized graph data can be infeasible due to privacy concerns and regulatory restrictions. Thus, federated learning (FL) becomes a trending solution to address this challenge in a distributed learning paradigm. However, as GNNs may inherit historical bias from training data and lead to discriminatory predictions, the bias of local models can be easily propagated to the global model in distributed settings. This poses a new challenge in mitigating bias in federated GNNs. To address this challenge, we propose $\text{F}^2$GNN, a Fair Federated Graph Neural Network, that enhances group fairness of federated GNNs. As bias can be sourced from both data and learning algorithms, $\text{F}^2$GNN aims to mitigate both types of bias under federated settings. First, we provide theoretical insights on the connection between data bias in a training graph and statistical fairness metrics of the trained GNN models. Based on the theoretical analysis, we design $\text{F}^2$GNN which contains two key components: a fairness-aware local model update scheme that enhances group fairness of the local models on the client side, and a fairness-weighted global model update scheme that takes both data bias and fairness metrics of local models into consideration in the aggregation process. We evaluate $\text{F}^2$GNN empirically versus a number of baseline methods, and demonstrate that $\text{F}^2$GNN outperforms these baselines in terms of both fairness and model accuracy.
摘要
GRAPH Neural Networks (GNNs) 在不同领域中对各种图数据进行处理和分析任务广泛使用。在中央化图数据上训练 GNNs 可能因为隐私问题和管制约束而成为不可能的。因此,联邦学习 (FL) 成为一种解决这个挑战的趋势。然而, GNNs 可能从训练数据中继承历史偏见,并导致歧视性预测,因此在分布式设置下,本地模型的偏见可能被轻松传播到全球模型。这种挑战需要解决偏见在联邦 GNN 中的问题。为此,我们提出了 $\text{F}^2$GNN,一种增强分布式 Graph Neural Network 的分组公平性。由于偏见可以来自数据和学习算法,$\text{F}^2$GNN 采用了两个关键组成部分:在客户端上使用公平性意识的本地模型更新方案,以及在聚合过程中考虑本地模型的公平性度量和数据偏见的准确度。我们对 $\text{F}^2$GNN 进行了理论分析,并对其与一些基准方法进行了实验比较,并证明 $\text{F}^2$GNN 在公平性和模型准确性两个方面都高于基准方法。
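A schematic sketch of fairness-weighted aggregation in a single federated round is given below: each client reports its parameters, its data size, and a group-fairness score (here, one minus the statistical parity gap), and the server weighs clients by both quantities. The exact weighting scheme in $\text{F}^2$GNN differs, and all numbers here are synthetic placeholders.

```python
import numpy as np

def statistical_parity_gap(preds, groups):
    """Absolute difference in positive-prediction rates between two groups."""
    return abs(preds[groups == 0].mean() - preds[groups == 1].mean())

# Placeholder client states: (parameter vector, number of samples, fairness score).
rng = np.random.default_rng(0)
clients = []
for _ in range(4):
    params = rng.normal(size=10)                              # stand-in for GNN weights
    preds = (rng.random(200) > 0.5).astype(float)
    groups = rng.integers(0, 2, 200)
    fairness = 1.0 - statistical_parity_gap(preds, groups)    # higher is fairer
    clients.append((params, 200, fairness))

# Server: combine the usual data-size weights with the fairness scores.
weights = np.array([n * f for _, n, f in clients], dtype=float)
weights /= weights.sum()
global_params = sum(w * p for w, (p, _, _) in zip(weights, clients))
print(weights.round(3), global_params[:3].round(3))
```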
Tracking electricity losses and their perceived causes using nighttime light and social media
results: 研究发现,夜晚照明强度与停电Region之间存在 inverse 关系。twitter上提到委内瑞拉总统的帖子具有更高的负面性和更多的责任相关词汇,这表明公众归咎政府对停电的责任。Abstract
Urban environments are intricate systems where the breakdown of critical infrastructure can impact both the economic and social well-being of communities. Electricity systems hold particular significance, as they are essential for other infrastructure, and disruptions can trigger widespread consequences. Typically, assessing electricity availability requires ground-level data, a challenge in conflict zones and regions with limited access. This study shows how satellite imagery, social media, and information extraction can monitor blackouts and their perceived causes. Night-time light data (in March 2019 for Caracas, Venezuela) is used to indicate blackout regions. Twitter data is used to determine sentiment and topic trends, while statistical analysis and topic modeling delved into public perceptions regarding blackout causes. The findings show an inverse relationship between nighttime light intensity. Tweets mentioning the Venezuelan President displayed heightened negativity and a greater prevalence of blame-related terms, suggesting a perception of government accountability for the outages.
摘要
results: 对三个实际数据集进行了广泛的实验,结果显示,我们的方法在不同的设定下都能够超越现有方法,从而在TSAD领域实现了新的状态级表现。Abstract
Numerous methods for time series anomaly detection (TSAD) methods have emerged in recent years. Most existing methods are unsupervised and assume the availability of normal training samples only, while few supervised methods have shown superior performance by incorporating labeled anomalous samples in the training phase. However, certain anomaly types are inherently challenging for unsupervised methods to differentiate from normal data, while supervised methods are constrained to detecting anomalies resembling those present during training, failing to generalize to unseen anomaly classes. This paper is the first attempt in providing a novel approach for the open-set TSAD problem, in which a small number of labeled anomalies from a limited class of anomalies are visible in the training phase, with the objective of detecting both seen and unseen anomaly classes in the test phase. The proposed method, called Multivariate Open-Set timeseries Anomaly Detection (MOSAD) consists of three primary modules: a Feature Extractor to extract meaningful time-series features; a Multi-head Network consisting of Generative-, Deviation-, and Contrastive heads for capturing both seen and unseen anomaly classes; and an Anomaly Scoring module leveraging the insights of the three heads to detect anomalies. Extensive experiments on three real-world datasets consistently show that our approach surpasses existing methods under various experimental settings, thus establishing a new state-of-the-art performance in the TSAD field.
摘要
Recently, many time series anomaly detection (TSAD) methods have been proposed. Most of these methods are unsupervised and assume the availability of normal training samples, while only a few supervised methods have shown better performance by incorporating labeled anomalous samples in the training phase. However, some anomaly types are difficult for unsupervised methods to distinguish from normal data, while supervised methods are limited to detecting anomalies similar to those present during training and cannot handle unseen anomaly classes. This paper is the first attempt to solve the open-set TSAD problem, in which a small number of labeled anomalies from a limited class of anomalies are available during training, with the goal of detecting both seen and unseen anomaly classes in the test phase.The proposed method, called Multivariate Open-Set Time Series Anomaly Detection (MOSAD), consists of three primary modules: a Feature Extractor to extract meaningful time-series features; a Multi-head Network consisting of Generative-, Deviation-, and Contrastive heads to capture both seen and unseen anomaly classes; and an Anomaly Scoring module that leverages the insights of the three heads to detect anomalies. Extensive experiments on three real-world datasets consistently show that our approach outperforms existing methods under various experimental settings, thereby establishing a new state-of-the-art performance in the TSAD field.
A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs
results: 学习omega-正则目标的Markov决策过程中的可能approx Correct算法。Abstract
Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes. Unlike prior approaches, our algorithm learns from sampled trajectories of the system and does not require prior knowledge of the system's topology.
摘要
线性时序逻辑(LTL)和ω-正则目标——LTL的超集——最近被用于在强化学习中表达非马尔可夫目标。我们介绍了一种基于模型的可能近似正确(PAC)学习算法,用于Markov决策过程中的ω-正则目标。与先前的方法不同,我们的算法从系统的采样轨迹中学习,不需要预先知道系统的拓扑结构。
Fast Parameter Inference on Pulsar Timing Arrays with Normalizing Flows
results: 该论文的实验结果表明,使用 conditional normalizing flows 技术可以大幅提高脉冲星计时阵列数据中随机引力波背景(SGWB)posterior distribution 的计算效率,将采样时间从原来的数天到一周缩短到只需几秒钟。Abstract
Pulsar timing arrays (PTAs) perform Bayesian posterior inference with expensive MCMC methods. Given a dataset of ~10-100 pulsars and O(10^3) timing residuals each, producing a posterior distribution for the stochastic gravitational wave background (SGWB) can take days to a week. The computational bottleneck arises because the likelihood evaluation required for MCMC is extremely costly when considering the dimensionality of the search space. Fortunately, generating simulated data is fast, so modern simulation-based inference techniques can be brought to bear on the problem. In this paper, we demonstrate how conditional normalizing flows trained on simulated data can be used for extremely fast and accurate estimation of the SGWB posteriors, reducing the sampling time from weeks to a matter of seconds.
摘要
Dynamic financial processes identification using sparse regressive reservoir computers
paper_authors: Fredy Vides, Idelfonso B. R. Nogueira, Lendy Banegas, Evelyn Flores
For: 本文研究结构矩阵近似理论,应用于财经系统动态过程的回归表示。* Methods: 使用非线性时间延迟嵌入、稀疏最小二乘和结构矩阵近似方法来探索财经系统的输出封顶矩阵的近似表示。* Results: 通过应用上述技术,可以实现财经系统动态过程的近似识别和预测,包括可能或可能不具有混沌行为的场景。Abstract
In this document, we present key findings in structured matrix approximation theory, with applications to the regressive representation of dynamic financial processes. Initially, we explore a comprehensive approach involving generic nonlinear time delay embedding for time series data extracted from a financial or economic system under examination. Subsequently, we employ sparse least-squares and structured matrix approximation methods to discern approximate representations of the output coupling matrices. These representations play a pivotal role in establishing the regressive models corresponding to the recursive structures inherent in a given financial system. The document further introduces prototypical algorithms that leverage the aforementioned techniques. These algorithms are demonstrated through applications in approximate identification and predictive simulation of dynamic financial and economic processes, encompassing scenarios that may or may not exhibit chaotic behavior.
摘要
在本文中,我们介绍了结构化矩阵近似理论的关键发现,并应用于金融或经济系统中的回归表现力学过程的重构表示。我们首先探讨了一种通用非线性时间延迟嵌入方法,用于从金融或经济系统中提取时间序列数据。然后,我们使用稀疏最小二乘和结构矩阵近似方法来推导出输出 coupling 矩阵的近似表示。这些表示在建立金融系统中的回归模型中扮演关键角色。文章还介绍了一些原型算法,这些算法利用上述技术来实现精度的回归预测和模拟。这些算法在不同的金融和经济过程中的应用中得到了证明。
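Two of the ingredients mentioned above, a nonlinear time-delay embedding and a sparse least-squares fit, can be sketched in a few lines: embed a scalar series into lagged coordinates and fit a Lasso model for the one-step-ahead map. The toy series and the lag/penalty choices are illustrative, not the structured matrix machinery developed in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy "financial" series: noisy AR(2)-like dynamics.
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 1.6 * x[t - 1] - 0.7 * x[t - 2] + 0.05 * rng.standard_normal()

lag = 5
X = np.column_stack([x[k:len(x) - lag + k] for k in range(lag)])   # delay-embedded inputs
y = x[lag:]                                                        # next value to predict

model = Lasso(alpha=1e-3).fit(X, y)
print("sparse coefficients:", model.coef_.round(3))   # the two true lags should dominate
```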
Automatic prediction of mortality in patients with mental illness using electronic health records
methods: 使用predictive machine-learning models with electronic health records (EHR),包括Logistic Regression、Random Forest、Support Vector Machine和K-Nearest Neighbors四种机器学习算法
results: Random Forest和Support Vector Machine模型表现最佳,AUC分数为0.911,Feature importance分析显示 morphine sulfate等药物具有预测作用。Abstract
Mental disorders impact the lives of millions of people globally, not only impeding their day-to-day lives but also markedly reducing life expectancy. This paper addresses the persistent challenge of predicting mortality in patients with mental diagnoses using predictive machine-learning models with electronic health records (EHR). Data from patients with mental disease diagnoses were extracted from the well-known clinical MIMIC-III data set utilizing demographic, prescription, and procedural information. Four machine learning algorithms (Logistic Regression, Random Forest, Support Vector Machine, and K-Nearest Neighbors) were used, with results indicating that Random Forest and Support Vector Machine models outperformed others, with AUC scores of 0.911. Feature importance analysis revealed that drug prescriptions, particularly Morphine Sulfate, play a pivotal role in prediction. We applied a variety of machine learning algorithms to predict 30-day mortality followed by feature importance analysis. This study can be used to assist hospital workers in identifying at-risk patients to reduce excess mortality.
摘要
精神疾病影响着全球数百万人的生活,不仅妨碍他们的日常生活,还会显著缩短预期寿命。本文利用预测性机器学习模型和电子健康记录(EHR),研究预测患有精神疾病诊断的患者死亡率这一长期挑战。我们从著名的临床MIMIC-III数据集中提取了患有精神疾病诊断的患者数据,使用了人口学、处方和操作信息,并采用了Logistic Regression、Random Forest、Support Vector Machine和K-Nearest Neighbors四种机器学习算法。结果表明,Random Forest和Support Vector Machine模型表现最佳,AUC分数为0.911。特征重要性分析显示,药物处方,特别是硫酸吗啡(Morphine Sulfate),在预测中起着关键作用。我们应用多种机器学习算法预测30天内死亡,并进行了特征重要性分析。该研究可以帮助医院工作人员识别高风险患者,从而减少额外死亡。
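A schematic version of this modeling pipeline on synthetic data (the MIMIC-III records require credentialed access) is shown below: train the four classifier families and report test AUC. The generated features and labels are placeholders and the scores will not match the paper's 0.911.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name:>18}: AUC = {auc:.3f}")
```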
MMD-based Variable Importance for Distributional Random Forest
for: 这篇论文目的是提出一种基于森林方法的全Conditional分布估计方法,用于 Multivariate output of interest 的输入变量。
methods: 该论文使用了 Drop and relearn 原理和MMD距离来实现变量重要性度量,而传统的重要性度量仅检测输出均值的影响变量。
results: 引入的重要性度量是一致的,在实际数据和模拟数据上具有高效性,并且超越竞争者。特别是,该算法可以通过回归特征减少来选择变量,从而提供高精度的Conditional输出分布估计。Abstract
Distributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions.
摘要
我们使用 Distributional Random Forest(DRF)来估计在给定输入变量下多变量输出的完整条件分布。在本文中,我们提出了一种基于"丢弃并重新学习"(drop and relearn)原则与MMD距离的DRF变量重要性算法。传统的重要性度量只能检测影响输出均值的变量,而我们的算法能更一般地检测影响输出分布的变量。我们证明了该重要性度量的一致性,并在真实数据和模拟数据上展示了其优异的经验表现,优于现有的竞争方法。特别是,我们的算法可以通过递归特征消除高效地选择变量,从而用较小的变量集合建立高精度的条件输出分布估计。
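The drop-and-relearn recipe with an MMD-style distance can be sketched as follows, using an ordinary random forest as a stand-in for DRF and comparing predicted outputs rather than full conditional distributions; it conveys the mechanics only, not the paper's exact estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mmd2(a, b, gamma=1.0):
    """Squared MMD between two samples under an RBF kernel."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

rng = np.random.default_rng(0)
n, p = 800, 6
X = rng.standard_normal((n, p))
y = np.column_stack([X[:, 0] + 0.5 * X[:, 1] ** 2, X[:, 1] - X[:, 2]])  # only X0..X2 matter
y += 0.1 * rng.standard_normal(y.shape)

full = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ref = full.predict(X)

importance = []
for j in range(p):
    X_drop = np.delete(X, j, axis=1)                   # drop variable j and relearn
    reduced = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_drop, y)
    importance.append(mmd2(ref, reduced.predict(X_drop)))
print(np.round(importance, 4))                          # larger = more important
```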
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
paper_authors: Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré
results: 作为证明,该论文在三个领域中explored M2 的性能:非 causal BERT 样式语言模型、ViT 样式图像分类和 causal GPT 样式语言模型。在非 causal BERT 样式模型中,M2 与 BERT-base 和 BERT-large 相比,在下游 GLUE 质量上具有相同的性能,并且可以达到更高的通过put 性能(最高达 9.1 倍)。在 ImageNet 上,M2 超过 ViT-b 的准确率,仅使用半个参数。在 causal GPT 样式模型中,M2 可以与 Transformer 相比,在 360M 参数的预训练质量上具有相同的性能。Abstract
Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these axes. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1$\times$ higher throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in accuracy, with only half the parameters. Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE--showing for the first time that it may be possible to match Transformer quality without attention or MLPs.
摘要
机器学习模型正同时在序列长度和模型维度上扩展,以获得更长的上下文和更好的性能。然而,现有架构(如 Transformer)在这两个轴上的复杂度都是二次的。我们提出的问题是:是否存在能在序列长度和模型维度上以次二次方式扩展的高性能架构?我们介绍了 Monarch Mixer(M2),一种在序列长度和模型维度上使用同一种次二次基本算子的新架构:Monarch 矩阵。这是一类简单而富有表达力的结构化矩阵,能够表示许多线性变换,在 GPU 上具有很高的硬件效率,并能以次二次方式扩展。作为概念验证,我们在三个领域探索了 M2 的性能:非因果的 BERT 风格语言建模、ViT 风格图像分类和因果的 GPT 风格语言建模。在非因果 BERT 风格建模中,M2 在下游 GLUE 质量上与 BERT-base 和 BERT-large 相当,参数最多减少 27%,并在序列长度 4K 下实现最高 9.1 倍的吞吐量。在 ImageNet 上,M2 的准确率超过 ViT-b 1%,而参数只有其一半。因果的 GPT 风格模型带来一个技术挑战:通过掩码强制因果性会引入二次瓶颈。为缓解这一瓶颈,我们基于多元多项式求值与插值提出了对 Monarch 矩阵的新理论视角,使我们能够将 M2 参数化为因果的,同时保持次二次复杂度。利用这种参数化,M2 在 3.6 亿参数规模下的 The PILE 预训练困惑度上与 GPT 风格 Transformer 相当,这首次表明不使用注意力或 MLP 也可能达到 Transformer 的质量。
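A minimal sketch of a Monarch-style structured multiply (block-diagonal multiply, a fixed transpose permutation, then a second block-diagonal multiply on an input of length n = m^2) is shown below; it uses about 2*m^3 = 2*n^1.5 parameters, hence sub-quadratic. The precise parameterization in M2, and its causal variant, are more involved than this.

```python
import torch

def monarch_matmul(x, blk1, blk2):
    """Apply a Monarch-style structured matrix to x.

    x:    (..., n) with n = m * m
    blk1: (m, m, m) -- m dense blocks of size m x m (first block-diagonal factor)
    blk2: (m, m, m) -- second block-diagonal factor
    """
    m = blk1.shape[0]
    batch = x.shape[:-1]
    x = x.reshape(*batch, m, m)
    x = torch.einsum("...km,knm->...kn", x, blk1)   # block-diagonal multiply
    x = x.transpose(-1, -2)                         # fixed permutation between the factors
    x = torch.einsum("...km,knm->...kn", x, blk2)   # second block-diagonal multiply
    return x.reshape(*batch, m * m)

m = 16                                              # n = 256; parameters: 2 * m^3 = 8192 << n^2
x = torch.randn(4, m * m)
blk1, blk2 = torch.randn(m, m, m), torch.randn(m, m, m)
print(monarch_matmul(x, blk1, blk2).shape)          # torch.Size([4, 256])
```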
For: The paper is written for investigating brokerage between traders from an online learning perspective, with a focus on the case where there are no designated buyer and seller roles.* Methods: The paper uses online learning techniques to achieve a low regret bound in the brokerage problem, specifically providing algorithms achieving regret $M \log T$ and $\sqrt{M T}$ under different assumptions about the agents’ valuations.* Results: The paper shows that the optimal regret rate is $M \log T$ when the agents’ valuations are revealed after each interaction, and $\sqrt{M T}$ when only their willingness to sell or buy at the proposed price is revealed. Additionally, the paper demonstrates that the optimal rate degrades to $\sqrt{T}$ when the bounded density assumption is dropped.Abstract
We investigate brokerage between traders from an online learning perspective. At any round $t$, two traders arrive with their private valuations, and the broker proposes a trading price. Unlike other bilateral trade problems already studied in the online learning literature, we focus on the case where there are no designated buyer and seller roles: each trader will attempt to either buy or sell depending on the current price of the good. We assume the agents' valuations are drawn i.i.d. from a fixed but unknown distribution. If the distribution admits a density bounded by some constant $M$, then, for any time horizon $T$: $\bullet$ If the agents' valuations are revealed after each interaction, we provide an algorithm achieving regret $M \log T$ and show this rate is optimal, up to constant factors. $\bullet$ If only their willingness to sell or buy at the proposed price is revealed after each interaction, we provide an algorithm achieving regret $\sqrt{M T}$ and show this rate is optimal, up to constant factors. Finally, if we drop the bounded density assumption, we show that the optimal rate degrades to $\sqrt{T}$ in the first case, and the problem becomes unlearnable in the second.
摘要
我们研究在线学习中的经纪人交易。在任意的回合 $t$ 中,两个经纪人会 arrive WITH 他们的私人估价,经纪人会提议交易价格。与其他双方贸易问题已经在在线学习文献中研究过的不同,我们专注于情况下没有指定的买方和卖方角色:每个经纪人都会尝试 Either 购买或卖出,根据当前商品价格。 我们假设经纪人的估价是从固定而 unknown 的分布中随机样本。如果该分布具有最大值 $M$,那么,对于任意的时间 horizon $T$:❝ 如果经纪人的估价在每次交互后公布,我们提供了一个算法,其 regret 为 $M \log T$,并证明这个率是最佳的,占常数因子。❞❝ 如果只有经纪人对于提议价格的愿意性被公布在每次交互后,我们提供了一个算法,其 regret 为 $\sqrt{M T}$,并证明这个率是最佳的,占常数因子。❞最后,如果我们取消了均勋度 bound 的假设,我们显示了最佳率下降到 $\sqrt{T}$ 在第一个情况下,并问题变得不可学习在第二个情况下。
On the latent dimension of deep autoencoders for reduced order modeling of PDEs parametrized by random fields
results: 本文提供了关于DL-ROMs在随机场中的理论分析,并提供了可导的错误 bound,以帮助域专家在选择深度学习自动编码器的缓存维度时进行优化。 数据示例表明,本文的分析对DL-ROMs的性能产生了显著的影响。Abstract
Deep Learning is having a remarkable impact on the design of Reduced Order Models (ROMs) for Partial Differential Equations (PDEs), where it is exploited as a powerful tool for tackling complex problems for which classical methods might fail. In this respect, deep autoencoders play a fundamental role, as they provide an extremely flexible tool for reducing the dimensionality of a given problem by leveraging on the nonlinear capabilities of neural networks. Indeed, starting from this paradigm, several successful approaches have already been developed, which are here referred to as Deep Learning-based ROMs (DL-ROMs). Nevertheless, when it comes to stochastic problems parameterized by random fields, the current understanding of DL-ROMs is mostly based on empirical evidence: in fact, their theoretical analysis is currently limited to the case of PDEs depending on a finite number of (deterministic) parameters. The purpose of this work is to extend the existing literature by providing some theoretical insights about the use of DL-ROMs in the presence of stochasticity generated by random fields. In particular, we derive explicit error bounds that can guide domain practitioners when choosing the latent dimension of deep autoencoders. We evaluate the practical usefulness of our theory by means of numerical experiments, showing how our analysis can significantly impact the performance of DL-ROMs.
摘要
深度学习对减少顺序模型(ROMs)的设计产生了深刻的影响,特别是在解决复杂问题上,其中经典方法可能会失败时。在这个情况下,深度自适应神经网络扮演了非常重要的角色,因为它们可以通过神经网络的非线性能力来减少问题的维度。从这个角度出发,已经有许多成功的方法被开发出来,这些方法被称为深度学习基于ROMs(DL-ROMs)。然而,当面临随机场所 parametrized 的问题时,现有的理论分析仅限于具有固定数量的 deterministic 参数的PDEs。本文的目的是扩展现有的文献,提供关于DL-ROMs在随机场所下的理论分析。特别是,我们 derive 了明确的错误 bound,可以帮助域专家在选择深度自适应神经网络的缓存维度时作出决策。我们通过数值实验证明了我们的理论对DL-ROMs的性能产生了显著的影响。
Contributing Components of Metabolic Energy Models to Metabolic Cost Estimations in Gait
paper_authors: Markus Gambietz, Marlies Nitschke, Jörg Miehling, Anne Koelewijn
for: 这个研究旨在深入理解人类行走中的代谢能量消耗模型,以便更好地估计代谢能量消耗。
methods: 我们使用了四种代谢能量消耗模型的参数进行 Monte Carlo 敏感分析,然后分析了这些参数的敏感指数、生理上的Context和生理过程中的代谢率。最终选择了一个 quasi-优化的模型。在第二步,我们 investigate了输入参数和变量的重要性,通过使用不同的输入特征来训练神经网络。
results: 我们发现,力量相关的参数在敏感分析中最为重要,而神经网络基于的输入特征选择也显示了承诺。然而,我们发现,使用神经网络模型的代谢能量消耗估计并没有达到传统模型的准确性。Abstract
Objective: As metabolic cost is a primary factor influencing humans' gait, we want to deepen our understanding of metabolic energy expenditure models. Therefore, this paper identifies the parameters and input variables, such as muscle or joint states, that contribute to accurate metabolic cost estimations. Methods: We explored the parameters of four metabolic energy expenditure models in a Monte Carlo sensitivity analysis. Then, we analysed the model parameters by their calculated sensitivity indices, physiological context, and the resulting metabolic rates during the gait cycle. The parameter combination with the highest accuracy in the Monte Carlo simulations represented a quasi-optimized model. In the second step, we investigated the importance of input parameters and variables by analysing the accuracy of neural networks trained with different input features. Results: Power-related parameters were most influential in the sensitivity analysis and the neural network-based feature selection. We observed that the quasi-optimized models produced negative metabolic rates, contradicting muscle physiology. Neural network-based models showed promising abilities but have been unable to match the accuracy of traditional metabolic energy expenditure models. Conclusion: We showed that power-related metabolic energy expenditure model parameters and inputs are most influential during gait. Furthermore, our results suggest that neural network-based metabolic energy expenditure models are viable. However, bigger datasets are required to achieve better accuracy. Significance: As there is a need for more accurate metabolic energy expenditure models, we explored which musculoskeletal parameters are essential when developing a model to estimate metabolic energy.
摘要
方法:我们在四种代谢能耗模型中进行了Monte Carlo敏感分析,然后分析了模型参数的计算敏感度指数、生理上的文脉和代谢过程中的代谢率。在Monte Carlo优化中,我们选择了最佳的参数组合,并在第二步中,通过不同输入特征的分析,了解输入参数和变量的重要性。结果:在敏感分析中,力量相关的参数具有最大的影响力,而神经网络基于的特征选择也显示了扩展的能力。然而,我们发现,在许多情况下,神经网络模型的准确性不如传统的代谢能耗模型。结论:我们发现,在步行过程中,力量相关的代谢能耗模型参数和输入变量具有最大的影响力。此外,我们的结果表明,神经网络基于的代谢能耗模型是可行的,但需要更大的数据来达到更高的准确性。重要性:由于代谢成本的估计是一个需要更加准确的问题,我们在这篇论文中探讨了 Musculoskeletal 参数是如何影响代谢能耗模型的。
Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks
results: 本文发现了两种无形网络架构在初始化时的相同架构准确性限制,并且对无形MLP网络的层次相关性进行了第一项级准确性修正。这些结果表明了无形网络和形态activation函数之间的连接,并开 up了研究正则化方法和形态activation函数之间的关系的可能性。Abstract
Recent analyses of neural networks with shaped activations (i.e. the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about "ordinary" unshaped networks, where the activation is unchanged as the network size grows. In this article, we find similar differential equation based asymptotic characterization for two types of unshaped networks. Firstly, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization: (i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth. (ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$. Secondly, for an unshaped MLP at initialization, we derive the first order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$. These results together provide a connection between shaped and unshaped network architectures, and opens up the possibility of studying the effect of normalization methods and how it connects with shaping activation functions.
摘要
近期的分析表明,在神经网络中使用扩展 activation function(即网络大小增长时Activation function也随着增长)会导致分析限制,这些结果并不直接告诉我们关于“常规”无形网络(即Activation function不变化与网络大小增长)的 anything。在这篇文章中,我们发现了两种类型的无形网络的极限性特征,即:首先,我们证明了以下两个架构在初始化时 converges to the same infinite-depth-and-width limit:(i)一个具有 $d^{-1/2}$ 因子的完全连接 ResNet,其中 $d$ 是网络深度。(ii)一个具有 $d \ll n$ 的多层感知器(MLP),其中 $d$ 是网络深度, activation 是 $d^{-1/2}$ 的折叠函数。其次,对于无形 MLP 的初始化,我们 derive the first order asymptotic correction to the layerwise correlation。 Specifically, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$.这些结果共同表明了无形和形 activation function 之间的连接,并开放了研究正规化方法和 activation function 的拟合方面的可能性。
Transformers for scientific data: a pedagogical review for astronomers
results: 论文介绍了自注意机制的数学基础和transformers的应用在时间序列和图像数据中的成果。Note: The above information is in Simplified Chinese text.Abstract
The deep learning architecture associated with ChatGPT and related generative AI products is known as transformers. Initially applied to Natural Language Processing, transformers and the self-attention mechanism they exploit have gained widespread interest across the natural sciences. The goal of this pedagogical and informal review is to introduce transformers to scientists. The review includes the mathematics underlying the attention mechanism, a description of the original transformer architecture, and a section on applications to time series and imaging data in astronomy. We include a Frequently Asked Questions section for readers who are curious about generative AI or interested in getting started with transformers for their research problem.
摘要
与ChatGPT和相关的生成AI产品相关的深度学习架构被称为transformers。初始应用于自然语言处理,transformers和它们利用的自注意机制已经在自然科学领域引起了广泛的关注。本文的教学和非正式评论的目的是引入transformers给科学家。文中包括自注意机制的数学基础、原始transformer架构的描述和在天文学中对时间序列和图像数据的应用。我们附加了关于生成AI或想要使用transformers解决研究问题的常见问题 section。
Learning Gradient Fields for Scalable and Generalizable Irregular Packing
for: solves the packing problem with irregularly shaped pieces, minimizing waste and avoiding overlap, using machine learning and conditional generative modeling.
methods: employs the score-based diffusion model to learn gradient fields that encode constraint satisfaction and spatial relationships, and uses a coarse-to-fine refinement mechanism to generate packing solutions.
results: demonstrates spatial utilization rates comparable to or surpassing those achieved by the teacher algorithm, and exhibits some level of generalization to shape variations.Abstract
The packing problem, also known as cutting or nesting, has diverse applications in logistics, manufacturing, layout design, and atlas generation. It involves arranging irregularly shaped pieces to minimize waste while avoiding overlap. Recent advances in machine learning, particularly reinforcement learning, have shown promise in addressing the packing problem. In this work, we delve deeper into a novel machine learning-based approach that formulates the packing problem as conditional generative modeling. To tackle the challenges of irregular packing, including object validity constraints and collision avoidance, our method employs the score-based diffusion model to learn a series of gradient fields. These gradient fields encode the correlations between constraint satisfaction and the spatial relationships of polygons, learned from teacher examples. During the testing phase, packing solutions are generated using a coarse-to-fine refinement mechanism guided by the learned gradient fields. To enhance packing feasibility and optimality, we introduce two key architectural designs: multi-scale feature extraction and coarse-to-fine relation extraction. We conduct experiments on two typical industrial packing domains, considering translations only. Empirically, our approach demonstrates spatial utilization rates comparable to, or even surpassing, those achieved by the teacher algorithm responsible for training data generation. Additionally, it exhibits some level of generalization to shape variations. We are hopeful that this method could pave the way for new possibilities in solving the packing problem.
摘要
装箱(packing)问题,也称为切割或嵌套问题,在物流、制造、布局设计和图集生成中有广泛的应用。它涉及将不规则形状的物件进行排布,以最小化废料并避免重叠。机器学习(特别是强化学习)的最新进展为解决装箱问题带来了新的思路。在这项工作中,我们更深入地探讨一种基于机器学习的新方法,将装箱问题表述为条件生成建模。为了解决不规则装箱中的挑战,包括物体有效性约束和碰撞避免,我们的方法使用基于分数的扩散模型来学习一系列梯度场。这些梯度场编码了从教师示例中学到的约束满足与多边形之间空间关系的关联。在测试阶段,我们在学习到的梯度场的引导下,使用由粗到细的精化机制生成装箱解。为了提高装箱的可行性和最优性,我们引入了两个关键的架构设计:多尺度特征提取和由粗到细的关系提取。我们在两个典型的工业装箱领域进行了实验,只考虑平移。实验结果表明,我们的方法可以达到与负责生成训练数据的教师算法相当甚至更高的空间利用率。此外,它对形状变化也表现出一定的泛化能力。我们希望这种方法能为解决装箱问题开辟新的可能性。
Understanding Reward Ambiguity Through Optimal Transport Theory in Inverse Reinforcement Learning
for: 这 paper 的中心目标是寻找在观察到的专家行为中隐藏的奖励函数,以便不仅解释数据,还能够泛化到未经见过的情况。
methods: 这 paper 使用 optimal transport (OT) 理论,提供了一种新的视角来解决高维问题和奖励不确定性的问题。
results: 这 paper 的研究发现,通过 Wasserstein 距离来衡量奖励不确定性,并提供了一种中心表示或中心函数的确定方法,这些发现可以为高维 setting 中的 robust IRL 方法提供一种结构化的途径。Abstract
In inverse reinforcement learning (IRL), the central objective is to infer underlying reward functions from observed expert behaviors in a way that not only explains the given data but also generalizes to unseen scenarios. This ensures robustness against reward ambiguity where multiple reward functions can equally explain the same expert behaviors. While significant efforts have been made in addressing this issue, current methods often face challenges with high-dimensional problems and lack a geometric foundation. This paper harnesses the optimal transport (OT) theory to provide a fresh perspective on these challenges. By utilizing the Wasserstein distance from OT, we establish a geometric framework that allows for quantifying reward ambiguity and identifying a central representation or centroid of reward functions. These insights pave the way for robust IRL methodologies anchored in geometric interpretations, offering a structured approach to tackle reward ambiguity in high-dimensional settings.
摘要
逆向强化学习(IRL)的核心目标是从观察到的专家行为中推断潜在的奖励函数,使其不仅能解释给定数据,还能推广到未见过的情形。这可以确保对奖励歧义的鲁棒性,即多个奖励函数可以同样好地解释相同的专家行为。尽管已有大量工作致力于解决这一问题,现有方法在高维问题上仍面临挑战,并且缺乏几何基础。本文利用最优传输(OT)理论,为这些挑战提供了新的视角。通过使用OT中的Wasserstein距离,我们建立了一个几何框架,用于量化奖励歧义并识别奖励函数的中心表示或质心。这些见解为基于几何解释的鲁棒IRL方法铺平了道路,为在高维设置下处理奖励歧义提供了一种结构化途径。
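A small 1-D illustration of the OT viewpoint: compare candidate reward functions by the distribution of rewards they assign to a shared set of states using the Wasserstein distance, and pick the most central candidate (the medoid) as a crude stand-in for a centroid. This is a simplification of the paper's framework, and the candidate rewards are synthetic.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
states = rng.standard_normal((500, 3))

# Three candidate reward functions that could explain the same expert behavior (synthetic).
rewards = [
    states @ np.array([1.0, 0.5, 0.0]),
    states @ np.array([0.9, 0.6, 0.1]) + 0.05 * rng.standard_normal(500),
    states @ np.array([0.2, 1.0, -0.5]),
]

k = len(rewards)
D = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        D[i, j] = wasserstein_distance(rewards[i], rewards[j])

medoid = int(D.sum(axis=1).argmin())   # most central candidate under pairwise W1 distances
print(np.round(D, 3), "\ncentral candidate:", medoid)
```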
Applications of ML-Based Surrogates in Bayesian Approaches to Inverse Problems
paper_authors: Pelin Ersin, Emma Hayes, Peter Matthews, Paramjyoti Mohapatra, Elisa Negrini, Karl Schulz
for: 寻找波源位置在方正区域的逆问题,给出噪音解的解决方案。
methods: 使用神经网络作为代理模型,提高计算效率,使得Markov Chain Monte Carlo方法可以用于评估 posterior 分布中的源位置。
results: 通过寻找波源位置的方法,可以准确地从噪音数据中提取源位置信息。Abstract
Neural networks have become a powerful tool as surrogate models to provide numerical solutions for scientific problems with increased computational efficiency. This efficiency can be advantageous for numerically challenging problems where time to solution is important or when evaluation of many similar analysis scenarios is required. One particular area of scientific interest is the setting of inverse problems, where one knows the forward dynamics of a system are described by a partial differential equation and the task is to infer properties of the system given (potentially noisy) observations of these dynamics. We consider the inverse problem of inferring the location of a wave source on a square domain, given a noisy solution to the 2-D acoustic wave equation. Under the assumption of Gaussian noise, a likelihood function for source location can be formulated, which requires one forward simulation of the system per evaluation. Using a standard neural network as a surrogate model makes it computationally feasible to evaluate this likelihood several times, and so Markov Chain Monte Carlo methods can be used to evaluate the posterior distribution of the source location. We demonstrate that this method can accurately infer source-locations from noisy data.
摘要
神经网络已成为科学问题的强大替代模型工具,能以更高的计算效率提供数值解。这种效率在求解时间很重要的数值难题,或需要评估许多相似分析场景时尤为有利。一个特别受关注的科学领域是反问题:已知系统的正向动力学由偏微分方程描述,任务是根据对这些动力学的(可能含噪声的)观测来推断系统的性质。我们考虑的反问题是:在给定二维声波方程的含噪解的情况下,推断方形区域上波源的位置。在高斯噪声假设下,可以构造关于波源位置的似然函数,每次评估都需要对系统进行一次正向模拟。使用标准神经网络作为替代模型,使得多次评估该似然函数在计算上变得可行,从而可以使用马尔可夫链蒙特卡洛方法来评估波源位置的后验分布。我们证明了这种方法能够从含噪数据中准确推断波源位置。
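The workflow can be condensed into a short script: train a neural surrogate that maps a source location to sensor readings, then run random-walk Metropolis with a Gaussian likelihood built on the surrogate. The "forward model" below is a cheap analytic stand-in for the 2-D acoustic wave solver, so the example only mirrors the structure of the method.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
sensors = rng.uniform(0, 1, size=(8, 2))

def forward(src):
    """Cheap stand-in for the wave solver: amplitude decaying with distance to each sensor."""
    d = np.linalg.norm(sensors - src, axis=1)
    return np.exp(-3.0 * d)

# Train the surrogate on simulated (source, observation) pairs.
train_src = rng.uniform(0, 1, size=(2000, 2))
train_obs = np.array([forward(s) for s in train_src])
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(train_src, train_obs)

# Noisy observation from a "true" source, and a Gaussian log-likelihood via the surrogate.
true_src, sigma = np.array([0.3, 0.7]), 0.02
y_obs = forward(true_src) + sigma * rng.normal(size=len(sensors))
def log_like(src):
    pred = surrogate.predict(src.reshape(1, -1))[0]
    return -0.5 * np.sum((y_obs - pred) ** 2) / sigma**2

# Random-walk Metropolis over the source location (uniform prior on the unit square).
theta, samples = np.array([0.5, 0.5]), []
ll = log_like(theta)
for _ in range(5000):
    prop = theta + 0.05 * rng.normal(size=2)
    if np.all((prop >= 0) & (prop <= 1)):
        ll_prop = log_like(prop)
        if np.log(rng.random()) < ll_prop - ll:
            theta, ll = prop, ll_prop
    samples.append(theta.copy())
print("posterior mean estimate:", np.mean(samples[2500:], axis=0).round(3))
```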
Conformal Drug Property Prediction with Density Estimation under Covariate Shift
paper_authors: Siddhartha Laghuvarapu, Zhen Lin, Jimeng Sun
for: This paper aims to address the challenge of obtaining reliable uncertainty estimates in drug discovery tasks using Conformal Prediction (CP) and to provide valid prediction sets for molecular properties with a coverage guarantee.
methods: The proposed method, CoDrug, employs an energy-based model leveraging both training data and unlabelled data, and Kernel Density Estimation (KDE) to assess the densities of a molecule set. The estimated densities are then used to weigh the molecule samples while building prediction sets and rectifying for distribution shift.
results: In extensive experiments involving realistic distribution drifts in various small-molecule drug discovery tasks, CoDrug was shown to provide valid prediction sets and to reduce the coverage gap by over 35% when compared to conformal prediction sets not adjusted for covariate shift.Abstract
In drug discovery, it is vital to confirm the predictions of pharmaceutical properties from computational models using costly wet-lab experiments. Hence, obtaining reliable uncertainty estimates is crucial for prioritizing drug molecules for subsequent experimental validation. Conformal Prediction (CP) is a promising tool for creating such prediction sets for molecular properties with a coverage guarantee. However, the exchangeability assumption of CP is often challenged with covariate shift in drug discovery tasks: Most datasets contain limited labeled data, which may not be representative of the vast chemical space from which molecules are drawn. To address this limitation, we propose a method called CoDrug that employs an energy-based model leveraging both training data and unlabelled data, and Kernel Density Estimation (KDE) to assess the densities of a molecule set. The estimated densities are then used to weigh the molecule samples while building prediction sets and rectifying for distribution shift. In extensive experiments involving realistic distribution drifts in various small-molecule drug discovery tasks, we demonstrate the ability of CoDrug to provide valid prediction sets and its utility in addressing the distribution shift arising from de novo drug design models. On average, using CoDrug can reduce the coverage gap by over 35% when compared to conformal prediction sets not adjusted for covariate shift.
摘要
在药物发现中,需要通过昂贵的湿实验来确认计算模型对药物性质的预测。因此,获得可靠的不确定性估计对于确定哪些药物分子应优先进行后续实验验证至关重要。Conformal Prediction(CP)是一种有前景的工具,可以为分子性质构建具有覆盖率保证的预测集。然而,在药物发现任务中,CP的可交换性假设常常受到协变量偏移的挑战:大多数数据集只包含有限的标注数据,这些数据可能无法代表分子所来自的巨大化学空间。为解决这一限制,我们提出了一种名为CoDrug的方法,它利用同时基于训练数据和未标注数据的能量模型,以及核密度估计(KDE)来评估分子集合的密度。然后,在构建预测集时利用估计的密度对分子样本加权,以校正分布偏移。在涉及多种小分子药物发现任务中真实分布漂移的大量实验中,我们证明了CoDrug能够提供有效的预测集,并能应对de novo药物设计模型带来的分布偏移。平均而言,与未针对协变量偏移进行调整的conformal预测集相比,使用CoDrug可以将覆盖率差距减少超过35%。
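A hedged sketch of covariate-shift-weighted split conformal prediction is given below: estimate density-ratio weights with two KDEs, weight the calibration nonconformity scores, and take a weighted quantile to form prediction intervals. Simple synthetic regression data replace CoDrug's molecular features, and plain KDE replaces its energy-based density model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    x = rng.normal(loc=shift, size=(n, 1))
    y = np.sin(2 * x[:, 0]) + 0.2 * rng.normal(size=n)
    return x, y

X_train, y_train = sample(1500)                    # training / calibration distribution
X_test, y_test = sample(500, shift=1.0)            # covariate-shifted test distribution

X_fit, X_cal = X_train[:1000], X_train[1000:]
y_fit, y_cal = y_train[:1000], y_train[1000:]

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)
scores = np.abs(y_cal - model.predict(X_cal))      # nonconformity scores

# Density-ratio weights w(x) ~ p_test(x) / p_train(x) via two KDEs (unlabelled test covariates only).
kde_tr = KernelDensity(bandwidth=0.3).fit(X_fit)
kde_te = KernelDensity(bandwidth=0.3).fit(X_test)
def w(x):
    return np.exp(kde_te.score_samples(x) - kde_tr.score_samples(x))

alpha, w_cal = 0.1, w(X_cal)
covered = []
for x, y in zip(X_test, y_test):
    wt = w(x.reshape(1, -1))[0]
    p = np.append(w_cal, wt) / (w_cal.sum() + wt)   # weights on calibration scores + test point
    aug = np.append(scores, np.inf)
    order = np.argsort(aug)
    cdf = np.cumsum(p[order])
    q = aug[order][np.searchsorted(cdf, 1 - alpha)]  # weighted (1 - alpha)-quantile
    pred = model.predict(x.reshape(1, -1))[0]
    covered.append(abs(y - pred) <= q)
print(f"empirical coverage under covariate shift: {np.mean(covered):.3f} (target {1 - alpha})")
```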
Exact and efficient solutions of the LMC Multitask Gaussian Process model
for: 这个论文是关于多任务 Gaussian process regression 或分类的一种非常通用的模型,它的表达能力和概念简单性很吸引人。但是,直接实现方式的复杂性是 cubic 在数据点和任务数量的平方,这意味着大多数应用中需要使用简化方法。然而,最近的研究表明,在某些条件下,模型的隐藏过程可以分离,从而实现 linear 复杂性。
results: 我们在合成数据上进行了参数研究,结果表明,与不受限制的精确 LMC 及其近似方法相比,我们的方法表现出色。总之,projected LMC 模型是当前最先进模型的一种可靠且更简单的替代方案,它可以大大简化留一法交叉验证和 fantasization 等计算。Abstract
The Linear Model of Co-regionalization (LMC) is a very general model of multitask gaussian process for regression or classification. While its expressivity and conceptual simplicity are appealing, naive implementations have cubic complexity in the number of datapoints and number of tasks, making approximations mandatory for most applications. However, recent work has shown that under some conditions the latent processes of the model can be decoupled, leading to a complexity that is only linear in the number of said processes. We here extend these results, showing from the most general assumptions that the only condition necessary to an efficient exact computation of the LMC is a mild hypothesis on the noise model. We introduce a full parametrization of the resulting \emph{projected LMC} model, and an expression of the marginal likelihood enabling efficient optimization. We perform a parametric study on synthetic data to show the excellent performance of our approach, compared to an unrestricted exact LMC and approximations of the latter. Overall, the projected LMC appears as a credible and simpler alternative to state-of-the art models, which greatly facilitates some computations such as leave-one-out cross-validation and fantasization.
摘要
协同区域化线性模型(Linear Model of Co-regionalization,LMC)是一种非常通用的多任务 Gaussian 过程回归或分类模型。虽然其表达能力和概念简洁很吸引人,但朴素实现的复杂度随数据点数和任务数呈三次方增长,使得大多数应用中必须采用近似方法。然而,最近的研究表明,在某些条件下,模型的隐藏过程可以解耦,从而使复杂度仅随这些隐藏过程的数量线性增长。我们在这里扩展了这些结果,从最一般的假设出发,证明实现 LMC 高效精确计算所需的唯一条件,是对噪声模型的一个温和假设。我们给出了所得 projected LMC 模型的完整参数化,以及可用于高效优化的边缘似然表达式。我们在synthetic数据上进行了参数研究,结果表明相比于未受限制的精确 LMC 及其各种近似方法,我们的方法表现出色。总之,projected LMC 模型是一种可靠且更简单的替代方案,它能大大简化留一交叉验证和fantasization等计算。
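For reference, the LMC prior covariance discussed above can be written as a sum of Kronecker products between coregionalization matrices and latent-process kernels; a minimal NumPy sketch of that construction (not the paper's projected/decoupled variant) is shown below, with illustrative kernel choices.

```python
import numpy as np

def rbf(x, lengthscale=1.0):
    """Squared-exponential kernel matrix on 1-D inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def lmc_covariance(x, coreg_vectors, lengthscales):
    """K = sum_q B_q kron k_q(x, x), with rank-1 B_q = a_q a_q^T.

    x             : (n,) input locations shared by all tasks
    coreg_vectors : list of (t,) vectors a_q, one per latent process
    lengthscales  : list of lengthscales, one per latent process
    returns       : (t*n, t*n) covariance over tasks-by-inputs
    """
    K = 0.0
    for a_q, ls in zip(coreg_vectors, lengthscales):
        B_q = np.outer(a_q, a_q)          # rank-1 coregionalization matrix
        K = K + np.kron(B_q, rbf(x, ls))  # couple tasks through B_q
    return K

x = np.linspace(0, 1, 50)
A = [np.array([1.0, 0.5, -0.3]), np.array([0.2, 1.0, 0.8])]  # 3 tasks, 2 latent GPs
K = lmc_covariance(x, A, lengthscales=[0.1, 0.5])
print(K.shape)   # (150, 150): naive exact inference on this matrix is O((t*n)^3)
```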
Nonparametric Discrete Choice Experiments with Machine Learning Guided Adaptive Design
paper_authors: Mingzhang Yin, Ruijiang Gao, Weiran Lin, Steven M. Shugan
for: 这个论文旨在设计满足消费者偏好的产品,以提高企业的成功。
methods: 论文提出了一种名为 Gradient-based Survey (GBS) 的非参数离散选择实验方法,用于多属性产品设计。GBS 根据受访者此前的选择,自适应地构建后续的成对比较问题。
results: 在 simulations 中与现有的参数化和非参数化方法相比,GBS 具有更高的准确率和样本效率。Abstract
Designing products to meet consumers' preferences is essential for a business's success. We propose the Gradient-based Survey (GBS), a discrete choice experiment for multiattribute product design. The experiment elicits consumer preferences through a sequence of paired comparisons for partial profiles. GBS adaptively constructs paired comparison questions based on the respondents' previous choices. Unlike the traditional random utility maximization paradigm, GBS is robust to model misspecification by not requiring a parametric utility model. Cross-pollinating the machine learning and experiment design, GBS is scalable to products with hundreds of attributes and can design personalized products for heterogeneous consumers. We demonstrate the advantage of GBS in accuracy and sample efficiency compared to the existing parametric and nonparametric methods in simulations.
摘要
为了商业成功,根据消费者偏好来设计产品非常重要。我们提出 Gradient-based Survey(GBS),一种用于多属性产品设计的离散选择实验。该实验通过一系列针对部分属性组合(partial profiles)的成对比较来获取消费者偏好。GBS 会根据受访者此前的选择,自适应地构建后续的成对比较问题。与传统的随机效用最大化范式不同,GBS 不需要参数化的效用模型,因此对模型设定错误更为稳健。通过融合机器学习与实验设计,GBS 可以扩展到具有数百个属性的产品,并为异质消费者设计个性化产品。我们通过模拟表明,GBS 在准确性和样本效率方面优于现有的参数化和非参数化方法。
Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models
results: 论文的实验结果显示,使用Vecchia-Laplace近似法和迭代法可以大幅提高估计的速度,并且在一个大型卫星数据集上比起现有方法实现三倍的预测精度。Abstract
Latent Gaussian process (GP) models are flexible probabilistic non-parametric function models. Vecchia approximations are accurate approximations for GPs to overcome computational bottlenecks for large data, and the Laplace approximation is a fast method with asymptotic convergence guarantees to approximate marginal likelihoods and posterior predictive distributions for non-Gaussian likelihoods. Unfortunately, the computational complexity of combined Vecchia-Laplace approximations grows faster than linearly in the sample size when used in combination with direct solver methods such as the Cholesky decomposition. Computations with Vecchia-Laplace approximations thus become prohibitively slow precisely when the approximations are usually the most accurate, i.e., on large data sets. In this article, we present several iterative methods for inference with Vecchia-Laplace approximations which make computations considerably faster compared to Cholesky-based calculations. We analyze our proposed methods theoretically and in experiments with simulated and real-world data. In particular, we obtain a speed-up of an order of magnitude compared to Cholesky-based inference and a threefold increase in prediction accuracy in terms of the continuous ranked probability score compared to a state-of-the-art method on a large satellite data set. All methods are implemented in a free C++ software library with high-level Python and R packages.
摘要
潜在 Gaussian 过程(GP)模型是一种灵活的非参数概率函数模型。Vecchia 近似是 GP 的一种精确近似,可以克服大规模数据带来的计算瓶颈;而 Laplace 近似是一种快速且具有渐近收敛保证的方法,可用于在非 Gaussian 似然下近似边缘似然和后验预测分布。然而,当与 Cholesky 分解等直接求解方法结合使用时,Vecchia-Laplace 组合近似的计算复杂度随样本量的增长快于线性,因此恰恰在近似通常最准确的大数据集上,计算反而变得极其缓慢。在这篇文章中,我们提出了若干用于 Vecchia-Laplace 近似推断的迭代方法,使计算速度远快于基于 Cholesky 的计算。我们对所提方法进行了理论分析,并在模拟数据和真实数据上进行了实验。特别地,在一个大型卫星数据集上,我们获得了相比基于 Cholesky 推断约一个数量级的加速,并且在连续分级概率得分(CRPS)意义下,预测精度相比当前最先进方法提高了三倍。所有方法都实现在一个免费的 C++ 软件库中,并提供了高层的 Python 和 R 包。
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation
results: 在 Waterbirds 和 CelebA 图像数据集以及 MultiNLI 自然语言处理数据集上进行评估,发现该算法优于现有的概念移除方法。Abstract
Out-of-distribution generalization in neural networks is often hampered by spurious correlations. A common strategy is to mitigate this by removing spurious concepts from the neural network representation of the data. Existing concept-removal methods tend to be overzealous by inadvertently eliminating features associated with the main task of the model, thereby harming model performance. We propose an iterative algorithm that separates spurious from main-task concepts by jointly identifying two low-dimensional orthogonal subspaces in the neural network representation. We evaluate the algorithm on benchmark datasets for computer vision (Waterbirds, CelebA) and natural language processing (MultiNLI), and show that it outperforms existing concept removal methods
摘要
神经网络的分布外泛化常常受到虚假相关性的干扰。常见的应对策略是从神经网络的数据表示中移除虚假概念。然而,现有的概念移除方法往往过于激进,会不慎去除与模型主要任务相关的特征,从而损害模型性能。我们提出了一种迭代算法,通过在神经网络表示中联合识别两个低维正交子空间,将虚假概念与主任务概念分离开来。我们在计算机视觉(Waterbirds、CelebA)和自然语言处理(MultiNLI)的基准数据集上评估了该算法,结果显示它优于现有的概念移除方法。
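The following is a minimal sketch of the general idea of removing a concept subspace from representations by orthogonal projection. It only illustrates the projection step on a synthetic spurious direction; the paper's joint estimation of main-task and spurious subspaces is not reproduced, and all names are illustrative.

```python
import numpy as np

def concept_subspace(Z, concept_labels, k=2):
    """Estimate a k-dimensional concept subspace from class-mean differences.

    Z: (n, d) representations; concept_labels: (n,) integer concept labels."""
    means = np.stack([Z[concept_labels == c].mean(0) for c in np.unique(concept_labels)])
    centered = means - means.mean(0)
    # top-k right singular vectors span the directions separating concept classes
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k].T                                   # (d, k) orthonormal basis

def remove_subspace(Z, V):
    """Project representations onto the orthogonal complement of span(V)."""
    return Z - (Z @ V) @ V.T

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 64))
spurious = rng.integers(0, 2, size=1000)
Z[spurious == 1, 0] += 3.0                            # plant a spurious direction

V = concept_subspace(Z, spurious, k=1)
Z_clean = remove_subspace(Z, V)
# after projection, the two concept groups are no longer separated on axis 0
print(abs(Z_clean[spurious == 1, 0].mean() - Z_clean[spurious == 0, 0].mean()))
```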
results: 在五个常用的图像聚类 benchmark 上达到了最先进的性能,包括完整的 ImageNet-1K 数据集。Abstract
The core of clustering is incorporating prior knowledge to construct supervision signals. From classic k-means based on data compactness to recent contrastive clustering guided by self-supervision, the evolution of clustering methods intrinsically corresponds to the progression of supervision signals. At present, substantial efforts have been devoted to mining internal supervision signals from data. Nevertheless, the abundant external knowledge such as semantic descriptions, which naturally conduces to clustering, is regrettably overlooked. In this work, we propose leveraging external knowledge as a new supervision signal to guide clustering, even though it seems irrelevant to the given data. To implement and validate our idea, we design an externally guided clustering method (Text-Aided Clustering, TAC), which leverages the textual semantics of WordNet to facilitate image clustering. Specifically, TAC first selects and retrieves WordNet nouns that best distinguish images to enhance the feature discriminability. Then, to improve image clustering performance, TAC collaborates text and image modalities by mutually distilling cross-modal neighborhood information. Experiments demonstrate that TAC achieves state-of-the-art performance on five widely used and three more challenging image clustering benchmarks, including the full ImageNet-1K dataset.
摘要
聚类的核心在于利用先验知识构建监督信号。从基于数据紧致性的经典 k-means,到近期由自监督引导的对比聚类,聚类方法的演进本质上对应着监督信号的演进。目前,大量工作致力于从数据内部挖掘监督信号,然而诸如语义描述等天然有助于聚类的丰富外部知识却被忽视了。在这项工作中,我们提出利用外部知识作为新的监督信号来引导聚类,即便它表面上与给定数据无关。为了实现并验证这一想法,我们设计了一种外部知识引导的聚类方法(Text-Aided Clustering,TAC),借助 WordNet 的文本语义来辅助图像聚类。具体而言,TAC 首先选取并检索最能区分图像的 WordNet 名词,以增强特征的判别性;然后,为提升图像聚类性能,TAC 通过相互蒸馏跨模态邻域信息,使文本与图像两种模态协同工作。实验表明,TAC 在五个广泛使用的图像聚类 benchmark 以及三个更具挑战性的 benchmark(包括完整的 ImageNet-1K 数据集)上均达到了最先进的性能。
A Finite-Horizon Approach to Active Level Set Estimation
results: 实验表明,随着移动成本的增加,我们的方法能够非短视地考虑移动距离,从而显著改进现有方法;在真实的空气质量数据上,我们的方法仅用不到竞争算法一半的成本,就将估计误差降低到其约五分之一。Abstract
We consider the problem of active learning in the context of spatial sampling for level set estimation (LSE), where the goal is to localize all regions where a function of interest lies above/below a given threshold as quickly as possible. We present a finite-horizon search procedure to perform LSE in one dimension while optimally balancing both the final estimation error and the distance traveled for a fixed number of samples. A tuning parameter is used to trade off between the estimation accuracy and distance traveled. We show that the resulting optimization problem can be solved in closed form and that the resulting policy generalizes existing approaches to this problem. We then show how this approach can be used to perform level set estimation in higher dimensions under the popular Gaussian process model. Empirical results on synthetic data indicate that as the cost of travel increases, our method's ability to treat distance nonmyopically allows it to significantly improve on the state of the art. On real air quality data, our approach achieves roughly one fifth the estimation error at less than half the cost of competing algorithms.
摘要
我们研究在空间采样情境下用于水平集估计(LSE)的主动学习问题,其目标是尽可能快地定位目标函数高于或低于给定阈值的所有区域。我们提出了一种有限时域(finite-horizon)搜索过程,用于在一维情形下进行 LSE,在固定采样次数下同时权衡最终估计误差与移动距离。我们引入一个调节参数,用于在估计精度与移动距离之间进行权衡。我们证明所得到的优化问题存在闭式解,且由此得到的策略推广了针对该问题的现有方法。随后,我们展示了如何在常用的高斯过程模型下,将该方法用于更高维度的水平集估计。在合成数据上的实验结果表明,随着移动成本的增加,我们的方法因能非短视地考虑距离而显著优于现有最佳方法。在真实的空气质量数据上,我们的方法仅用不到竞争算法一半的成本,就将估计误差降低到其约五分之一。
Can bin-wise scaling improve consistency and adaptivity of prediction uncertainty for machine learning regression ?
results: 作者在一个 benchmark 数据集上测试了 BVS 和其变体,与 isotonic regression 进行比较,发现 BVS 和其变体可以更好地适应不同的输入特征,提高calibration的效果。Abstract
Binwise Variance Scaling (BVS) has recently been proposed as a post hoc recalibration method for prediction uncertainties of machine learning regression problems that is capable of more efficient corrections than uniform variance (or temperature) scaling. The original version of BVS uses uncertainty-based binning, which is aimed to improve calibration conditionally on uncertainty, i.e. consistency. I explore here several adaptations of BVS, in particular with alternative loss functions and a binning scheme based on an input feature (X), in order to improve adaptivity, i.e. calibration conditional on X. The performances of BVS and its proposed variants are tested on a benchmark dataset for the prediction of atomization energies and compared to the results of isotonic regression.
摘要
Binwise Variance Scaling(BVS)是一种最近提出的后处重标定方法,用于机器学习回归问题的预测不确定性,其修正效率高于统一方差(或温度)缩放。原版 BVS 使用基于不确定性的分箱,旨在改进以不确定性为条件的校准,即一致性。我在这里探索了 BVS 的若干变体,包括使用不同的损失函数,以及基于输入特征(X)的分箱方案,以提高适应性,即以 X 为条件的校准。我们在一个预测原子化能的 benchmark 数据集上测试了 BVS 及其变体的性能,并与 isotonic regression 的结果进行了比较。
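A minimal sketch of the bin-wise variance scaling idea, assuming a regressor that outputs a predictive standard deviation per sample: calibration points are binned by predicted uncertainty, and each bin gets a scaling factor that makes the mean squared z-score equal to one. The bin count, quantile binning, and names are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def fit_bvs(uncal_sigma, errors, n_bins=10):
    """Fit one variance-scaling factor per uncertainty bin (calibration set).

    uncal_sigma: predicted standard deviations; errors: y_true - y_pred."""
    edges = np.quantile(uncal_sigma, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    scales = np.ones(n_bins)
    for b in range(n_bins):
        mask = (uncal_sigma > edges[b]) & (uncal_sigma <= edges[b + 1])
        if mask.sum() > 1:
            # variance-based scaling: mean squared z-score should equal 1
            scales[b] = np.sqrt(np.mean((errors[mask] / uncal_sigma[mask]) ** 2))
    return edges, scales

def apply_bvs(uncal_sigma, edges, scales):
    bins = np.clip(np.searchsorted(edges, uncal_sigma) - 1, 0, len(scales) - 1)
    return uncal_sigma * scales[bins]

rng = np.random.default_rng(0)
sigma_cal = rng.uniform(0.1, 1.0, 5000)
err_cal = rng.normal(scale=2.0 * sigma_cal)       # model under-estimates its error
edges, scales = fit_bvs(sigma_cal, err_cal)
print(scales.round(2))                            # factors close to 2 in every bin
print(apply_bvs(sigma_cal[:5], edges, scales))
```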
Take the aTrain. Introducing an Interface for the Accessible Transcription of Interviews
results: 根据论文的描述,在现有的移动CPU上,使用最高精度的转录模型时,转录时间约为音频时长的2-3倍;如果有入门级的显卡可用,则转录时间可缩短到音频时长的约20%。Abstract
aTrain is an open-source and offline tool for transcribing audio data in multiple languages with CPU and NVIDIA GPU support. It is specifically designed for researchers using qualitative data generated from various forms of speech interactions with research participants. aTrain requires no programming skills, runs on most computers, does not require an internet connection, and was verified not to upload data to any server. aTrain combines OpenAI's Whisper model with speaker recognition to provide output that integrates with the popular qualitative data analysis software tools MAXQDA and ATLAS.ti. It has an easy-to-use graphical interface and is provided as a Windows-App through the Microsoft Store allowing for simple installation by researchers. The source code is freely available on GitHub. Having developed aTrain with a focus on speed on local computers, we show that the transcription time on current mobile CPUs is around 2 to 3 times the duration of the audio file using the highest-accuracy transcription models. If an entry-level graphics card is available, the transcription speed increases to 20% of the audio duration.
摘要
aTrain 是一個開源、離線的工具,支援以 CPU 和 NVIDIA GPU 轉錄多種語言的語音資料。它專為使用質性資料的研究者而設計,這些資料來自與研究參與者的各種語音互動。aTrain 不需要程式設計技能,可在大多數電腦上執行,不需要網路連線,並且經驗證不會將資料上傳到任何伺服器。aTrain 結合 OpenAI 的 Whisper 模型與說話人辨識,其輸出可與常用的質性資料分析軟體 MAXQDA 和 ATLAS.ti 整合。它具有易用的圖形介面,並透過 Microsoft Store 以 Windows 應用程式的形式提供,方便研究人員安裝。原始碼在 GitHub 上免費公開。由於我們在開發 aTrain 時著重於本地電腦上的速度,我們顯示在現有的行動 CPU 上,使用最高精度的轉錄模型時,轉錄時間約為音訊長度的 2 到 3 倍;若有入門級顯示卡可用,轉錄時間可縮短到音訊長度的約 20%。
Flexible Payload Configuration for Satellites using Machine Learning
paper_authors: Marcele O. K. Mendonca, Flor G. Ortiz-Gomez, Jorge Querol, Eva Lagunas, Juan A. Vásquez Peralvo, Victor Monzon Baeza, Symeon Chatzinotas, Bjorn Ottersten
results: 通过对ML模型的表现进行评估,并考虑模型的资源分配决策对总体通信系统性能的影响,提出了一种Context-aware ML metric。Abstract
Satellite communications, essential for modern connectivity, extend access to maritime, aeronautical, and remote areas where terrestrial networks are unfeasible. Current GEO systems distribute power and bandwidth uniformly across beams using multi-beam footprints with fractional frequency reuse. However, recent research reveals the limitations of this approach in heterogeneous traffic scenarios, leading to inefficiencies. To address this, this paper presents a machine learning (ML)-based approach to Radio Resource Management (RRM). We treat the RRM task as a regression ML problem, integrating RRM objectives and constraints into the loss function that the ML algorithm aims at minimizing. Moreover, we introduce a context-aware ML metric that evaluates the ML model's performance but also considers the impact of its resource allocation decisions on the overall performance of the communication system.
摘要
卫星通信是现代连接的关键,可将接入能力扩展到地面网络难以覆盖的海上、航空和偏远地区。现有的 GEO 系统采用多波束覆盖和部分频率复用,在各波束间均匀分配功率和带宽。然而,最新研究表明,这种方法在异构流量场景下存在局限,导致资源利用效率低下。为解决这一问题,这篇论文提出了一种基于机器学习(ML)的无线资源管理(RRM)方法。我们将 RRM 任务视为一个回归 ML 问题,将 RRM 的目标和约束融入 ML 算法所要最小化的损失函数中。此外,我们还引入了一种情境感知的 ML 指标,它不仅评估 ML 模型的性能,还考虑其资源分配决策对通信系统整体性能的影响。
results: 我们在七个 benchmark 上进行了实验,涵盖分类和回归任务,结果显示序列模型可以成为通用 MCL 的有效解决方案。Abstract
In this work, we aim to establish a strong connection between two significant bodies of machine learning research: continual learning and sequence modeling. That is, we propose to formulate continual learning as a sequence modeling problem, allowing advanced sequence models to be utilized for continual learning. Under this formulation, the continual learning process becomes the forward pass of a sequence model. By adopting the meta-continual learning (MCL) framework, we can train the sequence model at the meta-level, on multiple continual learning episodes. As a specific example of our new formulation, we demonstrate the application of Transformers and their efficient variants as MCL methods. Our experiments on seven benchmarks, covering both classification and regression, show that sequence models can be an attractive solution for general MCL.
摘要
在这项工作中,我们希望在机器学习的两个重要研究方向之间建立紧密联系:持续学习与序列建模。也就是说,我们提议将持续学习表述为一个序列建模问题,从而可以利用先进的序列模型来进行持续学习。在这种表述下,持续学习过程就成为序列模型的一次前向传播。通过采用元持续学习(meta-continual learning,MCL)框架,我们可以在多个持续学习回合上对序列模型进行元层面的训练。作为这一新表述的具体示例,我们展示了将 Transformer 及其高效变体用作 MCL 方法。我们在七个涵盖分类和回归任务的 benchmark 上进行了实验,结果表明序列模型可以成为通用 MCL 的有效解决方案。
Interpretable Spectral Variational AutoEncoder (ISVAE) for time series clustering
results: 实验结果表明,ISVAE模型在聚类指标上优于传统的VAE模型,并且可以更好地处理复杂的数据配置。此外,${f_0}$的演化学习轨迹呈现为动态层次树,揭示了各聚类之间的相似性。Abstract
The best encoding is the one that is interpretable in nature. In this work, we introduce a novel model that incorporates an interpretable bottleneck, termed the Filter Bank (FB), at the outset of a Variational Autoencoder (VAE). This arrangement compels the VAE to attend to the most informative segments of the input signal, fostering the learning of a novel encoding ${f_0}$ which boasts enhanced interpretability and clusterability over traditional latent spaces. By deliberately constraining the VAE with this FB, we intentionally constrict its capacity to access broad input domain information, promoting the development of an encoding that is discernible, separable, and of reduced dimensionality. The evolutionary learning trajectory of ${f_0}$ further manifests as a dynamic hierarchical tree, offering profound insights into cluster similarities. Additionally, for handling intricate data configurations, we propose a tailored decoder structure that is symmetrically aligned with FB's architecture. Empirical evaluations highlight the superior efficacy of ISVAE, which compares favorably to state-of-the-art results in clustering metrics across real-world datasets.
摘要
最好的编码是本质上可解释的编码。在这项工作中,我们提出了一种新模型,在变分自编码器(VAE)的最前端引入一个可解释的瓶颈,称为滤波器组(Filter Bank,FB)。这种设计迫使 VAE 关注输入信号中信息量最大的片段,从而促进学习一种新的编码${f_0}$,其可解释性和可聚类性均优于传统的潜在空间。通过有意地用 FB 约束 VAE,我们刻意限制其获取宽泛输入域信息的能力,从而促使所学编码更易辨识、更可分离且维度更低。${f_0}$的演化学习轨迹进一步呈现为一棵动态层次树,为聚类间的相似性提供了深刻的洞见。此外,为处理复杂的数据配置,我们提出了一种与 FB 结构对称对齐的定制解码器结构。实验评估表明,ISVAE 的效果突出,在多个真实数据集的聚类指标上可与最先进的结果相媲美。
Accelerated Policy Gradient: On the Nesterov Momentum for Reinforcement Learning
results: 我们证明了在使用真实梯度时,APG 能以 $\tilde{O}(1/t^2)$ 的速率收敛到最优策略。此外,我们还通过数值验证表明,APG 在实际中能显著改善标准策略梯度的收敛行为。Abstract
Policy gradient methods have recently been shown to enjoy global convergence at a $\Theta(1/t)$ rate in the non-regularized tabular softmax setting. Accordingly, one important research question is whether this convergence rate can be further improved, with only first-order updates. In this paper, we answer the above question from the perspective of momentum by adapting the celebrated Nesterov's accelerated gradient (NAG) method to reinforcement learning (RL), termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG in achieving faster global convergence, we formally show that with the true gradient, APG with softmax policy parametrization converges to an optimal policy at a $\tilde{O}(1/t^2)$ rate. To the best of our knowledge, this is the first characterization of the global convergence rate of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the initialization, APG could end up reaching a locally nearly-concave regime, where APG could benefit significantly from the momentum, within finite iterations. By means of numerical validation, we confirm that APG exhibits $\tilde{O}(1/t^2)$ rate as well as show that APG could significantly improve the convergence behavior over the standard policy gradient.
摘要
我们正式证明,在使用真实梯度时,采用 softmax 策略参数化的 APG 能以 $\tilde{O}(1/t^2)$ 的速率收敛到最优策略。据我们所知,这是首次刻画 NAG 在强化学习情境下的全局收敛速率。我们的分析依赖于一个有趣的发现:无论如何初始化,APG 都会在有限次迭代内进入一个局部近似凹的区域,在该区域中动量能带来显著收益。数值验证确认了 APG 具有 $\tilde{O}(1/t^2)$ 的收敛速率,并表明 APG 能显著改善标准策略梯度的收敛行为。
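To make the update concrete, here is a minimal sketch of Nesterov-style momentum applied to the exact policy gradient of a tabular softmax policy on a one-state bandit. The step size and momentum schedule are common textbook choices, not necessarily those analysed in the paper.

```python
import numpy as np

rewards = np.array([1.0, 0.8, 0.2])          # one-state bandit, 3 actions

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def true_policy_gradient(theta):
    """Exact gradient of J(theta) = E_{a ~ pi_theta}[r(a)] for a softmax policy."""
    pi = softmax(theta)
    J = pi @ rewards
    return pi * (rewards - J), J

theta = np.zeros(3)
theta_prev = theta.copy()
eta = 1.0
for t in range(1, 2001):
    # Nesterov look-ahead point, then a gradient *ascent* step taken from it
    y = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    grad, _ = true_policy_gradient(y)
    theta_prev, theta = theta, y + eta * grad

_, J = true_policy_gradient(theta)
print(f"suboptimality: {rewards.max() - J:.2e}")
```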
paper_authors: Sebastian Egginger, Alona Sakhnenko, Jeanette Miriam Lorenz
for: 本研究旨在考察超参数选择对模型性能以及经典核与量子核之间泛化差距的影响,并探索将几何差异(geometric difference)作为评估量子优势潜力的工具。
methods: 本研究使用量子核方法和超参数优化技术,在 11 个数据集上评估量子与经典机器学习模型的性能,并将几何差异作为两种基于核的机器学习方法之间的接近度度量。
results: 研究发现,超参数优化对于获得良好的模型性能以及缩小经典核与量子核之间的泛化差距至关重要;几何差异可以作为评估量子优势潜力的有用工具,并有助于在考察新数据集时识别可以利用的共性。Abstract
Quantum kernel methods are a promising method in quantum machine learning thanks to the guarantees connected to them. Their accessibility for analytic considerations also opens up the possibility of prescreening datasets based on their potential for a quantum advantage. To do so, earlier works developed the geometric difference, which can be understood as a closeness measure between two kernel-based machine learning approaches, most importantly between a quantum kernel and classical kernel. This metric links the quantum and classical model complexities. Therefore, it raises the question of whether the geometric difference, based on its relation to model complexity, can be a useful tool in evaluations other than for the potential for quantum advantage. In this work, we investigate the effects of hyperparameter choice on the model performance and the generalization gap between classical and quantum kernels. The importance of hyperparameter optimization is well known also for classical machine learning. Especially for the quantum Hamiltonian evolution feature map, the scaling of the input data has been shown to be crucial. However, there are additional parameters left to be optimized, like the best number of qubits to trace out before computing a projected quantum kernel. We investigate the influence of these hyperparameters and compare the classically reliable method of cross validation with the method of choosing based on the geometric difference. Based on the thorough investigation of the hyperparameters across 11 datasets we identified commodities that can be exploited when examining a new dataset. In addition, our findings contribute to better understanding of the applicability of the geometric difference.
摘要
量子核方法凭借其理论保证,是量子机器学习中一种有前景的方法。它们便于解析分析的特点,也使得我们可以根据数据集是否具有量子优势潜力对其进行预筛。为此,先前的工作提出了几何差异(geometric difference),它可以理解为两种基于核的机器学习方法之间的接近度度量,最重要的是量子核与经典核之间的接近度。该指标将量子模型与经典模型的复杂度联系起来,因此引出一个问题:基于其与模型复杂度的关系,几何差异是否也可以作为量子优势潜力评估之外的有用工具?在这项工作中,我们研究了超参数选择对模型性能以及经典核与量子核之间泛化差距的影响。超参数优化的重要性在经典机器学习中同样广为人知;特别是对于量子 Hamiltonian 演化特征映射,输入数据的缩放已被证明至关重要。然而,还有其他参数有待优化,例如在计算投影量子核之前追踪掉的最佳量子比特数。我们考察了这些超参数的影响,并将经典上可靠的交叉验证方法与基于几何差异的选择方法进行了比较。通过在 11 个数据集上对超参数进行全面考察,我们识别出了在考察新数据集时可以利用的共性。此外,我们的发现也有助于更好地理解几何差异的适用范围。
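For context, the geometric difference mentioned above compares a classical kernel matrix with a quantum kernel matrix via the spectral norm of a whitened product; the sketch below computes it for two arbitrary PSD kernel matrices with NumPy, using trace normalization and a small regularizer for invertibility. The two RBF kernels standing in for the classical and quantum kernels are purely illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def geometric_difference(K1, K2, reg=1e-7):
    """g(K1 || K2) = sqrt(|| sqrt(K2) K1^{-1} sqrt(K2) ||_spectral).

    Both kernel matrices are normalized to trace N first, as is commonly done."""
    n = K1.shape[0]
    K1 = n * K1 / np.trace(K1)
    K2 = n * K2 / np.trace(K2)
    root_K2 = np.real(sqrtm(K2))
    inv_K1 = np.linalg.inv(K1 + reg * np.eye(n))
    M = root_K2 @ inv_K1 @ root_K2
    return float(np.sqrt(np.linalg.norm(M, ord=2)))

# toy example: two RBF kernels with different length scales standing in for
# a classical kernel (K_c) and a "quantum" kernel (K_q)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K_c, K_q = np.exp(-sq / 2.0), np.exp(-sq / 0.2)
print(geometric_difference(K_c, K_q))   # larger values suggest more room for advantage
```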
Building a Graph-based Deep Learning network model from captured traffic traces
results: 我们的实验结果表明,提议的解决方案能够学习和泛化到未看过的捕捉网络场景。Abstract
Currently, state-of-the-art network models are based on or depend on Discrete Event Simulation (DES). While DES is highly accurate, it is also computationally costly and cumbersome to parallelize, making it impractical to simulate high performance networks. Additionally, simulated scenarios fail to capture all of the complexities present in real network scenarios. While there exist network models based on Machine Learning (ML) techniques to minimize these issues, these models are also trained with simulated data and hence vulnerable to the same pitfalls. Consequently, the Graph Neural Networking Challenge 2023 introduces a dataset of captured traffic traces that can be used to build a ML-based network model without these limitations. In this paper we propose a Graph Neural Network (GNN)-based solution specifically designed to better capture the complexities of real network scenarios. This is done through a novel encoding method to capture information from the sequence of captured packets, and an improved message passing algorithm to better represent the dependencies present in physical networks. We show that the proposed solution is able to learn and generalize to unseen captured network scenarios.
摘要
目前最先进的网络模型大多基于或依赖离散事件模拟(DES)。虽然 DES 具有很高的准确性,但其计算成本高且难以并行化,因而不适合用于模拟高性能网络。此外,模拟场景无法捕捉真实网络场景中的全部复杂性。虽然已有基于机器学习(ML)技术的网络模型试图缓解这些问题,但这些模型同样是用模拟数据训练的,因此面临同样的缺陷。为此,2023 年图神经网络挑战赛(Graph Neural Networking Challenge 2023)引入了一个由真实捕获的流量轨迹构成的数据集,可用于构建不受上述局限的基于 ML 的网络模型。在本文中,我们提出了一种基于图神经网络(GNN)的解决方案,专门用于更好地刻画真实网络场景的复杂性:一方面通过一种新的编码方法从捕获的数据包序列中提取信息,另一方面通过改进的消息传递算法更好地表达物理网络中的依赖关系。我们证明了所提方案能够学习并泛化到未见过的真实捕获网络场景。
Online Convex Optimization with Switching Cost and Delayed Gradients
results: 论文显示了OMGD算法的竞争比例upper bound为$4(L + 5) + \frac{16(L + 5)}{\mu}$,并且证明了这个Upper bound是order-wise tight。此外,论文还证明了任何在线算法的竞争比例至少为$\max\{\Omega(L), \Omega(L/\sqrt{\mu})\}$。Abstract
We consider the online convex optimization (OCO) problem with quadratic and linear switching cost in the limited information setting, where an online algorithm can choose its action using only gradient information about the previous objective function. For $L$-smooth and $\mu$-strongly convex objective functions, we propose an online multiple gradient descent (OMGD) algorithm and show that its competitive ratio for the OCO problem with quadratic switching cost is at most $4(L + 5) + \frac{16(L + 5)}{\mu}$. The competitive ratio upper bound for OMGD is also shown to be order-wise tight in terms of $L,\mu$. In addition, we show that the competitive ratio of any online algorithm is $\max\{\Omega(L), \Omega(\frac{L}{\sqrt{\mu}})\}$ in the limited information setting when the switching cost is quadratic. We also show that the OMGD algorithm achieves the optimal (order-wise) dynamic regret in the limited information setting. For the linear switching cost, the competitive ratio upper bound of the OMGD algorithm is shown to depend on both the path length and the squared path length of the problem instance, in addition to $L, \mu$, and is shown to be, order-wise, the best competitive ratio any online algorithm can achieve. Consequently, we conclude that the optimal competitive ratio for the quadratic and linear switching costs are fundamentally different in the limited information setting.
摘要
我们考虑在有限信息设定下的线上凸优化(OCO)问题,其中线上算法只能根据前一个目标函数的梯度信息来选择行动。对于$L$-smooth且$\mu$-强凸的目标函数,我们提出了一个线上多重梯度下降(OMGD)算法,并证明其在具有 quadratic switching cost 的OCO问题中的竞争比不大于$4(L + 5) + \frac{16(L + 5)}{\mu}$。此外,我们还证明了OMGD算法的竞争比上界在$L,\mu$意义下是order-wise tight。另外,我们还证明了在有限信息设定下,当 switching cost 为 quadratic 时,任何线上算法的竞争比至少为$\max\{\Omega(L), \Omega(L/\sqrt{\mu})\}$。我们也证明了OMGD算法在有限信息设定下达到了order-wise最佳的动态遗憾。在 linear switching cost 的情况下,OMGD算法的竞争比上界除了取决于$L, \mu$之外,还取决于问题实例的路径长度与平方路径长度,并且是任何线上算法所能达到的order-wise最佳竞争比。因此,我们得出结论:在有限信息设定下,quadratic 与 linear switching cost 的最优竞争比存在本质差异。
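A minimal sketch of the online multiple gradient descent idea described above: before committing to its next action, the learner takes several gradient steps on the most recently revealed loss, trading extra per-round computation against hitting and switching costs. The quadratic losses, step size, and number of inner steps are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def omgd(loss_grads, losses, x0, inner_steps=5, eta=0.1, switch_weight=1.0):
    """Online multiple gradient descent with a quadratic switching cost.

    loss_grads[t](x) returns the gradient of the loss revealed at round t."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for t in range(len(losses)):
        if t > 0:
            x_prev = x.copy()
            for _ in range(inner_steps):          # descend on the *previous* loss
                x = x - eta * loss_grads[t - 1](x)
            total += switch_weight * np.sum((x - x_prev) ** 2)   # switching cost
        total += losses[t](x)                     # hitting cost of the new loss
    return total

# drifting quadratic losses f_t(x) = ||x - c_t||^2
rng = np.random.default_rng(0)
centers = np.cumsum(rng.normal(scale=0.3, size=(50, 2)), axis=0)
losses = [lambda x, c=c: float(np.sum((x - c) ** 2)) for c in centers]
grads = [lambda x, c=c: 2.0 * (x - c) for c in centers]
print(omgd(grads, losses, x0=np.zeros(2)))
```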
SQ Lower Bounds for Learning Mixtures of Linear Classifiers
paper_authors: Ilias Diakonikolas, Daniel M. Kane, Yuxin Sun
for: 学习高斯协变量下线性分类器混合分布的问题。
methods: 在统计查询(Statistical Query,SQ)框架下分析该问题,并提供了一种新的球面设计(spherical designs)构造。
results: 得到了一个 SQ 下界,表明任何 SQ 算法的复杂度为 $n^{\mathrm{poly}(1/\Delta) \log(r)}$,其中 $\Delta$ 是各 $\mathbf{v}_\ell$ 之间两两 $\ell_2$ 距离分离度的下界。Abstract
We study the problem of learning mixtures of linear classifiers under Gaussian covariates. Given sample access to a mixture of $r$ distributions on $\mathbb{R}^n$ of the form $(\mathbf{x},y_{\ell})$, $\ell\in [r]$, where $\mathbf{x}\sim\mathcal{N}(0,\mathbf{I}_n)$ and $y_\ell=\mathrm{sign}(\langle\mathbf{v}_\ell,\mathbf{x}\rangle)$ for an unknown unit vector $\mathbf{v}_\ell$, the goal is to learn the underlying distribution in total variation distance. Our main result is a Statistical Query (SQ) lower bound suggesting that known algorithms for this problem are essentially best possible, even for the special case of uniform mixtures. In particular, we show that the complexity of any SQ algorithm for the problem is $n^{\mathrm{poly}(1/\Delta) \log(r)}$, where $\Delta$ is a lower bound on the pairwise $\ell_2$-separation between the $\mathbf{v}_\ell$'s. The key technical ingredient underlying our result is a new construction of spherical designs that may be of independent interest.
摘要
我们研究在高斯协变量下学习线性分类器混合分布的问题。假设我们可以从 $\mathbb{R}^n$ 上 $r$ 个分布的混合中采样,每个样本的形式为 $(\mathbf{x}, y_\ell)$,$\ell\in [r]$,其中 $\mathbf{x}\sim\mathcal{N}(0,\mathbf{I}_n)$,且 $y_\ell=\mathrm{sign}(\langle\mathbf{v}_\ell,\mathbf{x}\rangle)$,$\mathbf{v}_\ell$ 为未知的单位向量。我们的目标是在总变差距离意义下学习这个混合分布。我们的主要结果是一个统计查询(SQ)下界,它表明已知算法对该问题而言基本上是最优的,即使在均匀混合的特殊情形下也是如此。具体来说,我们证明任何 SQ 算法的复杂度为 $n^{\mathrm{poly}(1/\Delta) \log(r)}$,其中 $\Delta$ 是各 $\mathbf{v}_\ell$ 之间两两 $\ell_2$ 距离分离度的下界。我们结果背后的关键技术要素,是一种新的球面设计(spherical designs)构造,它本身可能也具有独立的研究价值。
Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function
results: 该论文证明了这种方法可以同时使用不精确计算的函数值、梯度和 Hessian 矩阵,并且可以在非凸优化问题中实现 $\epsilon$-近似二阶最优性。此外,这种方法的迭代复杂度与前人研究中采用精确计算时的复杂度同阶。Abstract
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties for non-convex optimization by concurrently computing function value, gradient, and Hessian matrix to obtain the next search direction and the adjusted parameters. Although stochastic approximations help largely reduce the computational cost, it is challenging to theoretically guarantee the convergence rate. In this paper, we explore a family of stochastic TR and ARC methods that can simultaneously provide inexact computations of the Hessian matrix, gradient, and function values. Our algorithms require much fewer propagations overhead per iteration than TR and ARC. We prove that the iteration complexity to achieve $\epsilon$-approximate second-order optimality is of the same order as the exact computations demonstrated in previous studies. Additionally, the mild conditions on inexactness can be met by leveraging a random sampling technology in the finite-sum minimization problem. Numerical experiments with a non-convex problem support these findings and demonstrate that, with the same or a similar number of iterations, our algorithms require less computational overhead per iteration than current second-order methods.
摘要
信赖域(TR)方法和基于三次正则化的自适应正则化(ARC)方法在非凸优化中具有非常吸引人的理论性质:它们同时计算函数值、梯度和 Hessian 矩阵,以确定下一步搜索方向和调整参数。尽管随机近似可以大幅降低计算成本,但从理论上保证其收敛速率却颇具挑战。在这篇论文中,我们探讨了一族随机 TR 和 ARC 方法,它们可以同时使用不精确计算的 Hessian 矩阵、梯度和函数值。与 TR 和 ARC 相比,我们的算法每次迭代所需的传播开销要少得多。我们证明,达到 $\epsilon$-近似二阶最优性所需的迭代复杂度,与前人研究中采用精确计算时的复杂度同阶。此外,在有限和最小化问题中,可以通过随机采样技术满足对不精确性的温和要求。在一个非凸问题上的数值实验支持了这些结论,并表明在相同或相近的迭代次数下,我们的算法每次迭代所需的计算开销低于现有的二阶方法。
Effective and Efficient Federated Tree Learning on Hybrid Data
paper_authors: Qinbin Li, Chulin Xie, Xiaojun Xu, Xiaoyuan Liu, Ce Zhang, Bo Li, Bingsheng He, Dawn Song
for: 该论文旨在应对混合数据场景下联邦学习的挑战,即不同参与方的数据在特征和样本两个方面都可能存在差异。
results: 实验表明,HybridTree 可以与中央集成集成环境相比,达到相同的准确率,而且可以减少计算和通信协议的开销,最高可以达到 8 倍的速度提升。Abstract
Federated learning has emerged as a promising distributed learning paradigm that facilitates collaborative learning among multiple parties without transferring raw data. However, most existing federated learning studies focus on either horizontal or vertical data settings, where the data of different parties are assumed to be from the same feature or sample space. In practice, a common scenario is the hybrid data setting, where data from different parties may differ both in the features and samples. To address this, we propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data. We observe the existence of consistent split rules in trees. With the help of these split rules, we theoretically show that the knowledge of parties can be incorporated into the lower layers of a tree. Based on our theoretical analysis, we propose a layer-level solution that does not need frequent communication traffic to train a tree. Our experiments demonstrate that HybridTree can achieve comparable accuracy to the centralized setting with low computational and communication overhead. HybridTree can achieve up to 8 times speedup compared with the other baselines.
摘要
联邦学习已成为一种有前景的分布式学习范式,它使多个参与方能够在不传输原始数据的情况下协作学习。然而,现有的联邦学习研究大多集中在水平或垂直数据设定上,即假设不同参与方的数据来自相同的特征空间或样本空间。在实践中,一种常见的情形是混合数据设定,即不同参与方的数据在特征和样本两方面都可能不同。为此,我们提出了 HybridTree,一种能够在混合数据上进行联邦树学习的新方法。我们观察到树中存在一致的分裂规则;借助这些分裂规则,我们从理论上证明各参与方的知识可以被纳入树的下层。基于这一理论分析,我们提出了一种层级式的解决方案,无需频繁的通信流量即可训练一棵树。实验表明,HybridTree 能够以较低的计算和通信开销达到与集中式设定相当的准确率,并且与其他基线相比最高可实现 8 倍的加速。
Accelerate Presolve in Large-Scale Linear Programming via Reinforcement Learning
results: 实验结果表明,RL4Presolve可以有效地提高大规模LP的解决效率,特别是在来自业界的benchmark中。此外,通过提取学习政策中的规则,可以将RL4Presolve简单地部署到华为的供应链中。这些结果表明,将机器学习技术应用于现代LP解决器可以实现可负担的经济和学术潜力。Abstract
Large-scale LP problems from industry usually contain much redundancy that severely hurts the efficiency and reliability of solving LPs, making presolve (i.e., the problem simplification module) one of the most critical components in modern LP solvers. However, how to design high-quality presolve routines -- that is, the program determining (P1) which presolvers to select, (P2) in what order to execute, and (P3) when to stop -- remains a highly challenging task due to the extensive requirements on expert knowledge and the large search space. Due to the sequential decision property of the task and the lack of expert demonstrations, we propose a simple and efficient reinforcement learning (RL) framework -- namely, reinforcement learning for presolve (RL4Presolve) -- to tackle (P1)-(P3) simultaneously. Specifically, we formulate the routine design task as a Markov decision process and propose an RL framework with adaptive action sequences to generate high-quality presolve routines efficiently. Note that adaptive action sequences help learn complex behaviors efficiently and adapt to various benchmarks. Experiments on two solvers (open-source and commercial) and eight benchmarks (real-world and synthetic) demonstrate that RL4Presolve significantly and consistently improves the efficiency of solving large-scale LPs, especially on benchmarks from industry. Furthermore, we optimize the hard-coded presolve routines in LP solvers by extracting rules from learned policies for simple and efficient deployment to Huawei's supply chain. The results show encouraging economic and academic potential for incorporating machine learning to modern solvers.
摘要
来自工业界的大规模 LP 问题通常包含大量冗余,严重影响 LP 求解的效率和可靠性,因此预求解(presolve,即问题化简模块)是现代 LP 求解器中最关键的组件之一。然而,如何设计高质量的预求解例程,即确定 (P1) 选择哪些预求解器、(P2) 以何种顺序执行、以及 (P3) 何时停止,仍然是一项极具挑战性的任务,因为它对专家知识的要求很高,且搜索空间庞大。鉴于该任务的序贯决策特性以及缺乏专家示范,我们提出了一个简单而高效的强化学习(RL)框架,即 RL4Presolve,用于同时解决 (P1)-(P3)。具体而言,我们将例程设计任务建模为一个马尔可夫决策过程,并提出一个带有自适应动作序列的 RL 框架,以高效生成高质量的预求解例程。自适应动作序列有助于高效学习复杂行为并适应不同的基准。在两个求解器(开源与商业)和八个基准(真实世界与合成)上的实验表明,RL4Presolve 能显著且稳定地提升大规模 LP 的求解效率,尤其是在来自工业界的基准上。此外,我们通过从学到的策略中提取规则来优化 LP 求解器中硬编码的预求解例程,使其能够简单高效地部署到华为的供应链中。这些结果表明,将机器学习引入现代求解器具有令人鼓舞的经济与学术潜力。
On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning
results: 论文发现了不同的目标规约形式化方法(formalism)各自存在一定的局限性,并且没有任何一种 formalism 既具有占优的表达能力,又能用现有技术直接优化。例如,论文证明了 Regularised RL、Outer Nonlinear Markov Rewards、Reward Machines、Linear Temporal Logic 和 Limit Average Rewards 各自都能表达其他 formalism 无法表达的目标。这些结论对于策略优化和奖励学习都有重要意义,也提示了在实践中规约目标时需要考虑的表达力限制。Abstract
To solve a task with reinforcement learning (RL), it is necessary to formally specify the goal of that task. Although most RL algorithms require that the goal is formalised as a Markovian reward function, alternatives have been developed (such as Linear Temporal Logic and Multi-Objective Reinforcement Learning). Moreover, it is well known that some of these formalisms are able to express certain tasks that other formalisms cannot express. However, there has not yet been any thorough analysis of how these formalisms relate to each other in terms of expressivity. In this work, we fill this gap in the existing literature by providing a comprehensive comparison of the expressivities of 17 objective-specification formalisms in RL. We place these formalisms in a preorder based on their expressive power, and present this preorder as a Hasse diagram. We find a variety of limitations for the different formalisms, and that no formalism is both dominantly expressive and straightforward to optimise with current techniques. For example, we prove that each of Regularised RL, Outer Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and Limit Average Rewards can express an objective that the others cannot. Our findings have implications for both policy optimisation and reward learning. Firstly, we identify expressivity limitations which are important to consider when specifying objectives in practice. Secondly, our results highlight the need for future research which adapts reward learning to work with a variety of formalisms, since many existing reward learning methods implicitly assume that desired objectives can be expressed with Markovian rewards. Our work contributes towards a more cohesive understanding of the costs and benefits of different RL objective-specification formalisms.
摘要
要用强化学习(RL)解决一个任务,需要正式地规约该任务的目标。尽管大多数 RL 算法要求将目标形式化为马尔可夫奖励函数,但也已经发展出了其他形式化方法(例如线性时序逻辑和多目标强化学习)。此外,众所周知,其中某些形式化方法能够表达其他方法无法表达的任务。然而,目前还没有关于这些形式化方法在表达能力上相互关系的系统分析。在这项工作中,我们通过对 RL 中 17 种目标规约形式化方法的表达能力进行全面比较,填补了现有文献中的这一空白。我们依据表达能力将这些形式化方法置于一个预序之中,并以 Hasse 图的形式呈现该预序。我们发现了各种形式化方法的多种局限,并且没有任何一种方法既具有占优的表达能力,又能用现有技术直接优化。例如,我们证明了 Regularised RL、Outer Nonlinear Markov Rewards、Reward Machines、线性时序逻辑和 Limit Average Rewards 各自都能表达其他方法无法表达的目标。我们的发现对策略优化和奖励学习都有重要意义。首先,我们识别出了在实践中规约目标时需要考虑的表达力限制;其次,我们的结果表明,未来需要研究如何让奖励学习适配多种形式化方法,因为许多现有的奖励学习方法隐式地假设期望的目标可以用马尔可夫奖励来表达。我们的工作有助于更全面地理解不同 RL 目标规约形式化方法的代价与收益。
Equivariant Bootstrapping for Uncertainty Quantification in Imaging Inverse Problems
for: This paper aims to accurately quantify the uncertainty in solutions to severely ill-posed scientific imaging problems, which is critical for interpreting experimental results and using reconstructed images as scientific evidence.
methods: The proposed uncertainty quantification methodology is based on an equivariant formulation of the parametric bootstrap algorithm, which leverages symmetries and invariance properties commonly encountered in imaging problems. The method is general and can be applied with any image reconstruction technique, including unsupervised training strategies.
results: The proposed method delivers remarkably accurate high-dimensional confidence regions and outperforms alternative uncertainty quantification strategies in terms of estimation accuracy, uncertainty quantification accuracy, and computing time. The method is demonstrated through a series of numerical experiments.Abstract
Scientific imaging problems are often severely ill-posed, and hence have significant intrinsic uncertainty. Accurately quantifying the uncertainty in the solutions to such problems is therefore critical for the rigorous interpretation of experimental results as well as for reliably using the reconstructed images as scientific evidence. Unfortunately, existing imaging methods are unable to quantify the uncertainty in the reconstructed images in a manner that is robust to experiment replications. This paper presents a new uncertainty quantification methodology based on an equivariant formulation of the parametric bootstrap algorithm that leverages symmetries and invariance properties commonly encountered in imaging problems. Additionally, the proposed methodology is general and can be easily applied with any image reconstruction technique, including unsupervised training strategies that can be trained from observed data alone, thus enabling uncertainty quantification in situations where there is no ground truth data available. We demonstrate the proposed approach with a series of numerical experiments and through comparisons with alternative uncertainty quantification strategies from the state-of-the-art, such as Bayesian strategies involving score-based diffusion models and Langevin samplers. In all our experiments, the proposed method delivers remarkably accurate high-dimensional confidence regions and outperforms the competing approaches in terms of estimation accuracy, uncertainty quantification accuracy, and computing time.
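A minimal sketch of the equivariant parametric bootstrap loop on a toy denoising problem: each replicate applies a random transformation from a symmetry group (here axis flips), re-simulates data from the transformed reconstruction, re-reconstructs, and maps the result back before forming pixel-wise intervals. The identity forward operator and the soft-thresholding "reconstruction" are placeholders, not the paper's setup.

```python
import numpy as np

def reconstruct(y, sigma):
    """Placeholder reconstruction: a crude soft-thresholding denoiser."""
    return np.sign(y) * np.maximum(np.abs(y) - sigma, 0.0)

def random_flip():
    ax = [None, 0, 1][np.random.randint(3)]
    T = (lambda z: z) if ax is None else (lambda z, a=ax: np.flip(z, axis=a))
    return T, T                 # axis flips are their own inverse

def equivariant_bootstrap(y, sigma, n_boot=200):
    x_hat = reconstruct(y, sigma)
    replicates = []
    for _ in range(n_boot):
        T, T_inv = random_flip()                                  # random group element
        y_star = T(x_hat) + sigma * np.random.randn(*y.shape)     # simulate new data
        replicates.append(T_inv(reconstruct(y_star, sigma)))      # map back
    reps = np.stack(replicates)
    lo, hi = np.percentile(reps, [5, 95], axis=0)                 # pixel-wise 90% bands
    return x_hat, lo, hi

x_true = np.zeros((32, 32)); x_true[8:24, 8:24] = 1.0
sigma = 0.3
y = x_true + sigma * np.random.randn(32, 32)
x_hat, lo, hi = equivariant_bootstrap(y, sigma)
print(float(((x_true >= lo) & (x_true <= hi)).mean()))            # empirical coverage
```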
Optimising Distributions with Natural Gradient Surrogates
results: 扩展了可以用自然梯度法高效优化的分布类型;该方法快速、易于理解、实现简单,且不需要冗长的针对特定模型的推导。Abstract
Natural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient has a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.
摘要
自然梯度方法已被用于在多种场景下优化概率分布的参数,通常能得到快速收敛的过程。然而,对许多我们感兴趣的分布而言,计算自然梯度存在诸多困难。在这项工作中,我们提出了一种解决此类问题的新技巧:将优化重新表述为对一个代理(surrogate)分布参数的优化,而该代理分布的自然梯度易于计算。我们给出了若干现有方法的例子,说明它们都可以被解释为应用了这一技巧,并提出了一种可应用于广泛问题的新方法。我们的方法扩展了可以用自然梯度高效优化的分布类型。此外,它快速、易于理解、可用标准自动微分软件简单实现,且不需要冗长的针对特定模型的推导。我们在最大似然估计和变分推断任务上演示了该方法。
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
results: 实验结果表明,该模型在情绪识别、音频分类和检索基准上表现出色,具有适用于多语言和各种声学条件的共享表示能力,同时还能够编码潜在的情感维度。Abstract
This paper proposes a novel framework for multilingual speech and sound representation learning using contrastive learning. The lack of sizeable labelled datasets hinders speech-processing research across languages. Recent advances in contrastive learning provide self-supervised techniques to learn from unlabelled data. Motivated by reducing data dependence and improving generalisation across diverse languages and conditions, we develop a multilingual contrastive framework. This framework enables models to acquire shared representations across languages, facilitating cross-lingual transfer with limited target language data. Additionally, capturing emotional cues within speech is challenging due to subjective perceptual assessments. By learning expressive representations from diverse, multilingual data in a self-supervised manner, our approach aims to develop speech representations that encode emotive dimensions. Our method trains encoders on a large corpus of multi-lingual audio data. Data augmentation techniques are employed to expand the dataset. The contrastive learning approach trains the model to maximise agreement between positive pairs and minimise agreement between negative pairs. Extensive experiments demonstrate state-of-the-art performance of the proposed model on emotion recognition, audio classification, and retrieval benchmarks under zero-shot and few-shot conditions. This provides an effective approach for acquiring shared and generalised speech representations across languages and acoustic conditions while encoding latent emotional dimensions.
摘要
这篇论文提出了一种使用对比学习的多语言语音与声音表示学习新框架。由于各语言缺乏大规模标注数据,跨语言的语音处理研究受到阻碍;而最近的对比学习进展提供了可以从无标注数据中学习的自监督技术。我们的方法旨在降低对数据的依赖、提升在不同语言和条件下的泛化能力,并通过以自监督方式从多样化的多语言数据中学习富有表现力的表示,来编码语音中的情感维度。我们在一个大规模多语言音频语料库上训练编码器,并使用数据增强技术扩充数据集。对比学习方法训练模型最大化正样本对之间的一致性,并最小化负样本对之间的一致性。大量实验表明,所提模型在零样本和少样本条件下的情感识别、音频分类和检索基准上均达到了最先进的性能。这为跨语言、跨声学条件获取共享且可泛化的语音表示、并同时编码潜在情感维度提供了一种有效途径。
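A minimal sketch of the symmetric contrastive (InfoNCE-style) objective that this kind of audio-text representation learning relies on: matched audio/transcript pairs in a batch are positives and all other pairings are negatives. The linear stand-in encoders, feature sizes, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style loss: matched audio/text pairs are positives, the rest negatives."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# stand-in encoders: real models would be audio / text transformers
audio_encoder = torch.nn.Linear(128, 64)
text_encoder = torch.nn.Linear(300, 64)

audio_feats = torch.randn(16, 128)     # e.g. pooled spectrogram features
text_feats = torch.randn(16, 300)      # e.g. pooled transcript embeddings
loss = symmetric_contrastive_loss(audio_encoder(audio_feats), text_encoder(text_feats))
loss.backward()
print(float(loss))
```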
Towards Graph Foundation Models: A Survey and Beyond
results: 文章提供了对当前图基础模型(graph foundation models)领域的全面概述,并讨论了这个领域的未来研究方向。Abstract
Emerging as fundamental building blocks for diverse artificial intelligence applications, foundation models have achieved notable success across natural language processing and many other domains. Parallelly, graph machine learning has witnessed a transformative shift, with shallow methods giving way to deep learning approaches. The emergence and homogenization capabilities of foundation models have piqued the interest of graph machine learning researchers, sparking discussions about developing the next graph learning paradigm that is pre-trained on broad graph data and can be adapted to a wide range of downstream graph tasks. However, there is currently no clear definition and systematic analysis for this type of work. In this article, we propose the concept of graph foundation models (GFMs), and provide the first comprehensive elucidation on their key characteristics and technologies. Following that, we categorize existing works towards GFMs into three categories based on their reliance on graph neural networks and large language models. Beyond providing a comprehensive overview of the current landscape of graph foundation models, this article also discusses potential research directions for this evolving field.
摘要
基础模型作为多种人工智能应用的基本构件,已在自然语言处理等众多领域取得显著成功。与此同时,图机器学习也经历了一次变革,浅层方法逐渐让位于深度学习方法。基础模型的涌现能力与同质化能力引起了图机器学习研究者的兴趣,并引发了关于下一代图学习范式的讨论:即在大规模图数据上预训练、并可适配到各类下游图任务的模型。然而,目前对这类工作还缺乏明确的定义和系统的分析。在本文中,我们提出了图基础模型(Graph Foundation Models,GFMs)的概念,并首次全面阐述其关键特性与相关技术。随后,我们根据现有工作对图神经网络和大语言模型的依赖程度,将其划分为三类。除了全面概述图基础模型的研究现状之外,本文还讨论了这一不断发展的领域的潜在研究方向。
results: 这篇论文提出了一些历史上对数据流机器学习的假设,并将这些假设放在历史上的学术背景中进行了回顾。Abstract
Machine learning from data streams is an active and growing research area. Research on learning from streaming data typically makes strict assumptions linked to computational resource constraints, including requirements for stream mining algorithms to inspect each instance not more than once and be ready to give a prediction at any time. Here we review the historical context of data streams research placing the common assumptions used in machine learning over data streams in their historical context.
摘要
从数据流中进行机器学习是一个活跃且不断发展的研究领域。针对流式数据学习的研究通常会做出与计算资源限制相关的严格假设,包括要求流挖掘算法对每个样本最多只检查一次,并且能够随时给出预测。在这里,我们回顾了数据流研究的历史脉络,并将数据流机器学习中常用的这些假设置于其历史背景之中加以审视。
De novo protein design using geometric vector field networks
results: 本论文的实验结果显示,VFN在蛋白质diffusion(框架建模)中表现出色,比起先前的IPA模型,VFN在设计性(67.04% vs. 53.58%)和多样性(66.54% vs. 51.98%)等方面均有较好的表现。此外,VFN也在蛋白质逆折叠(框架与原子建模)中表现出色,在序列恢复率上优于先前的PiFold模型(54.7% vs. 51.66%)。此外,本论文还提出了一种将VFN与ESM模型结合的方法,该方法显著优于先前基于ESM的SoTA(62.67% vs. 55.65%)。Abstract
Innovations like protein diffusion have enabled significant progress in de novo protein design, which is a vital topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Thus far, only several simple encoders, such as IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we proffer the Vector Field Network (VFN), which enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits an impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding (frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate. We also propose a method of equipping VFN with the ESM model, which significantly surpasses the previous ESM-based SoTA (62.67% vs. 55.65%), LM-Design, by a substantial margin.
摘要
新技术如蛋白diffusion已经使得蛋白结构设计得到了重要的进步,这是生命科学中非常重要的话题。这些方法通常依赖于蛋白结构编码器来模拟蛋白质量框架,其中原子不存在。以前的编码器大多数依赖于原子粒子特征,如原子之间的角度和距离,这些特征在这种情况下不可用。只有一些简单的编码器,如IPA,已经被提出,这暴露了框架模型化为瓶颈。在这种工作中,我们提议使用 Vector Field Network(VFN),它使得网络层可以通过学习vector计算来处理坐标相关的操作。VFN在蛋白diffusion(框架模型)中表现出了非常出色的性能优势,比IPA更高,达到67.04% vs. 53.58%的设计性能和66.54% vs. 51.98%的多样性。在 inverse folding(框架和原子模型)中,VFN也超越了之前的SoTA模型,PiFold(54.7% vs. 51.66%),在序列恢复率方面表现出色。我们还提出了使用VFN和ESM模型的方法,该方法在之前的ESM-based SoTA(62.67% vs. 55.65%)之上显著提高了性能。
Adversarial Training for Physics-Informed Neural Networks
paper_authors: Yao Li, Shengzhu Shi, Zhichang Guo, Boying Wu
for: 解决复杂偏微分方程(PDEs)求解中缺乏鲁棒性的问题,提高 Physics-informed neural networks(PINNs)的精度和可靠性。
methods: 基于投影梯度下降(PGD)对抗攻击,提出了一种名为 AT-PINNs 的对抗训练策略,用于增强 PINNs 的鲁棒性。AT-PINNs 通过使用对抗样本微调模型,能够准确识别模型失效的位置,并使模型在训练过程中更加关注这些区域。
results: 将 AT-PINNs 应用于多个复杂的 PDEs,包括具有多尺度系数的椭圆方程、具有多峰解的泊松方程、具有尖锐解的 Burgers 方程以及 Allen-Cahn 方程。结果表明,AT-PINNs 能够有效地定位并缩小失效区域,并且适用于求解复杂的 PDEs,因为通过对抗攻击定位失效区域与失效区域的大小或分布的复杂程度无关。Abstract
Physics-informed neural networks have shown great promise in solving partial differential equations. However, due to insufficient robustness, vanilla PINNs often face challenges when solving complex PDEs, especially those involving multi-scale behaviors or solutions with sharp or oscillatory characteristics. To address these issues, based on the projected gradient descent adversarial attack, we proposed an adversarial training strategy for PINNs termed by AT-PINNs. AT-PINNs enhance the robustness of PINNs by fine-tuning the model with adversarial samples, which can accurately identify model failure locations and drive the model to focus on those regions during training. AT-PINNs can also perform inference with temporal causality by selecting the initial collocation points around temporal initial values. We implement AT-PINNs to the elliptic equation with multi-scale coefficients, Poisson equation with multi-peak solutions, Burgers equation with sharp solutions and the Allen-Cahn equation. The results demonstrate that AT-PINNs can effectively locate and reduce failure regions. Moreover, AT-PINNs are suitable for solving complex PDEs, since locating failure regions through adversarial attacks is independent of the size of failure regions or the complexity of the distribution.
摘要
物理信息神经网络(PINNs)在求解偏微分方程(PDEs)方面展现出巨大潜力。然而,由于鲁棒性不足,原始的 PINNs 在求解复杂 PDEs 时常常遇到困难,尤其是那些涉及多尺度行为、或解具有尖锐或振荡特征的方程。为了解决这些问题,我们基于投影梯度下降(PGD)对抗攻击,提出了一种名为 AT-PINNs 的对抗训练策略。AT-PINNs 通过使用对抗样本微调模型来增强 PINNs 的鲁棒性:对抗样本能够准确地识别模型失效的位置,并驱使模型在训练中重点关注这些区域。AT-PINNs 还可以通过在时间初值附近选取初始配置点,实现具有时间因果性的推理。我们将 AT-PINNs 应用于具有多尺度系数的椭圆方程、具有多峰解的泊松方程、具有尖锐解的 Burgers 方程以及 Allen-Cahn 方程。结果表明,AT-PINNs 能够有效地定位并缩小失效区域。此外,AT-PINNs 适用于求解复杂的 PDEs,因为通过对抗攻击定位失效区域与失效区域的大小或分布的复杂程度无关。
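A minimal sketch of the adversarial-training idea applied to a PINN for the 1-D Poisson problem u'' = -pi^2 sin(pi x): collocation points are pushed by a few signed-gradient (PGD-like) ascent steps toward high squared PDE residual before each training step. The network size, step sizes, and hard-coded boundary factor are illustrative assumptions, not the authors' implementation.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(x):
    """Residual of u''(x) = -pi^2 sin(pi x) with exact boundary conditions."""
    x = x.requires_grad_(True)
    u = x * (1 - x) * net(x)                      # enforces u(0) = u(1) = 0
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u + torch.pi ** 2 * torch.sin(torch.pi * x)

def adversarial_points(x, steps=3, step_size=5e-3):
    """PGD-style ascent on the squared residual to find hard collocation points."""
    x = x.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = pde_residual(x).pow(2).mean()
        g, = torch.autograd.grad(loss, x)
        x = (x + step_size * g.sign()).clamp(0.0, 1.0).detach()
    return x

for it in range(2000):
    x = torch.rand(128, 1)
    x_adv = adversarial_points(x)                 # focus training on failure regions
    opt.zero_grad()
    loss = pde_residual(x_adv).pow(2).mean()
    loss.backward()
    opt.step()

x_test = torch.linspace(0, 1, 101).reshape(-1, 1)
print(float(pde_residual(x_test).pow(2).mean()))
```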
NeuroCUT: A Neural Approach for Robust Graph Partitioning
results: NeuroCUT在实验中表现出色,能够找到高质量的分区,具有强大的泛化性,并且对图结构的修改具有鲁棒性。Abstract
Graph partitioning aims to divide a graph into $k$ disjoint subsets while optimizing a specific partitioning objective. The majority of formulations related to graph partitioning exhibit NP-hardness due to their combinatorial nature. As a result, conventional approximation algorithms rely on heuristic methods, sometimes with approximation guarantees and sometimes without. Unfortunately, traditional approaches are tailored for specific partitioning objectives and do not generalize well across other known partitioning objectives from the literature. To overcome this limitation, and learn heuristics from the data directly, neural approaches have emerged, demonstrating promising outcomes. In this study, we extend this line of work through a novel framework, NeuroCut. NeuroCut introduces two key innovations over prevailing methodologies. First, it is inductive to both graph topology and the partition count, which is provided at query time. Second, by leveraging a reinforcement learning based framework over node representations derived from a graph neural network, NeuroCut can accommodate any optimization objective, even those encompassing non-differentiable functions. Through empirical evaluation, we demonstrate that NeuroCut excels in identifying high-quality partitions, showcases strong generalization across a wide spectrum of partitioning objectives, and exhibits resilience to topological modifications.
摘要
图划分的目标是在优化特定划分目标的同时,将一个图划分为 k 个互不相交的子集。由于其组合性质,大多数与图划分相关的形式化问题都是 NP 难的。因此,传统的近似算法依赖启发式方法,有时带有近似保证,有时则没有。然而,传统方法往往是为特定的划分目标量身定制的,难以推广到文献中其他已知的划分目标。为了克服这一限制,并直接从数据中学习启发式规则,神经方法应运而生,并展现出可观的效果。在这项研究中,我们通过一个新的框架 NeuroCUT 来延续这一方向。NeuroCUT 相比现有方法有两项关键创新:第一,它对图拓扑和查询时给定的分区数目都具有归纳性;第二,通过在图神经网络得到的节点表示之上使用基于强化学习的框架,NeuroCUT 可以适配任何优化目标,甚至包括含不可微函数的目标。通过实验评估,我们证明 NeuroCUT 能够找到高质量的划分,在各种划分目标上表现出强大的泛化能力,并且对图结构的修改具有鲁棒性。
A Quasi-Wasserstein Loss for Learning Graph Neural Networks
results: 实验表明,提出的 QW 损失函数可以应用于多种 GNN 模型,并且能够提高其性能在节点级预测和回归任务中。此外,该损失函数还可以提供一种新的拟合学习和预测方法。Abstract
When learning graph neural networks (GNNs) in node-level prediction tasks, most existing loss functions are applied for each node independently, even if node embeddings and their labels are non-i.i.d. because of their graph structures. To eliminate such inconsistency, in this study we propose a novel Quasi-Wasserstein (QW) loss with the help of the optimal transport defined on graphs, leading to new learning and prediction paradigms of GNNs. In particular, we design a "Quasi-Wasserstein" distance between the observed multi-dimensional node labels and their estimations, optimizing the label transport defined on graph edges. The estimations are parameterized by a GNN in which the optimal label transport may determine the graph edge weights optionally. By reformulating the strict constraint of the label transport to a Bregman divergence-based regularizer, we obtain the proposed Quasi-Wasserstein loss associated with two efficient solvers learning the GNN together with optimal label transport. When predicting node labels, our model combines the output of the GNN with the residual component provided by the optimal label transport, leading to a new transductive prediction paradigm. Experiments show that the proposed QW loss applies to various GNNs and helps to improve their performance in node-level classification and regression tasks.
摘要
在节点级预测任务中学习图神经网络(GNN)时,大多数现有的损失函数都是对每个节点独立施加的,即便由于图结构的存在,节点嵌入及其标签并非独立同分布。为了消除这种不一致,本研究借助定义在图上的最优传输,提出了一种新的 Quasi-Wasserstein(QW)损失,由此带来 GNN 新的学习与预测范式。具体而言,我们定义了观测到的多维节点标签与其估计之间的 "Quasi-Wasserstein" 距离,并优化定义在图边上的标签传输。这些估计由一个 GNN 参数化,其中最优标签传输可以选择性地决定图边权重。通过将标签传输的严格约束改写为基于 Bregman 散度的正则项,我们得到了所提出的 Quasi-Wasserstein 损失,并给出了两种高效的求解器,用于联合学习 GNN 与最优标签传输。在预测节点标签时,我们的模型将 GNN 的输出与最优标签传输提供的残差分量相结合,形成了一种新的转导式(transductive)预测范式。实验表明,所提出的 QW 损失适用于多种 GNN,并有助于提升它们在节点级分类与回归任务中的性能。
Unintended Memorization in Large ASR Models, and How to Mitigate It
results: 在state-of-the-art ASR模型中发现了memorization问题,并通过gradient clipping来 Mitigate memorization。在大规模分布式训练中,clip each example’s gradient可以保持中性模型质量和计算成本,同时提供强的隐私保护。Abstract
It is well-known that neural networks can unintentionally memorize their training examples, causing privacy concerns. However, auditing memorization in large non-auto-regressive automatic speech recognition (ASR) models has been challenging due to the high compute cost of existing methods such as hardness calibration. In this work, we design a simple auditing method to measure memorization in large ASR models without the extra compute overhead. Concretely, we speed up randomly-generated utterances to create a mapping between vocal and text information that is difficult to learn from typical training examples. Hence, accurate predictions only for sped-up training examples can serve as clear evidence for memorization, and the corresponding accuracy can be used to measure memorization. Using the proposed method, we showcase memorization in the state-of-the-art ASR models. To mitigate memorization, we tried gradient clipping during training to bound the influence of any individual example on the final model. We empirically show that clipping each example's gradient can mitigate memorization for sped-up training examples with up to 16 repetitions in the training set. Furthermore, we show that in large-scale distributed training, clipping the average gradient on each compute core maintains neutral model quality and compute cost while providing strong privacy protection.
摘要
众所周知,神经网络可能会无意中记忆其训练样本,从而引发隐私问题。然而,由于难度校准等现有方法的计算开销很高,对大型非自回归自动语音识别(ASR)模型进行记忆审计一直比较困难。在这项工作中,我们设计了一种简单的审计方法,可以在不增加额外计算开销的情况下测量大型ASR模型的记忆程度。具体来说,我们对随机生成的语音进行加速,从而构造出一种难以从典型训练样本中学习到的语音与文本之间的映射。因此,只有在加速训练样本上才能做出的准确预测可以作为记忆的明确证据,相应的准确率也可用来度量记忆程度。利用所提方法,我们展示了最先进ASR模型中存在的记忆现象。为缓解记忆,我们尝试在训练中使用梯度裁剪来限制单个样本对最终模型的影响。实验表明,对每个样本的梯度进行裁剪,可以缓解训练集中重复多达16次的加速样本所引起的记忆。此外,我们还表明,在大规模分布式训练中,对每个计算核心上的平均梯度进行裁剪,可以在保持模型质量和计算成本基本不变的同时提供强有力的隐私保护。
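Two mechanics from the abstract lend themselves to a small illustration: creating sped-up utterances whose audio-text mapping is hard to learn except by memorization, and clipping each example's gradient to bound its influence. The sketch below shows both in isolation on toy data; the speed-up uses simple linear-interpolation resampling and the clipping loop is a generic per-example scheme, not the authors' exact training recipe.

```python
import numpy as np
import torch

def speed_up(waveform, factor=2.0):
    """Naively resample a 1-D waveform so it plays `factor` times faster."""
    n_out = int(len(waveform) / factor)
    idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(idx, np.arange(len(waveform)), waveform)

def clipped_gradient_step(model, loss_fn, batch_x, batch_y, clip_norm=1.0, lr=1e-3):
    """Accumulate per-example gradients, clipping each one before averaging."""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        scale = min(1.0, clip_norm / (norm.item() + 1e-12))   # bound each example's influence
        for g, p in zip(grads, model.parameters()):
            g += scale * p.grad
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            p -= lr * g / len(batch_x)

# toy usage
audio = np.sin(np.linspace(0, 100, 16000))
fast = speed_up(audio, factor=1.5)                 # auditing probe: sped-up utterance
model = torch.nn.Linear(8, 2)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
clipped_gradient_step(model, torch.nn.functional.cross_entropy, x, y)
```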
On the Evaluation of Generative Models in Distributed Learning Tasks
results: 论文发现在分布式学习任务中,使用FID和KID评估生成模型的结果可能不同:FID-avg和FID-all可能给出不一致的模型排名,而KID-avg和KID-all的排名则是一致的。Abstract
The evaluation of deep generative models including generative adversarial networks (GANs) and diffusion models has been extensively studied in the literature. While the existing evaluation methods mainly target a centralized learning problem with training data stored by a single client, many applications of generative models concern distributed learning settings, e.g. the federated learning scenario, where training data are collected by and distributed among several clients. In this paper, we study the evaluation of generative models in distributed learning tasks with heterogeneous data distributions. First, we focus on the Fr\'echet inception distance (FID) and consider the following FID-based aggregate scores over the clients: 1) FID-avg as the mean of clients' individual FID scores, 2) FID-all as the FID distance of the trained model to the collective dataset containing all clients' data. We prove that the model rankings according to the FID-all and FID-avg scores could be inconsistent, which can lead to different optimal generative models according to the two aggregate scores. Next, we consider the kernel inception distance (KID) and similarly define the KID-avg and KID-all aggregations. Unlike the FID case, we prove that KID-all and KID-avg result in the same rankings of generative models. We perform several numerical experiments on standard image datasets and training schemes to support our theoretical findings on the evaluation of generative models in distributed learning problems.
摘要
文章研究了深度生成模型(包括生成对抗网络)在分布式学习任务中的评价方法。现有评价方法主要针对中央式学习问题,即训练数据由单个客户端存储。然而,许多生成模型应用场景是分布式学习场景,例如联邦学习场景,其中训练数据由多个客户端分布存储。本文研究了分布式学习任务中各客户端数据分布不同的生成模型评价方法。首先,我们关注Fréchet吸引距离(FID),并考虑以下FID基于客户端的综合分数:1)FID-avg,即客户端个体FID分数的平均值,2)FID-all,即训练模型与所有客户端数据集的FID距离。我们证明了FID-all和FID-avg的模型排名可能不一致,可能导致不同的优化生成模型。接下来,我们考虑核心吸引距离(KID),并定义KID-avg和KID-all综合分数。与FID不同的是,我们证明了KID-all和KID-avg的模型排名是一致的。我们在标准图像集和训练方案上进行了多个数值实验来支持我们的理论发现。
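The two aggregate scores contrasted in the abstract are easy to state concretely. Assuming features are summarized by per-client Gaussians, the sketch below computes the closed-form Fréchet distance, then FID-avg as the mean of per-client scores and FID-all against the pooled data; the toy client data and feature dimension are placeholders, not the paper's experimental setup.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, cov1, mu2, cov2):
    """Closed-form FID between two Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))

def stats(x):
    return x.mean(axis=0), np.cov(x, rowvar=False)

rng = np.random.default_rng(0)
clients = [rng.normal(loc=c, size=(500, 4)) for c in (-2.0, 0.0, 2.0)]   # heterogeneous clients
generated = rng.normal(loc=0.0, size=(1500, 4))                           # samples from a trained model

mu_g, cov_g = stats(generated)
fid_avg = np.mean([frechet_distance(mu_g, cov_g, *stats(c)) for c in clients])
fid_all = frechet_distance(mu_g, cov_g, *stats(np.vstack(clients)))
print(f"FID-avg = {fid_avg:.2f}, FID-all = {fid_all:.2f}")   # the two can rank models differently
```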
Learning under Label Proportions for Text Classification
paper_authors: Jatin Chauhan, Xiaoxuan Wang, Wei Wang
for: 本研究旨在探讨在 Learning from Label Proportions(LLP)这一具有挑战性的设置下进行 NLP 模型训练,其数据以汇总(袋)形式提供,仅以每个类别的样本比例作为 ground truth。
methods: 本研究提出了一种新的、更为鲁棒的形式化方法,并给出了可学习性结果,为 LLP 提供了泛化界;此外,该方法还结合了一种自监督目标函数。
results: 实验结果表明,在涵盖长文本与短文本的大规模模型配置中,该方法在约 87% 的实验配置和多项指标上均优于基线方法。Abstract
We present one of the preliminary NLP works under the challenging setup of Learning from Label Proportions (LLP), where the data is provided in an aggregate form called bags and only the proportion of samples in each class as the ground truth. This setup is inline with the desired characteristics of training models under Privacy settings and Weakly supervision. By characterizing some irregularities of the most widely used baseline technique DLLP, we propose a novel formulation that is also robust. This is accompanied with a learnability result that provides a generalization bound under LLP. Combining this formulation with a self-supervised objective, our method achieves better results as compared to the baselines in almost 87% of the experimental configurations which include large scale models for both long and short range texts across multiple metrics.
摘要
我们介绍了一项初步的自然语言处理(NLP)工作,在“学习从标签含量(LLP)”的挑战性设置下进行训练,其中数据提供在归一化的形式下,即袋(bag),并且只有每个类别的样本占总数的比例作为真实的地面信息。这种设置符合训练模型下的隐私设置和弱监督。我们对最常用的基线技术DLLP的不规则性进行描述,并提出了一种新的形式ulation,这种形式ulation具有 robustness。此外,我们还提供了一个learnability result,它在LLP下提供了一个通用的泛化 bound。将这种形式ulation与一种自我超vised目标函数相结合,我们的方法在大规模的实验配置中(包括长文本和短文本) across multiple metrics Achieves better results than baselines in nearly 87% of the cases.
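For readers unfamiliar with the LLP setup, the widely used DLLP-style baseline that the abstract characterizes and improves on trains against bag-level label proportions rather than per-instance labels. Below is a hedged sketch of that baseline loss only (not the paper's new formulation); the bag construction, model, and class count are toy assumptions.

```python
import torch
import torch.nn.functional as F

def dllp_style_loss(logits, bag_proportions, eps=1e-8):
    """KL divergence between the given bag proportions and the bag-averaged prediction."""
    avg_pred = F.softmax(logits, dim=-1).mean(dim=0)      # average prediction over the bag
    return torch.sum(bag_proportions * torch.log((bag_proportions + eps) / (avg_pred + eps)))

# toy usage: a bag of 16 instances, 3 classes, only the class proportions are known
model = torch.nn.Linear(32, 3)
bag_x = torch.randn(16, 32)
proportions = torch.tensor([0.5, 0.25, 0.25])
loss = dllp_style_loss(model(bag_x), proportions)
loss.backward()
```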
results: 在异常点识别和医学影像数据集上,与标准DAM训练方法相比,提出的AUC-mixup方法显示出更高的泛化性能。Abstract
While deep AUC maximization (DAM) has shown remarkable success on imbalanced medical tasks, e.g., chest X-rays classification and skin lesions classification, it could suffer from severe overfitting when applied to small datasets due to its aggressive nature of pushing prediction scores of positive data away from that of negative data. This paper studies how to improve generalization of DAM by mixup data augmentation -- an approach that is widely used for improving generalization of the cross-entropy loss based deep learning methods. However, AUC is defined over positive and negative pairs, which makes it challenging to incorporate mixup data augmentation into DAM algorithms. To tackle this challenge, we employ the AUC margin loss and incorporate soft labels into the formulation to effectively learn from data generated by mixup augmentation, which is referred to as the AUC-mixup loss. Our experimental results demonstrate the effectiveness of the proposed AUC-mixup methods on imbalanced benchmark and medical image datasets compared to standard DAM training methods.
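Because AUC is defined over positive-negative pairs, plugging mixup's soft labels into a pairwise surrogate is the crux of the approach described above. The sketch below shows one way to weight a squared-hinge pairwise margin surrogate by soft labels produced by mixup; it is a simplified stand-in rather than the paper's exact AUC-mixup loss, and the margin, model, and toy imbalanced batch are assumptions.

```python
import torch

def mixup(x, y, alpha=0.2):
    """Standard mixup: convex combinations of inputs and (soft) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def soft_pairwise_auc_margin_loss(scores, soft_labels, margin=1.0):
    """Squared-hinge AUC surrogate where pair (i, j) is weighted by p_i * (1 - p_j)."""
    w = soft_labels.unsqueeze(1) * (1 - soft_labels).unsqueeze(0)   # "positivity" of i times "negativity" of j
    hinge = (margin - (scores.unsqueeze(1) - scores.unsqueeze(0))).clamp(min=0)
    return (w * hinge ** 2).sum() / w.sum().clamp(min=1e-8)

# toy usage on an imbalanced batch (~10% positives)
model = torch.nn.Linear(16, 1)
x = torch.randn(32, 16)
y = (torch.rand(32) < 0.1).float()
x_mix, y_mix = mixup(x, y)
loss = soft_pairwise_auc_margin_loss(model(x_mix).squeeze(-1), y_mix)
loss.backward()
```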
Deep learning based on Transformer architecture for power system short-term voltage stability assessment with class imbalance
results: 数据测试表明,提出的方法在面临100:1的数据不均衡和噪音环境时仍然保持了稳定的性能,并且在增加可再生能源的情况下也保持了一致的效果。比较结果表明,CWGAN-GP生成的数据更具备均衡性,而StaaT也超过了其他深度学习算法。这种方法可以应用于实际短时电压稳定评估中,frequently face着数据不均衡和噪音挑战。Abstract
Most existing data-driven power system short-term voltage stability assessment (STVSA) approaches presume class-balanced input data. However, in practical applications, the occurrence of short-term voltage instability following a disturbance is minimal, leading to a significant class imbalance problem and a consequent decline in classifier performance. This work proposes a Transformer-based STVSA method to address this challenge. By utilizing the basic Transformer architecture, a stability assessment Transformer (StaaT) is developed {as a classification model to reflect the correlation between the operational states of the system and the resulting stability outcomes}. To combat the negative impact of imbalanced datasets, this work employs a conditional Wasserstein generative adversarial network with gradient penalty (CWGAN-GP) for synthetic data generation, aiding in the creation of a balanced, representative training set for the classifier. Semi-supervised clustering learning is implemented to enhance clustering quality, addressing the lack of a unified quantitative criterion for short-term voltage stability. {Numerical tests on the IEEE 39-bus test system extensively demonstrate that the proposed method exhibits robust performance under class imbalances up to 100:1 and noisy environments, and maintains consistent effectiveness even with an increased penetration of renewable energy}. Comparative results reveal that the CWGAN-GP generates more balanced datasets than traditional oversampling methods and that the StaaT outperforms other deep learning algorithms. This study presents a compelling solution for real-world STVSA applications that often face class imbalance and data noise challenges.
摘要
现有的数据驱动电力系统短期电压稳定评估(STVSA)方法大多假设输入数据具有均衡的分布。然而,在实际应用中,短期电压不稳定的发生率很低,导致数据分布受到很大的偏好问题,从而导致分类器性能下降。这项工作提出了一种基于Transformer的STVSA方法来解决这个挑战。通过利用基本Transformer架构,我们开发了一种稳定评估Transformer(StaaT),用于反映系统运行状态和导致的稳定结果之间的相关性。为了解决偏好数据的负面影响,这项工作采用了 conditional Wasserstein生成敌方网络(CWGAN-GP) для生成人工数据,以帮助创建一个均衡、代表性的训练集 для分类器。 semi-supervised clustering learning 技术被应用以提高归一化质量,因为没有短期电压稳定的准确量标准。 {numeraire tests on the IEEE 39-bus test system extensively demonstrate that the proposed method exhibits robust performance under class imbalances up to 100:1 and noisy environments, and maintains consistent effectiveness even with an increased penetration of renewable energy}. comparative results reveal that the CWGAN-GP generates more balanced datasets than traditional oversampling methods and that the StaaT outperforms other deep learning algorithms. this study presents a compelling solution for real-world STVSA applications that often face class imbalance and data noise challenges.
Subject-specific Deep Neural Networks for Count Data with High-cardinality Categorical Features
results: 通过实验和实际数据分析,证明了提议方法的优势,包括提高预测性能和学习效率,以及适用于高纬度ategorical特征的分布数据处理。Abstract
There is a growing interest in subject-specific predictions using deep neural networks (DNNs) because real-world data often exhibit correlations, which has been typically overlooked in traditional DNN frameworks. In this paper, we propose a novel hierarchical likelihood learning framework for introducing gamma random effects into the Poisson DNN, so as to improve the prediction performance by capturing both nonlinear effects of input variables and subject-specific cluster effects. The proposed method simultaneously yields maximum likelihood estimators for fixed parameters and best unbiased predictors for random effects by optimizing a single objective function. This approach enables a fast end-to-end algorithm for handling clustered count data, which often involve high-cardinality categorical features. Furthermore, state-of-the-art network architectures can be easily implemented into the proposed h-likelihood framework. As an example, we introduce multi-head attention layer and a sparsemax function, which allows feature selection in high-dimensional settings. To enhance practical performance and learning efficiency, we present an adjustment procedure for prediction of random parameters and a method-of-moments estimator for pretraining of variance component. Various experiential studies and real data analyses confirm the advantages of our proposed methods.
摘要
有越来越多的研究者对特定领域预测使用深度神经网络(DNN),因为实际数据经常具有相关性,传统的DNN框架中通常会忽略这些相关性。在这篇论文中,我们提出了一种新的层次可能性学习框架,以在Poisson DNN中引入γ随机效应,以提高预测性能,同时捕捉输入变量的非线性效应和特定颗集效应。我们的方法同时实现最大可能性估计器和不偏预测器,通过优化单个目标函数。这种方法使得可以快速处理受集分布的端到端算法,这些分布frequently包含高cardinality的分类特征。此外,我们可以轻松地将当前的网络架构 integrate into our proposed h-likelihood framework。例如,我们引入多头注意层和简洁最大化函数,这些功能允许在高维度设置中进行特征选择。为了提高实际性和学习效率,我们提出了预测随机参数的调整方法和预测变量组件的方法-of-moments估计器。多种实验和实际数据分析证明了我们的提出的方法的优势。
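The h-likelihood idea in the abstract jointly optimizes the network's fixed effects and the cluster-specific random effects through a single objective. Below is a much-simplified sketch of that idea: a Poisson DNN with a multiplicative gamma-distributed cluster effect, where the negative h-likelihood (Poisson log-likelihood plus the gamma log-density of the random effects) is minimized jointly. The network size, the unit-mean Gamma(a, a) parameterization, and the toy data are assumptions, and the paper's adjustment procedure and method-of-moments pretraining are omitted.

```python
import torch

def neg_h_likelihood(log_mu, y, cluster, log_v, alpha=2.0):
    """-[ Poisson log-lik with multiplicative random effect v_c  +  Gamma(a, a) log-density of v ]."""
    v = torch.exp(log_v)                        # one positive random effect per cluster
    rate = torch.exp(log_mu) * v[cluster]       # mu_i * v_{c(i)}
    poisson_ll = (y * torch.log(rate) - rate).sum()              # dropping the log(y!) constant
    gamma_ll = ((alpha - 1) * torch.log(v) - alpha * v).sum()    # Gamma(alpha, alpha), up to constants
    return -(poisson_ll + gamma_ll)

# toy usage: counts from 3 clusters, a small network for the fixed effects
net = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
x = torch.randn(300, 5)
cluster = torch.randint(0, 3, (300,))
y = torch.poisson(torch.ones(300) * 2.0)
log_v = torch.zeros(3, requires_grad=True)      # subject/cluster-specific random effects
opt = torch.optim.Adam(list(net.parameters()) + [log_v], lr=1e-2)
for _ in range(200):
    loss = neg_h_likelihood(net(x).squeeze(-1), y, cluster, log_v)
    opt.zero_grad(); loss.backward(); opt.step()
print(torch.exp(log_v).detach())                # estimated cluster effects
```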
Free-text Keystroke Authentication using Transformers: A Comparative Study of Architectures and Loss Functions
results: 这个研究发现,使用 bi-encoder 架构、batch-all triplet 损失函数和余弦距离度量可以实现最佳性能,Equal Error Rate 为0.0186%。此外,还探讨了不同的相似度评估方法,以进一步提高精度。Abstract
Keystroke biometrics is a promising approach for user identification and verification, leveraging the unique patterns in individuals' typing behavior. In this paper, we propose a Transformer-based network that employs self-attention to extract informative features from keystroke sequences, surpassing the performance of traditional Recurrent Neural Networks. We explore two distinct architectures, namely bi-encoder and cross-encoder, and compare their effectiveness in keystroke authentication. Furthermore, we investigate different loss functions, including triplet, batch-all triplet, and WDCL loss, along with various distance metrics such as Euclidean, Manhattan, and cosine distances. These experiments allow us to optimize the training process and enhance the performance of our model. To evaluate our proposed model, we employ the Aalto desktop keystroke dataset. The results demonstrate that the bi-encoder architecture with batch-all triplet loss and cosine distance achieves the best performance, yielding an exceptional Equal Error Rate of 0.0186%. Furthermore, alternative algorithms for calculating similarity scores are explored to enhance accuracy. Notably, the utilization of a one-class Support Vector Machine reduces the Equal Error Rate to an impressive 0.0163%. The outcomes of this study indicate that our model surpasses the previous state-of-the-art in free-text keystroke authentication. These findings contribute to advancing the field of keystroke authentication and offer practical implications for secure user verification systems.
摘要
“键盘生物метри学是一种有前途的方法 для用户识别和验证,利用个人键盘实习独特的模式。在本研究中,我们提出了基于Transformer的网络,使用自我对项来提取键盘序列中有用的特征,超越传统的Recurrent Neural Networks的表现。我们探索了两种不同的架构,分别是双向encoder和cross-encoder,并比较它们在键盘验证中的效果。此外,我们寻找了不同的损失函数,包括三重、批量三重和WDCL损失函数,以及不同的距离度量,如Euclidean、曼哈顿和内角距离。这些实验允许我们优化训练过程,提高模型的性能。为了评估我们的提案模型,我们使用了阿尔托桌面键盘数据集。结果显示,双向encoder架构加 batch-all triplet损失函数和内角距离可以取得最佳性能,具体Equla Error Rate为0.0186%。此外,我们还探索了不同的相似度计算算法,以提高准确性。例如,使用一个一阶支持向量机可以降低Equla Error Rate至0.0163%。研究结果显示,我们的模型超越了过去的州际前进于自由文本键盘验证。这些发现对于键盘验证领域的进步做出了贡献,并且提供了实际的应用于安全用户验证系统。”
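The best-performing configuration reported above pairs a bi-encoder with a batch-all triplet loss under cosine distance. As a point of reference, here is a generic batch-all triplet loss over L2-normalized embeddings; the margin, embedding size, and toy batch are assumptions, and the transformer encoder that produces the embeddings is omitted.

```python
import torch
import torch.nn.functional as F

def batch_all_triplet_loss(embeddings, labels, margin=0.3):
    """Average hinge loss over all valid (anchor, positive, negative) triplets, cosine distance."""
    emb = F.normalize(embeddings, dim=-1)
    dist = 1.0 - emb @ emb.t()                      # pairwise cosine distance matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    n = labels.size(0)
    loss, count = embeddings.sum() * 0.0, 0
    for a in range(n):
        for p in range(n):
            if a == p or not same[a, p]:
                continue
            for neg in range(n):
                if same[a, neg]:
                    continue
                l = (dist[a, p] - dist[a, neg] + margin).clamp(min=0)
                if l > 0:                            # batch-all: average over active triplets
                    loss, count = loss + l, count + 1
    return loss / max(count, 1)

# toy usage: 8 keystroke-sequence embeddings from 4 users
emb = torch.randn(8, 64, requires_grad=True)
users = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
batch_all_triplet_loss(emb, users).backward()
```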
paper_authors: M. Rahmani Dehaghani, Atieh Sahraeidolatkhaneh, Morgan Nilsen, Fredrik Sikström, Pouyan Sajadi, Yifan Tang, G. Gary Wang
for: This paper focuses on developing a parameter-signature-property modeling and control approach to enhance the quality of additively manufactured parts using hot-wire directed energy deposition with a laser beam (DED-LB/w).
methods: The paper employs a dynamic modeling approach to investigate the relationship between process parameters and melt pool width, as well as a fully connected artificial neural network to predict the final part property (bead width) based on melt pool signatures.
results: The proposed parameter-signature-property modeling approach shows clear advantages in controlling the width of the part compared to a control loop with only process signature (melt pool width) information. The approach has the potential to be applied to control other part properties that cannot be directly measured or monitored in situ.Abstract
Hot-wire directed energy deposition using a laser beam (DED-LB/w) is a method of metal additive manufacturing (AM) that has benefits of high material utilization and deposition rate, but parts manufactured by DED-LB/w suffer from a substantial heat input and undesired surface finish. Hence, monitoring and controlling the process parameters and signatures during the deposition is crucial to ensure the quality of final part properties and geometries. This paper explores the dynamic modeling of the DED-LB/w process and introduces a parameter-signature-property modeling and control approach to enhance the quality of modeling and control of part properties that cannot be measured in situ. The study investigates different process parameters that influence the melt pool width (signature) and bead width (property) in single and multi-layer beads. The proposed modeling approach utilizes a parameter-signature model as F_1 and a signature-property model as F_2. Linear and nonlinear modeling approaches are compared to describe a dynamic relationship between process parameters and a process signature, the melt pool width (F_1). A fully connected artificial neural network is employed to model and predict the final part property, i.e., bead width, based on melt pool signatures (F_2). Finally, the effectiveness and usefulness of the proposed parameter-signature-property modeling is tested and verified by integrating the parameter-signature (F_1) and signature-property (F_2) models in the closed-loop control of the width of the part. Compared with the control loop with only F_1, the proposed method shows clear advantages and bears potential to be applied to control other part properties that cannot be directly measured or monitored in situ.
摘要
热束导电能量沉积使用激光束(DED-LB/w)是一种金属添加生产(AM)的方法,它具有高材料利用率和沉积速率的优点,但是制造出来的部件受到了大量的热输入和不想要的表面镀层。因此,对沉积过程参数和特征的监测和控制是至关重要,以确保最终部件的性能和几何尺寸。本文研究了DED-LB/w процесс的动态模型化,并提出了参数-特征-性能模型控制方法,以提高模型和控制不可直接测量或监测的部件性能的能力。研究表示,不同的处理参数对沉积过程中的溶融池宽度(特征)和束宽度(性能)的影响。对于单层和多层束,提出了参数-特征模型和特征-性能模型两种模型方法。使用全连接人工神经网络模型和预测最终部件性能,基于溶融池特征。最后,通过将参数-特征模型和特征-性能模型在关闭控制 loop中集成,证明了提posed方法的效iveness和实用性。相比只使用参数-特征模型控制 loop,提posed方法显示了明显的优势,并可以应用于控制其他不可直接测量或监测的部件性能。
Denoising total scattering data using Compressed Sensing
results: 该论文表明,通过使用压缩感知技术,可以将单个Diffraction测量转化为一个有效无限多的虚拟测量,从而实现超分辨率成像。Abstract
To obtain the best resolution for any measurement there is an ever-present challenge to achieve maximal differentiation between signal and noise over as fine of sampling dimensions as possible. In diffraction science these issues are particularly pervasive when analyzing small crystals, systems with diffuse scattering, or other systems in which the signal of interest is extremely weak and incident flux and instrument time is limited. We here demonstrate that the tool of compressed sensing, which has successfully been applied to photography, facial recognition, and medical imaging, can be effectively applied to diffraction images to dramatically improve the signal-to-noise ratio (SNR) in a data-driven fashion without the need for additional measurements or modification of existing hardware. We outline a technique that leverages compressive sensing to bootstrap a single diffraction measurement into an effectively arbitrary number of virtual measurements, thereby providing a means of super-resolution imaging.
摘要
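The abstract does not spell out the recovery step, but compressed-sensing reconstructions of this kind typically solve a sparsity-regularized inverse problem from a small number of noisy measurements. The sketch below is a generic ISTA (iterative soft-thresholding) solver on a toy 1-D sparse signal, included only to make the "few measurements, many virtual ones" intuition concrete; the random sensing matrix, sparsity level, and regularization weight are assumptions and are not the authors' diffraction-specific setup.

```python
import numpy as np

def ista(A, y, lam=0.05, n_iter=500):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return x

rng = np.random.default_rng(1)
n, m, k = 256, 64, 8                                   # signal length, measurements, nonzeros
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)               # random sensing matrix
y = A @ x_true + 0.01 * rng.normal(size=m)             # few noisy measurements
x_hat = ista(A, y)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```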
results: 通过使用开源的 Samsung S7 ISP 和 MIT-Adobe FiveK 数据集,实现了高达 99.6% 的检测精度,False Positives 低于 0.6%,并在 70% 损害的图像中实现了平均像素误差在 1.5% 之间的修复。Abstract
Efficient and effective on-line detection and correction of bad pixels can improve yield and increase the expected lifetime of image sensors. This paper presents a comprehensive Deep Learning (DL) based on-line detection-correction approach, suitable for a wide range of pixel corruption rates. A confidence calibrated segmentation approach is introduced, which achieves nearly perfect bad pixel detection, even with few training samples. A computationally light-weight correction algorithm is proposed for low rates of pixel corruption, that surpasses the accuracy of traditional interpolation-based techniques. We also propose an autoencoder based image reconstruction approach which alleviates the need for prior bad pixel detection and yields promising results for high rates of pixel corruption. Unlike previous methods, which use proprietary images, we demonstrate the efficacy of the proposed methods on the open-source Samsung S7 ISP and MIT-Adobe FiveK datasets. Our approaches yield up to 99.6% detection accuracy with <0.6% false positives and corrected images within 1.5% average pixel error from 70% corrupted images.
摘要
高效和有效的在线检测和修正坏像素可以提高图像传感器的产量和预期的寿命。这篇论文提出了一种基于深度学习(DL)的全面在线检测修正方法,适用于各种坏像素损害率。我们引入了一种决度规则化的分割方法,可以在少量训练样本下达到几乎完美的坏像素检测效果。我们还提出了一种 Computational 轻量级的修正算法,可以在低坏像素率下超越传统的 interpolate-based 技术。此外,我们还提出了一种基于 autoencoder 的图像重建方法,可以消除先前的坏像素检测,并且在高坏像素率下实现了出色的 результаados。与先前的方法不同,我们使用开源的 Samsung S7 ISP 和 MIT-Adobe FiveK 数据集来证明方法的可行性。我们的方法可以达到 99.6% 的检测精度,False Positives <0.6%,并且在 70% 损害的图像上修正了 <1.5% 的平均像素误差。
results: 本研究提出了一种新的快速频率估计方程,可以在单相系统中实现高精度的频率估计。数值示例表明,该方程在不均匀、单相系统中具有高精度和稳定性。Abstract
The paper discusses the relationships between electrical quantities, namely voltages and frequency, and affine differential geometry ones, namely affine arc length and curvature. Moreover, it establishes a link between frequency and time derivatives of voltage, through the utilization of affine differential geometry invariants. Based on this link, a new instantaneous frequency estimation formula is proposed, which is particularly suited for unbalanced systems. An application of the proposed formula to single-phase systems is also provided. Several numerical examples based on balanced, unbalanced, as well as single-phase systems illustrate the findings of the paper.
摘要
文章讨论了电气量(电压和频率)与仿射微分几何量(仿射弧长和曲率)之间的关系,并通过仿射微分几何不变量将频率与电压的时间导数联系起来。基于这一联系,文章提出了一种新的瞬时频率估计公式,特别适用于不平衡系统,并给出了其在单相系统中的应用。多个基于平衡、不平衡及单相系统的数值算例说明了文章的结论。
Channel Estimation via Loss Field: Accurate Site-Trained Modeling for Shadowing Prediction
methods: 该论文提出了一种新的通道模型,即Channel Estimation using Loss Field(CELF),该模型使用了部署在地区的通道损失测量数据和bayesian线性回归方法来估算地区具有特定损失场的loss field。
results: 论文使用了广泛的测量数据显示,CELF可以降低通道估计的方差 by up to 56%,并且在 variance reduction和训练效率方面超过了3种popular机器学习方法。Abstract
Future mobile ad hoc networks will share spectrum between many users. Channels will be assigned on the fly to guarantee signal and interference power requirements for requested links. Channel losses must be re-estimated between many pairs of users as they move and as environmental conditions change. Computational complexity must be low, precluding the use of some accurate but computationally intensive site-specific channel models. Channel model errors must be low, precluding the use of standard statistical channel models. We propose a new channel model, CELF, which uses channel loss measurements from a deployed network in the area and a Bayesian linear regression method to estimate a site-specific loss field for the area. The loss field is explainable as the site's 'shadowing' of the radio propagation across the area of interest, but it requires no site-specific terrain or building information. Then, for any arbitrary pair of transmitter and receiver positions, CELF sums the loss field near the link line to estimate its channel loss. We use extensive measurements to show that CELF lowers the variance of channel estimates by up to 56%. It outperforms 3 popular machine learning methods in variance reduction and training efficiency.
摘要
未来的移动广播网络将共享多个用户的频率谱,为请求链接确保信号和干扰电磁谱的功率要求。在多个用户之间移动和环境条件发生变化时,通道将在实时基础上分配。由于计算复杂性需要低,因此排除了一些精度高但计算复杂度高的站点特定通道模型。通道模型错误也需要低,因此排除了标准的统计学通道模型。我们提出了一种新的通道模型,即 CEFL,它使用已部署网络中的通道损失测量和 bayesian 线性回归方法来估算区域特定的损失场。这个损失场可以解释为当地的“遮挡”,但无需站点特定的地形或建筑信息。然后,为任意传输器和接收器位置对,CEFL将近邻链接线上的损失场总和来估算链接损失。我们使用了广泛的测量数据表明,CEFL可以降低通道估计的方差,最多降低56%。同时,它在 variance 降低和训练效率上比3种受欢迎的机器学习方法表现更好。
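The core estimator described above is a Bayesian linear regression from measured link losses to a gridded loss field, where each measured link "sees" the grid cells near its line. The sketch below builds a crude version of that design matrix (cells within a fixed distance of the tx-rx segment) and computes the standard Gaussian-prior posterior mean; the grid size, prior and noise variances, and the toy measurements are assumptions, and CELF's actual geometric weighting is more refined than this indicator footprint.

```python
import numpy as np

def link_features(tx, rx, grid, radius=1.0):
    """Weight each grid cell by whether it lies near the tx-rx segment (crude shadowing footprint)."""
    d = rx - tx
    t = np.clip(((grid - tx) @ d) / (d @ d + 1e-12), 0.0, 1.0)
    dist = np.linalg.norm(grid - (tx + t[:, None] * d), axis=1)    # cell distance to the segment
    return (dist < radius).astype(float)

def posterior_loss_field(X, y, noise_var=1.0, prior_var=10.0):
    """Posterior mean of w for y = X w + noise with an isotropic Gaussian prior on w."""
    A = X.T @ X / noise_var + np.eye(X.shape[1]) / prior_var
    return np.linalg.solve(A, X.T @ y / noise_var)

rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)                  # 10x10 grid of cells
links = rng.uniform(0, 10, size=(200, 2, 2))                       # 200 measured tx/rx pairs
X = np.stack([link_features(tx, rx, grid) for tx, rx in links])
y = rng.normal(size=200)                                           # stand-in excess-loss measurements
w = posterior_loss_field(X, y)                                     # estimated site loss field
# a new link's shadowing estimate sums the field along its line
print(link_features(np.array([0.0, 0.0]), np.array([9.0, 9.0]), grid) @ w)
```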
Measuring Thermal Profiles in High Explosives using Neural Networks
results: 通过对实验和 simulations中的数据进行分析,本研究发现了一种可以评估高爆物质的安全状况的方法,并且可以在各种应用场景中提供温度profile的内部测量。研究还发现,使用更多的声学Receiver和更高的温度预测分辨率可以提高算法的准确性。Abstract
We present a new method for calculating the temperature profile in high explosive (HE) material using a Convolutional Neural Network (CNN). To train/test the CNN, we have developed a hybrid experiment/simulation method for collecting acoustic and temperature data. We experimentally heat cylindrical containers of HE material until detonation/deflagration, where we continuously measure the acoustic bursts through the HE using multiple acoustic transducers lined around the exterior container circumference. However, measuring the temperature profile in the HE in experiment would require inserting a high number of thermal probes, which would disrupt the heating process. Thus, we use two thermal probes, one at the HE center and one at the wall. We then use finite element simulation of the heating process to calculate the temperature distribution, and correct the simulated temperatures based on the experimental center and wall temperatures. We calculate temperature errors on the order of 15{\deg}C, which is approximately 12% of the range of temperatures in the experiment. We also investigate how the algorithm accuracy is affected by the number of acoustic receivers used to collect each measurement and the resolution of the temperature prediction. This work provides a means of assessing the safety status of HE material, which cannot be achieved using existing temperature measurement methods. Additionally, it has implications for range of other applications where internal temperature profile measurements would provide critical information. These applications include detecting chemical reactions, observing thermodynamic processes like combustion, monitoring metal or plastic casting, determining the energy density in thermal storage capsules, and identifying abnormal battery operation.
摘要
我们提出了一种新的方法来计算高爆物(HE)材料中的温度分布,使用卷积神经网络(CNN)。为了训练/测试CNN,我们开发了一种混合实验/模拟方法来收集振荡和温度数据。我们通过对HE材料中的圆柱形容器进行热处理,直到发生激发/燃烧,并在HE表面附近安装多个声学传感器来记录振荡。但是,在实验中测量HE材料中的温度分布需要插入大量的热度探针,这会对热处理进行干扰。因此,我们使用了两个热度探针,一个位于HE的中心和一个位于容器壁上。我们然后使用HE材料的热处理的数学模拟来计算温度分布,并根据实验中心和壁温度进行修正。我们计算的温度误差在15℃之间,相当于实验中温度范围的12%。我们还研究了使用声学传感器来收集测量数据的数量和分辨率如何影响算法的准确性。这项工作为HE材料的安全状况评估提供了一种新的方法,同时也对其他应用有着潜在的影响。这些应用包括检测化学反应、观察燃烧过程、监测金属或塑料铸造、测量热存储囊中的能量密度、并识别异常电池运行。
Ordered Reliability Direct Error Pattern Testing Decoding Algorithm
paper_authors: Reza Hadavian, Xiaoting Huang, Dmitri Truhachev, Kamal El-Sankary, Hamid Ebrahimzad, Hossein Najafi
for: 这篇论文是为了提出一种新的通用软决策解码算法,用于二进制块编码。
methods: 该算法使用ordered reliability direct error pattern testing(ORDEPT)技术,并对各种流行的短高速编码进行了测试,结果显示ORDEPT在与相同复杂性的其他解码算法相比,具有较低的解码错误概率和延迟。
results: 该paper的结果表明,ORDEPT可以高效地查找多个候选码word,并在迭代解码中提高产生软输出的能力。Abstract
We introduce a novel universal soft-decision decoding algorithm for binary block codes called ordered reliability direct error pattern testing (ORDEPT). Our results, obtained for a variety of popular short high-rate codes, demonstrate that ORDEPT outperforms state-of-the-art decoding algorithms of comparable complexity such as ordered reliability bits guessing random additive noise decoding (ORBGRAND) in terms of the decoding error probability and latency. The improvements carry on to the iterative decoding of product codes and convolutional product-like codes, where we present a new adaptive decoding algorithm and demonstrate the ability of ORDEPT to efficiently find multiple candidate codewords to produce soft output.
摘要
我们介绍了一种新的通用软决策解码算法,即顺序可靠性直接错误模式测试(ORDEPT),用于二进制块码。我们的结果,在各种受欢迎的短高速码中,示出了ORDEPT比同等复杂度的批量解码算法,如顺序可靠性位元随机加速错误推测解码(ORBGRAND),在解码错误probability和延迟方面表现更好。这些改进继续延伸到产生转换码和几何产生码的迭代解码中,我们提出了一个新的适应解码算法,并证明了ORDEPT可以高效地找到多个候选码word来生成软出力。
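ORDEPT belongs to the family of decoders that rank candidate error patterns by the reliability ordering of the received bits and test them until a valid codeword is found. The sketch below is a bare-bones GRAND-style tester for a (7,4) Hamming code, shown for orientation only and not the ORDEPT scheduling itself: it sorts bits by |LLR|, tries error patterns that flip the least reliable positions first, and stops at the first pattern whose syndrome vanishes.

```python
import numpy as np
from itertools import combinations

H = np.array([[1, 1, 0, 1, 1, 0, 0],     # parity-check matrix of a (7,4) Hamming code
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def reliability_ordered_decode(llr, max_weight=3):
    """Test error patterns over the least reliable bits, in increasing pattern weight."""
    hard = (llr < 0).astype(int)                    # hard decisions from the LLRs
    order = np.argsort(np.abs(llr))                 # least reliable positions first
    for w in range(max_weight + 1):
        for pos in combinations(order[:5], w):      # restrict flips to the 5 least reliable bits
            cand = hard.copy()
            cand[list(pos)] ^= 1
            if not (H @ cand % 2).any():            # zero syndrome -> valid codeword found
                return cand
    return hard                                     # decoding failure: fall back to hard decision

# toy usage: all-zero codeword sent, one unreliable bit received in error
llr = np.array([4.0, 3.5, -0.4, 5.0, 2.8, 3.9, 4.4])
print(reliability_ordered_decode(llr))              # expected: the all-zero codeword
```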
One-Bit Byzantine-Tolerant Distributed Learning via Over-the-Air Computation
results: 研究人员通过分析和验证了该框架在Byzantine攻击和无线环境下的性能,并证明了其在分布式学习中的稳定性和可靠性。Abstract
Distributed learning has become a promising computational parallelism paradigm that enables a wide scope of intelligent applications from the Internet of Things (IoT) to autonomous driving and the healthcare industry. This paper studies distributed learning in wireless data center networks, which contain a central edge server and multiple edge workers to collaboratively train a shared global model and benefit from parallel computing. However, the distributed nature causes the vulnerability of the learning process to faults and adversarial attacks from Byzantine edge workers, as well as the severe communication and computation overhead induced by the periodical information exchange process. To achieve fast and reliable model aggregation in the presence of Byzantine attacks, we develop a signed stochastic gradient descent (SignSGD)-based Hierarchical Vote framework via over-the-air computation (AirComp), where one voting process is performed locally at the wireless edge by taking advantage of Bernoulli coding while the other is operated over-the-air at the central edge server by utilizing the waveform superposition property of the multiple-access channels. We comprehensively analyze the proposed framework on the impacts including Byzantine attacks and the wireless environment (channel fading and receiver noise), followed by characterizing the convergence behavior under non-convex settings. Simulation results validate our theoretical achievements and demonstrate the robustness of our proposed framework in the presence of Byzantine attacks and receiver noise.
摘要
分布式学习已成为智能应用领域的扩展 Computational parallelism 方法之一,从互联网东西 (IoT) 到自动驾驶和医疗行业。这篇论文研究了无线数据中心网络中的分布式学习,该网络包括中央边缘服务器和多个边缘工作者,共同训练共享全球模型,并且从并行计算中受益。然而,分布式结构导致学习过程中的容易受到故障和恶意攻击,以及由 periodic 信息交换过程引起的严重通信和计算开销。为了在存在恶意攻击情况下实现快速和可靠的模型聚合,我们提出了基于签名随机梯度下降 (SignSGD) 的层次投票框架,该框架通过 wireless 边缘上进行本地 Bernoulli 编码,而在中央边缘服务器上通过多ступChannel 的波形重叠性特性进行无线计算。我们系统分析了提议的框架,包括恶意攻击和无线环境(通道抑降和接收噪声)的影响,然后对非拟合情况进行分析。实验结果证明我们的理论成果,并在存在恶意攻击和接收噪声情况下展示了我们的提议框架的可靠性。
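The Byzantine tolerance in the framework above ultimately rests on SignSGD with majority voting: each worker contributes only the sign of its gradient, so a malicious worker can at worst flip one vote per coordinate. The sketch below shows the plain (non-over-the-air) version of that aggregation with a few sign-flipping Byzantine workers; the linear model, data, and attack are toy assumptions, and the AirComp and hierarchical-vote aspects are not modeled.

```python
import numpy as np

def worker_sign_grad(w, X, y, byzantine=False):
    """Sign of the local least-squares gradient; Byzantine workers flip every sign."""
    g = 2 * X.T @ (X @ w - y) / len(y)
    s = np.sign(g)
    return -s if byzantine else s

def signsgd_majority(w, workers, lr=0.01):
    votes = np.stack([worker_sign_grad(w, X, y, byz) for X, y, byz in workers])
    return w - lr * np.sign(votes.sum(axis=0))       # coordinate-wise majority vote

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
workers = []
for i in range(9):
    X = rng.normal(size=(100, 10))
    workers.append((X, X @ w_true + 0.1 * rng.normal(size=100), i < 2))   # 2 of 9 are Byzantine
w = np.zeros(10)
for _ in range(300):
    w = signsgd_majority(w, workers)
print("error:", np.linalg.norm(w - w_true))          # honest majority keeps the update on track
```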
Parallel Log Spectra index (PaLOSi): a quality metric in large scale resting EEG preprocessing
results: 这 paper 的结果表明,PaLOS 存在可能导致不正确的连接分析结果,并且提出了一种基于 common principal component analysis 的 PaLOS index (PaLOSi),可以检测 PaLOS 的存在。PaLOSi 的性能在 30094 个 EEG 数据集上进行了测试,结果显示 PaLOSi 可以检测不正确的预处理结果,并且具有较好的Robustness。Abstract
Toward large scale electrophysiology data analysis, many preprocessing pipelines are developed to reject artifacts as the prerequisite step before the downstream analysis. A mainstay of these pipelines is based on the data driven approach -- Independent Component Analysis (ICA). Nevertheless, there is little effort put to the preprocessing quality control. In this paper, attentions to this issue were carefully paid by our observation that after running ICA based preprocessing pipeline: some subjects showed approximately Parallel multichannel Log power Spectra (PaLOS), namely, multichannel power spectra are proportional to each other. Firstly, the presence of PaLOS and its implications to connectivity analysis were described by real instance and simulation; secondly, we built its mathematical model and proposed the PaLOS index (PaLOSi) based on the common principal component analysis to detect its presence; thirdly, the performance of PaLOSi was tested on 30094 cases of EEG from 5 databases. The results showed that 1) the PaLOS implies a sole source which is physiologically implausible. 2) PaLOSi can detect the excessive elimination of brain components and is robust in terms of channel number, electrode layout, reference, and the other factors. 3) PaLOSi can output the channel and frequency wise index to help for in-depth check. This paper presented the PaLOS issue in the quality control step after running the preprocessing pipeline and the proposed PaLOSi may serve as a novel data quality metric in the large-scale automatic preprocessing.
摘要
大规模电physiology数据分析中,许多预处理管道被开发出来拒绝噪声作为下游分析的前提步骤。主流的预处理管道基于数据驱动方法---独立组件分析(ICA)。然而,对预处理质量控制的努力不多。在这篇论文中,我们仔细注意到,在运行基于ICA的预处理管道后,一些主体显示了相似的多通道峰谱特征(PaLOS),即多通道峰谱的强度相对彼此成比例。我们首先描述了PaLOS的存在和其对连接分析的影响,然后构建了其数学模型,并基于共同主成分分析提出了PaLOS指数(PaLOSi)来检测其存在。最后,我们测试了PaLOSi在5个数据库中的30094个EEG样本。结果表明:1)PaLOS存在唯一的源,这是生物学上不可能的。2)PaLOSi可以检测预处理过程中的质量问题,并且在通道数、电极布局、参照、其他因素等方面具有稳定性。3)PaLOSi可以输出通道和频率 wise的指数,帮助进行深入的检查。本文描述了预处理管道后的质量控制步骤中PaLOS问题,并提出了PaLOSi作为大规模自动预处理中的新数据质量指标。
Supporting UAVs with Edge Computing: A Review of Opportunities and Challenges
results: 研究发现,通过边缘计算可以提高无人机的任务完成速度、能效性和可靠性,并且可以应用于多个领域和行业。Abstract
Over the last years, Unmanned Aerial Vehicles (UAVs) have seen significant advancements in sensor capabilities and computational abilities, allowing for efficient autonomous navigation and visual tracking applications. However, the demand for computationally complex tasks has increased faster than advances in battery technology. This opens up possibilities for improvements using edge computing. In edge computing, edge servers can achieve lower latency responses compared to traditional cloud servers through strategic geographic deployments. Furthermore, these servers can maintain superior computational performance compared to UAVs, as they are not limited by battery constraints. Combining these technologies by aiding UAVs with edge servers, research finds measurable improvements in task completion speed, energy efficiency, and reliability across multiple applications and industries. This systematic literature review aims to analyze the current state of research and collect, select, and extract the key areas where UAV activities can be supported and improved through edge computing.
摘要
过去几年,无人机(UAV)在传感器能力和计算能力方面取得了显著进展,使其能够高效地完成自主导航和视觉跟踪等应用。然而,计算复杂任务的需求增长快于电池技术的进步,这为利用边缘计算进行改进创造了空间。在边缘计算中,边缘服务器通过有策略的地理部署,可以实现比传统云服务器更低的响应延迟;同时,由于不受电池限制,这些服务器能够保持优于无人机本身的计算性能。研究表明,将无人机与边缘服务器相结合,可在多个应用和行业中带来任务完成速度、能效和可靠性方面的可测量提升。本系统性文献综述旨在分析当前研究现状,收集、筛选并提炼可以通过边缘计算支持和改进无人机活动的关键领域。
Deep Learning Based Detection on RIS Assisted RSM and RSSK Techniques
results: Monte Carlo simulate results show that B-DNN可以与最大可能性(ML)相比,并且在比较于匀速检测器(Greedy detector)的情况下,提供了更好的检测性能。Abstract
The reconfigurable intelligent surface (RIS) is considered a crucial technology for the future of wireless communication. Recently, there has been significant interest in combining RIS with spatial modulation (SM) or space shift keying (SSK) to achieve a balance between spectral and energy efficiency. In this paper, we have investigated the use of deep learning techniques for detection in RIS-aided received SM (RSM)/received-SSK (RSSK) systems over Weibull fading channels, specifically by extending the RIS-aided SM/SSK system to a specific case of the conventional SM system. By employing the concept of neural networks, the study focuses on model-driven deep learning detection namely block deep neural networks (B-DNN) for RIS-aided SM systems and compares its performance against maximum likelihood (ML) and greedy detectors. Finally, it has been demonstrated by Monte Carlo simulation that while B-DNN achieved a bit error rate (BER) performance close to that of ML, it gave better results than the Greedy detector.
摘要
“弹性智能表面”(RIS)被视为未来无线通信技术的重要一环。近期,有许多研究将RIS与空间变化(SM)或空间移动键(SSK)结合以实现频率和能源效率的平衡。本研究使用深度学习技术进行RIS-aided SM/RSSK系统中的检测,具体是将传统SM系统扩展到RIS-aided SM系统。通过使用神经网络的概念,本研究专注于使用堆层神经网络(B-DNN)进行检测,并与最大可能性(ML)和探测器进行比较。最后,通过 Monte Carlo 模拟,发现B-DNN对比于ML的比较好,并且在比较探测器时表现更好。
Dynamic Resource Management in Integrated NOMA Terrestrial-Satellite Networks using Multi-Agent Reinforcement Learning
paper_authors: Ali Nauman, Haya Mesfer Alshahrani, Nadhem Nemri, Kamal M. Othman, Nojood O Aljehane, Mashael Maashi, Ashit Kumar Dutta, Mohammed Assiri, Wali Ullah Khan
for: The paper addresses the resource allocation challenges of integrated satellite-terrestrial networks.
methods: The paper proposes a resource allocation framework that leverages local cache pool deployments and non-orthogonal multiple access (NOMA) to reduce time delays and improve energy efficiency.
results: 我们的提议使用多代理深度确定性策略梯度算法(MADDPG)优化用户关联、缓存设计和传输功率控制,相比现有方法实现了显著更高的能效和更低的时延。Abstract
This study introduces a resource allocation framework for integrated satellite-terrestrial networks to address these challenges. The framework leverages local cache pool deployments and non-orthogonal multiple access (NOMA) to reduce time delays and improve energy efficiency. Our proposed approach utilizes a multi-agent enabled deep deterministic policy gradient algorithm (MADDPG) to optimize user association, cache design, and transmission power control, resulting in enhanced energy efficiency. The approach comprises two phases: User Association and Power Control, where users are treated as agents, and Cache Optimization, where the satellite (Bs) is considered the agent. Through extensive simulations, we demonstrate that our approach surpasses conventional single-agent deep reinforcement learning algorithms in addressing cache design and resource allocation challenges in integrated terrestrial-satellite networks. Specifically, our proposed approach achieves significantly higher energy efficiency and reduced time delays compared to existing methods.
摘要
Random Sampling of Bandlimited Graph Signals from Local Measurements
results: numerical experiments表明了该方法的效果。Abstract
The random sampling on graph signals is one of the fundamental topics in graph signal processing. In this letter, we consider the random sampling of k-bandlimited signals from the local measurements and show that no more than O(klogk) measurements with replacement are sufficient for the accurate and stable recovery of any k-bandlimited graph signals. We propose two random sampling strategies based on the minimum measurements, i.e., the optimal sampling and the estimated sampling. The geodesic distance between vertices is introduced to design the sampling probability distribution. Numerical experiments are included to show the effectiveness of the proposed methods.
摘要
图信号的随机采样是图信号处理中的基本问题之一。在本文中,我们考虑基于局部测量对 k-带限图信号进行随机采样,并证明只需不超过 O(k log k) 次有放回的测量即可准确且稳定地恢复任意 k-带限图信号。我们提出了两种基于最少测量次数的随机采样策略,即最优采样和估计采样,并利用顶点间的测地距离设计采样概率分布。数值实验表明了所提方法的有效性。
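To make the sampling-and-recovery statement concrete: a k-bandlimited signal lives in the span of the first k Laplacian eigenvectors, so observing it on roughly O(k log k) randomly chosen vertices (with replacement) lets one recover it by weighted least squares in that subspace. The sketch below does exactly that on a toy path graph with a uniform sampling distribution; the letter's optimal and estimated distributions built from geodesic distances are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
# path-graph Laplacian and its first k eigenvectors (the "bandlimited" basis U_k)
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1
_, U = np.linalg.eigh(L)
Uk = U[:, :k]

x = Uk @ rng.normal(size=k)                      # a k-bandlimited graph signal
m = int(np.ceil(k * np.log(k) * 4))              # on the order of k log k measurements
p = np.full(n, 1.0 / n)                          # sampling distribution (uniform stand-in)
idx = rng.choice(n, size=m, replace=True, p=p)   # random sampling with replacement
y = x[idx] + 0.01 * rng.normal(size=m)           # noisy local measurements

# weighted least squares in the bandlimited subspace
w = 1.0 / np.sqrt(m * p[idx])
coef, *_ = np.linalg.lstsq(w[:, None] * Uk[idx], w * y, rcond=None)
x_hat = Uk @ coef
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```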
paper_authors: Christian J. Steinmetz, Thomas Walther, Joshua D. Reiss
for: 这个论文主要是为了提高录音中的听音质量,使用深度学习和信号处理技术。
methods: 这个论文使用了深度学习模型和信号处理技术,将其结合起来实现自动化的听音质量提高。
results: 论文的实验结果表明,使用这种方法可以实现高精度的听音质量提高,并且比深度学习模型更高效和更少噪声。Listening examples are available online at https://tape.it/research/denoiser。Abstract
Noise reduction techniques based on deep learning have demonstrated impressive performance in enhancing the overall quality of recorded speech. While these approaches are highly performant, their application in audio engineering can be limited due to a number of factors. These include operation only on speech without support for music, lack of real-time capability, lack of interpretable control parameters, operation at lower sample rates, and a tendency to introduce artifacts. On the other hand, signal processing-based noise reduction algorithms offer fine-grained control and operation on a broad range of content, however, they often require manual operation to achieve the best results. To address the limitations of both approaches, in this work we introduce a method that leverages a signal processing-based denoiser that when combined with a neural network controller, enables fully automatic and high-fidelity noise reduction on both speech and music signals. We evaluate our proposed method with objective metrics and a perceptual listening test. Our evaluation reveals that speech enhancement models can be extended to music, however training the model to remove only stationary noise is critical. Furthermore, our proposed approach achieves performance on par with the deep learning models, while being significantly more efficient and introducing fewer artifacts in some cases. Listening examples are available online at https://tape.it/research/denoiser .
摘要
“深度学习减声技术已经在录音质量提高方面表现出色。然而,这些方法在音频工程中的应用可能受到一些限制因素。这些因素包括仅适用于语音,无法支持音乐,缺乏实时功能,缺乏可解释的控制参数,运行在较低的� Sampling rate 下,并且会引入错误。另一方面,信号处理减声算法可以提供精确的控制和适用于广泛的内容,但是它们通常需要手动操作以 дости持最佳结果。为了解决这两种方法的限制,在这个工作中,我们提出了一种结合信号处理减声器和神经网络控制器的方法,允许完全自动和高精度的杂声除去,包括语音和音乐信号。我们使用了一系列的对照测试和听觉测试进行评估。我们发现,语音提高模型可以扩展到音乐,但是培训模型只需要去除静止杂声是critical。此外,我们的提案方法可以和深度学习模型的性能相似,同时更高效和更少的错误。听取示例可以在 https://tape.it/research/denoiser 上找到。”
Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion
for: 这项研究旨在探讨多种内容特征在歌声转换(SVC)中的互补作用,并开发一种整合这些特征的扩散式SVC模型,以获得更优的转换性能。
results: 研究表明,通过将多种内容特征集成到SVC模型中,可以获得更高的对象和主观评估表现,比单个内容特征更好。Code和demo页面可以在https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html中找到。Abstract
Singing voice conversion (SVC) is a technique to enable an arbitrary singer to sing an arbitrary song. To achieve that, it is important to obtain speaker-agnostic representations from source audio, which is a challenging task. A common solution is to extract content-based features (e.g., PPGs) from a pretrained acoustic model. However, the choices for acoustic models are vast and varied. It is yet to be explored what characteristics of content features from different acoustic models are, and whether integrating multiple content features can help each other. Motivated by that, this study investigates three distinct content features, sourcing from WeNet, Whisper, and ContentVec, respectively. We explore their complementary roles in intelligibility, prosody, and conversion similarity for SVC. By integrating the multiple content features with a diffusion-based SVC model, our SVC system achieves superior conversion performance on both objective and subjective evaluation in comparison to a single source of content features. Our demo page and code can be available https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html.
摘要
歌声转换(SVC)技术可以让任意歌手演唱任意歌曲。实现这一点需要从源音频中提取与歌手无关的表征,这是一项具有挑战性的任务。常见的解决方案是从预训练的声学模型中提取基于内容的特征(如PPG)。然而,可选的声学模型种类繁多,不同声学模型所产生的内容特征各有什么特点、将多种内容特征结合是否能够互相补充,仍有待探索。受此驱动,本研究考察了分别来自WeNet、Whisper和ContentVec的三种内容特征,探讨它们在可懂度、韵律和转换相似度方面的互补作用。通过将多种内容特征与基于扩散模型的SVC系统相结合,我们的SVC系统在客观和主观评测中均取得了优于单一内容特征的转换性能。我们的演示页面和代码可在 https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html 查看。
A High Fidelity and Low Complexity Neural Audio Coding
results: 该方法与先进的神经音频编码相比,在主观和客观指标上均表现出色,并且可以在桌面和手持设备上进行实时推断。Abstract
Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide-band components and adopts traditional signal processing to compress high-band components according to psychological hearing knowledge. Inspired by auditory perception theory, a perception-based loss function is designed to improve harmonic modeling. Besides, generative adversarial network (GAN) compression is proposed for the first time for neural audio codecs. Our method is superior to prior advanced neural codecs across subjective and objective metrics and allows real-time inference on desktop and mobile.
摘要
Audio coding是实时通信系统中的一个重要模块。神经网络音频编码器可以通过深度神经网络的强大模型和生成能力,将音频样本压缩到低比特率。为了解决高频表达质量不佳和计算成本高的问题,我们提出了一个整合框架,利用神经网络模型宽频成分,采用传统的信号处理技术压缩高频成分,根据听觉知识。受听觉理论的启发,我们设计了基于听觉模型的损失函数,以改善和声模型。此外,我们还提出了基于生成敌对网络(GAN)的压缩方法,这是神经音频编码器中的首次应用。我们的方法在主观和客观指标上胜过先前的先进神经编码器,并允许实时推理在桌面和移动设备上。
for: 这 paper 的目的是提出一种新的浅融合(SF)方法,以利用外部的反向语言模型(BLM)来实现端到端自动语音识别(ASR)系统。
methods: 该论文的方法包括:(1)在解码过程中,使用 BLM 对部分 ASR 假设进行反向迭代打分,并用新计算的 BLM 分数替换上一轮的分数;(2)使用包含部分句子的反向文本训练部分句子感知的 BLM(PBLM),以提高 ISF 的效果。
results: 实验结果表明,使用 ISF 可以在 ASR 系统中提高性能,并且可以避免在解码过程中提前剔除可能的 гипотезы。此外,将 SF 和 ISF 相互结合可以获得更高的性能提升。Abstract
We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR hypotheses in the backward direction (i.e., from the possible next token to the start symbol) during decoding, substituting the newly calculated BLM scores for the scores calculated at the last iteration. To enhance the effectiveness of this iterative SF (ISF), we train a partial sentence-aware BLM (PBLM) using reversed text data including partial sentences, considering the framework of ISF. In experiments using an attention-based encoder-decoder ASR system, we confirmed that ISF using the PBLM shows comparable performance with SF using the FLM. By performing ISF, early pruning of prospective hypotheses can be prevented during decoding, and we can obtain a performance improvement compared to applying the PBLM as post-processing. Finally, we confirmed that, by combining SF and ISF, further performance improvement can be obtained thanks to the complementarity of the FLM and PBLM.
摘要
我们提出了一种新的浅合并(SF)方法,利用外部的反向语言模型(BLM)来实现端到端自动语音识别(ASR)。BLM具有与前向语言模型(FLM)的 complementary 特性,其合作效果已经通过重新评分ASR假设来确认。在我们的SF中,我们在解码过程中逐渐应用BLM于部分ASR假设,在反向方向(即从可能的下一个单词到开始符)进行迭代,并将每轮计算的BLM分数替换为上一轮计算的分数。为了增强ISF的效果,我们使用了倒转文本数据来训练一个具有部分句子意识的BLM(PBLM)。在使用了注意力基于encoder-decoder ASR系统的实验中,我们证明了ISF使用PBLM可以与SF使用FLM相比。通过执行ISF,在解码过程中可以避免早期淘汰可能的假设,从而获得性能提升。最后,我们证明了,通过将SF和ISF结合使用,可以增加性能的提升,这是因为FLM和PBLM之间存在 complementarity。
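The scoring rule behind (iterative) shallow fusion is simple even though the full decoder is not: each partial hypothesis is scored by the ASR model plus a weighted language-model term, and for a backward LM the hypothesis is read from the newest token back toward the start symbol. The sketch below shows only that scoring combination with toy log-probability tables; the beam search, the iterative re-substitution of BLM scores, and the PBLM training are all outside its scope, and the vocabulary, weights, and scores are invented for illustration.

```python
import math

# toy log-probabilities; a real system would query neural ASR/BLM models here
ASR_LOGP = {("hello", "world"): -1.2, ("hello", "word"): -1.0}
BLM_BIGRAM = {("world", "hello"): -0.4, ("word", "hello"): -2.5,
              ("hello", "<s>"): -0.3}                  # backward LM: p(previous token | next token)

def backward_lm_score(tokens):
    """Score a (partial) hypothesis right-to-left, ending at the start symbol <s>."""
    seq = list(reversed(tokens)) + ["<s>"]
    return sum(BLM_BIGRAM.get((seq[i], seq[i + 1]), math.log(1e-4)) for i in range(len(seq) - 1))

def fused_score(tokens, lm_weight=0.5):
    return ASR_LOGP.get(tuple(tokens), math.log(1e-4)) + lm_weight * backward_lm_score(tokens)

for hyp in (["hello", "world"], ["hello", "word"]):
    print(hyp, round(fused_score(hyp), 3))
# the backward LM rescues "hello world" even though the ASR score alone prefers "hello word",
# which is why fusing it during decoding avoids pruning such hypotheses too early
```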
Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition
paper_authors: Shahram Ghorbani, John H. L. Hansen
for: 这个研究旨在提高非Native语言speech中的腔调识别和外国腔评估的准确性。
methods: 利用先进的预训练语言标识(LID)和说话人标识(SID)模型的嵌入,以提高非Native语言speech中腔调识别和外国腔评估的准确性。
results: 结果表明,使用预训练LID和SID模型的嵌入可以有效地编码非Native语言speech中的腔调信息;此外,LID和SID编码的腔调信息与从头训练的端到端腔调识别(AID)模型相结合,可以进一步提高腔调识别的准确性。Abstract
Accurately classifying accents and assessing accentedness in non-native speakers are both challenging tasks due to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pre-trained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pre-trained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID and SID encoded accent information complement an end-to-end accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior accuracy in accent identification. Next, we investigate leveraging automatic speech recognition (ASR) and accent identification models to explore accentedness estimation. The ASR model is an end-to-end connectionist temporal classification (CTC) model trained exclusively with en-US utterances. The ASR error rate and en-US output of the AID model are leveraged as objective accentedness scores. Evaluation results demonstrate a strong correlation between the scores estimated by the two models. Additionally, a robust correlation between the objective accentedness scores and subjective scores based on human perception is demonstrated, providing evidence for the reliability and validity of utilizing AID-based and ASR-based systems for accentedness assessment in non-native speech.
摘要
准确地分类不同的口音和讲话风格是一项非常复杂和多样化的任务,特别是在非Native speaker的语音中。在本研究中,我们利用先进的预训练语言标识(LID)和发音标识(SID)模型的嵌入来提高非Native speaker的口音分类和讲话风格评估的准确性。研究发现,使用预训练LID和SID模型可以有效地嵌入语音中的口音/方言信息。此外,LID和SID嵌入的口音信息与从零开始训练的口音标识(AID)模型相结合,可以提高口音标识的准确性。然后,我们 investigate了利用自然语音识别(ASR)和口音标识模型来评估讲话风格。ASR模型是一个端到端的连接式时间分类(CTC)模型,专门使用英文语音训练。ASR错误率和AID模型的en-US输出被用作对象评估风格的标准差分。研究结果表明,两个模型之间存在强相关性,并且对人类对讲话风格的评估也存在robust相关性,这提供了使用AID和ASR基于的系统进行讲话风格评估的可靠性和有效性的证据。